Aishwarya Srinivasan, Fireworks AI: AI Agents in Production: Systems Design, Tuning, and Deployment
Added: 2026-05-13

314 views · 10 · 27:26 · aiconference · Original Release: 2026-05-06

Srinivasan effectively strips away the generative AI hype, reframing autonomous agents as a rigorous systems engineering challenge rather than just clever prompting. It is a rare, pragmatic blueprint for anyone looking to move beyond experimental demos into the messy reality of production-grade deployment.

[00:00:04]So next we have Aishwarya Srinivasan, head of AI developer relations and growth at Fireworks AI, with "AI Agents in Production: Systems Design, Tuning, and Deployment." So welcome, Ash. Come on up.

[00:00:22]>> Well, hi everyone. Before I get into the AI agents part, I would like to ask a question to everyone here.

[00:00:29]Who here has used either ChatGPT or Claude on a daily basis? I'm going to personally find you if you don't raise your hands and ask you why. I hope it's 100% of hands up. Okay. So, who here thinks that ChatGPT or Claude is not an AI agent? Raise your hands.

[00:00:51]Okay.

[00:00:53]Some hands. Okay. Who here thinks that it is an AI agent?

[00:00:58]Okay, not everyone is raising their hands, but okay, somebody's confused.

[00:01:03]Well, the answer is that it is an AI agent and I'll get into why it is an AI agent and how it actually works.

[00:01:10]But before we get started, I'm currently leading developer relations at Fireworks AI. I come from a machine learning and data science background, and I have been building machine learning models since way before GenAI was cool. So I have seen machine learning operations transform into what is LLMOps today.

[00:01:29]So I'm going to be sharing some of the best practices and some of the toolkits that you need, and also talk about how Fireworks AI helps you build and scale these AI agents, because building them with frameworks in a development environment, on your laptop or on a cloud, is completely different from running them in production systems.

[00:01:51]So the first thing I want to start with is that it does appall me, every time I open the model atlas, to see how many new open-source models come up. And it is equally tiring for me, because every time a new open-source model comes up, we are responsible for hosting it on day zero on Fireworks AI. What's really interesting is that it's not just the open-source models being released by the frontier model companies. It is also the versions of these models that go up on Hugging Face, whether that's a model fine-tuned for a specific use case or a distilled student model that is better at specific tasks.

[00:02:29]And one of the biggest reinforcements of this is OpenAI also launching their open-source models.

[00:02:38]This is a company that initially started with open-source models, then moved to proprietary models, and then started launching open-source models again. That reinforces the fact that there is a very strong foundation for why open-source models are going to be the future, and why we are very focused on building our toolkits using open-source models.

[00:02:58]The other very interesting thing that comes up, and something that I get asked all the time, is: why are open-source models better than proprietary models? And if open-source models are good enough, then why do we even have a proprietary model market out there? The simple answer is that at one point it was all about model quality. The open-source models were good, but they were not at the top of the benchmarks, and that's the reason proprietary models existed and everybody was using them. But now, this is a very nice diagram that DeepSeek published when they launched V3, and that was an inflection point. It showed that, from a cost versus quality perspective of running a model in production, open-source models have crossed the point where they work better than the proprietary models out there. After this came DeepSeek R1, which was the first open-source reasoning model out there, and ever since, I've stopped counting how many new open-source models come out every week.

[00:04:01]The latest being Kimi K2 0905. I really wish we could update these diagrams to reflect the latest models out there.

[00:04:13]Now, the part I wanted to talk to you about is the AI tools that we are using, agentic or not.

[00:04:21]Looking at AI toolkits, or generative AI toolkits, from a model perspective is very much in the past. It is not an API call to a model, getting a response, and that's the end of it. It's not one-shot anymore. We have end-to-end agentic systems. And this is the simplest way that I could draw and show the design patterns for agents. The first is reflection. A simple example: when you ask ChatGPT a question, it thinks for a while, using its thinking mode rather than its non-thinking mode.

[00:04:53]It verifies the output that it's going to give you and then gives you the output.

[00:04:56]That's reflection.
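A minimal sketch of the reflection pattern described above, assuming the draft, critique, and revise steps are separate model calls. The three stand-in functions are hypothetical and exist only so the loop runs offline; the talk does not say how any particular product implements this.

```python
from typing import Callable


def reflect(draft_fn: Callable[[str], str],
            critique_fn: Callable[[str, str], str],
            revise_fn: Callable[[str, str, str], str],
            question: str,
            max_rounds: int = 3) -> str:
    """Draft an answer, then critique and revise it until the critic approves."""
    answer = draft_fn(question)
    for _ in range(max_rounds):
        feedback = critique_fn(question, answer)
        if feedback == "OK":  # the critic found nothing to fix
            break
        answer = revise_fn(question, answer, feedback)
    return answer


# Hypothetical stand-ins for model calls.
def draft(question: str) -> str:
    return "2 + 2 = 5"


def critique(question: str, answer: str) -> str:
    return "OK" if answer.endswith("4") else "The arithmetic is wrong."


def revise(question: str, answer: str, feedback: str) -> str:
    return "2 + 2 = 4"


final_answer = reflect(draft, critique, revise, "What is 2 + 2?")
print(final_answer)
```

The `max_rounds` cap is a design choice: it bounds cost and latency if the critic never approves.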

[00:04:58]The second part is tool use. Every time you ask any of these tools a question, it's not just repurposing the text content that it's been trained on. It's using other APIs and toolkits that it has access to in order to give you the final response you're looking for. If you have not tried this before, there's one feature I use on ChatGPT all the time: if I'm asking for a certain structure, I don't ask for it just as a table. I ask it to dump it in a CSV and give me a downloadable link, and it does that for me. So it is using multiple layers of tool calls underneath, and all of these APIs, to give us the output in the modality we're looking for.
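The CSV example above can be sketched as a small tool registry. This assumes the model emits a JSON object naming a tool and its arguments; that JSON shape, the `rows_to_csv` tool, and the `dispatch` helper are all hypothetical, not any vendor's actual tool-calling format.

```python
import csv
import io
import json
from typing import Callable, Dict


def rows_to_csv(rows: list) -> str:
    """Tool: serialize a list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Registry of tools the agent is allowed to call.
TOOLS: Dict[str, Callable] = {"rows_to_csv": rows_to_csv}


def dispatch(tool_call_json: str) -> str:
    """Parse a tool call emitted by the model and run the matching tool."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"error: unknown tool {call['name']!r}"
    return tool(**call["arguments"])


# Hypothetical model output asking for a table as CSV.
model_output = json.dumps({
    "name": "rows_to_csv",
    "arguments": {"rows": [{"model": "A", "score": 1},
                           {"model": "B", "score": 2}]},
})
csv_text = dispatch(model_output)
print(csv_text)
```

Returning an error string for an unknown tool, rather than raising, lets the model see the failure and try again.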

[00:05:38]The third thing is planning. A lot of the time we are not very explicit about what we are looking for, and we give it a vague goal. In those situations we are not really telling it, "hey, do this." Instead, it needs to take a multi-step thinking approach, gather a certain level of information, and then give us the final output. That's also something we need to build into an AI agent. ReAct is one example. If you have not read it, it's an amazing paper that I would highly recommend. It describes how a model reasons and then acts on that reasoning when it's trying to plan a certain task.

[00:06:18]The final thing is multi-agent systems.
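A minimal sketch of a ReAct-style loop, where the model alternates between a thought, an action (a tool call), and an observation until it produces a final answer. The `scripted_model` function is a hypothetical stand-in for an LLM, and the `Thought:` / `Action:` / `Final Answer:` line format is one common convention assumed here for illustration.

```python
from typing import Callable, List


def calculator(expression: str) -> str:
    """Toy tool: evaluate 'a + b' or 'a * b' on integers."""
    left, op, right = expression.split()
    a, b = int(left), int(right)
    return str(a + b if op == "+" else a * b)


TOOLS = {"calculator": calculator}


def react_loop(model: Callable[[List[str]], str], question: str,
               max_steps: int = 5) -> str:
    """Alternate model steps and tool observations until a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append(step)
        for line in step.splitlines():
            if line.startswith("Final Answer:"):
                return line[len("Final Answer:"):].strip()
            if line.startswith("Action:"):
                name, _, argument = line[len("Action:"):].strip().partition(" ")
                transcript.append(f"Observation: {TOOLS[name](argument)}")
    return "no answer within step budget"


def scripted_model(transcript: List[str]) -> str:
    """Hypothetical stand-in for an LLM: act first, then answer."""
    last = transcript[-1]
    if last.startswith("Observation:"):
        value = last[len("Observation:"):].strip()
        return f"Thought: I have the product.\nFinal Answer: {value}"
    return "Thought: I should multiply.\nAction: calculator 6 * 7"


answer = react_loop(scripted_model, "What is 6 times 7?")
print(answer)
```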

[00:06:21]I really hope you are able to see that diagram behind me, because I'm going to get deeper into that later. It basically shows how you divide a certain task into multiple layers, not just for one model but hopefully across multiple different models, depending on what task you are trying to achieve. Some of them could be smaller models, some could be larger models, some could be multimodal, some could be image models, and so on, depending on how you break down that task.

[00:06:49]Now, there are conflicting views on this. A lot of the research labs do not like the concept of having a primary agent and secondary agents, because when two or three subagents do something similar or overlapping, they might give conflicting outputs, and then you have to sit and think about how to resolve that conflict.

[00:07:12]So it's mostly preferred for parallel tasks, where the agents are not doing anything that overlaps with each other.

[00:07:21]Now, when you're designing an AI agent, whether it's a single AI agent or a multi-agent system, there are seven core components which pretty much stay the same. The first is perception, which is how your model understands your query, how it understands the context you're providing, and how it understands, let's say, the screenshots that you're uploading or the documents that you're throwing at it. The second is reasoning. Now, reasoning can be done in multiple different ways. There are basic techniques like chain of thought.

[00:07:56]There are more advanced techniques like tree of thought or graph of thought. Or it could be using multiple different chains of thought and then aggregating them into a final response.

[00:08:07]But with reasoning, what you're trying to make the model do is break down how it's thinking and approaching a final answer, and then re-check whether that direction is right or not. If you have not seen reasoning in action, one good thing to try is this: most of the chatbots have a thinking tag that you can open up to see how the model thinks through the solution. It's a very interesting thing to try out. I'm also going to show you a QR code to get $50 of credits on Fireworks, so you can use one of the reasoning models there, open up the thinking tag, and see how it reasons through the output.

[00:08:49]The third component is memory, and this is super critical, because a typical model that we have out here has a 128,000-token context window. The recent K2 that came out has 256K. For Llama 4, the smaller one has 1 million and the bigger one has 10 million, but it's not being used as much. But all of these models have a limited token window. Even a model with a 1 million token context window doesn't really have a lot, because as soon as you start incorporating textbooks or large codebases, you're going to exhaust that very soon. But there is information that you want the model to remember. There are patterns that you want the model to remember, both in the short term and the long term. Those are the two memory components that come with any AI agent. And it is extremely important that you build all seven of these components around the model in order to have a very reliable and robust system.

[00:09:49]So it's very important to understand that we are moving from a one-shot LLM call to a sophisticated system. It has an LLM as its brain, but there are other engineering aspects that make it an end-to-end system in itself.

[00:10:07]The fourth part is planning. I did explain how planning works. It's another important part of any agentic system: you need to teach the model how to plan and how to break a goal down into multiple tasks to reach what the user is expecting.
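A minimal sketch of the two memory components mentioned above: a short-term buffer trimmed to a token budget, and a long-term store of facts that are always included. The `AgentMemory` class is hypothetical, and counting tokens by whitespace-separated words is a crude assumption; a real system would use the model's tokenizer.

```python
from collections import deque


class AgentMemory:
    """Short-term rolling buffer plus a long-term fact store."""

    def __init__(self, short_term_token_budget: int):
        self.budget = short_term_token_budget
        self.short_term = deque()  # recent conversation turns
        self.long_term = {}        # durable facts, keyed by name

    @staticmethod
    def count_tokens(text: str) -> int:
        # Crude approximation: one token per whitespace-separated word.
        return len(text.split())

    def add_turn(self, text: str) -> None:
        """Append a turn, evicting the oldest turns when over budget."""
        self.short_term.append(text)
        while (sum(self.count_tokens(t) for t in self.short_term) > self.budget
               and len(self.short_term) > 1):
            self.short_term.popleft()

    def remember(self, key: str, fact: str) -> None:
        self.long_term[key] = fact

    def build_context(self) -> str:
        """Long-term facts first, then whatever recent turns still fit."""
        facts = [f"{key}: {value}" for key, value in self.long_term.items()]
        return "\n".join(facts + list(self.short_term))


memory = AgentMemory(short_term_token_budget=6)
memory.remember("user_name", "Ash")
memory.add_turn("first message here")        # 3 tokens
memory.add_turn("second message is longer")  # 4 more: over budget, oldest evicted
memory.add_turn("ok")
context = memory.build_context()
print(context)
```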

[00:10:23]Then we have tool use, which is about the different kinds of levers you can give the model to actually go and do things rather than just responding in text. Can it go and make certain API calls? Can it pull information from external sources? Can it pull information from the internet? Can it create files in multiple different formats? Can it pull information from your email? Can it pull information from your calendar, and so on? Then the next component is learning. For any AI agent, the first version that you build is not going to be good enough. It needs to keep evolving, and it needs to keep learning from the user responses.

[00:11:02]It also depends on how the user interacts with these models, because the way person one interacts with ChatGPT, and the kind of responses they are looking for, would be very different from person two interacting with the same tool and expecting different kinds of responses.

[00:11:21]So each and every model or each and every agent that you're individually building needs to accommodate to the use case that it's being put in front of.

[00:11:30]The last part is communication, also a crucial component of building this system. Some of the common protocols out there are MCP and the Agent2Agent protocol that was open-sourced by Google. These really help the subagents in a multi-agent system communicate with each other. It is also important to have log traces in order to understand: hey, at what step is my system failing?

[00:11:55]Am I not getting the right response from the model? Is it breaking at a tool call? Is it breaking at planning? Where exactly do we have points of failure?

[00:12:04]And then you can build guardrails around that. You can build fallback options behind that.
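A minimal sketch of the tracing and fallback idea above: each step records its name, outcome, and duration in a trace, and a fallback runs if the primary step fails. The `run_step` helper and both toy tools are hypothetical, and a production system would more likely use a dedicated tracing library.

```python
import time
from typing import Callable, List


def run_step(name: str, primary: Callable, fallback: Callable,
             trace: List[dict], *args):
    """Run a step, fall back if it raises, and record what happened."""
    start = time.perf_counter()
    try:
        result = primary(*args)
        status = "ok"
    except Exception as error:
        result = fallback(*args)
        status = f"fallback ({type(error).__name__})"
    trace.append({"step": name, "status": status,
                  "seconds": time.perf_counter() - start})
    return result


def flaky_search(query: str) -> str:
    raise TimeoutError("upstream search timed out")


def cached_search(query: str) -> str:
    return f"cached result for {query}"


trace: List[dict] = []
plan = run_step("planning", lambda q: f"search: {q}", lambda q: q,
                trace, "agent evals")
result = run_step("tool_call", flaky_search, cached_search, trace, plan)

for entry in trace:
    print(entry["step"], entry["status"])
```

Reading the trace answers the questions in the talk directly: the planning step succeeded, and the failure was at the tool call.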

[00:12:11]Now, the interesting thing is that when we move from a single model or single agent to a multi-agent system, this becomes even more complicated. Now it's not just one model communicating with several things and a user, trying to learn and maintain certain protocols at the same time. You have multiple different models and multiple smaller systems trying to communicate with each other. So at this point it's more of an engineering problem than a question of model intelligence. That's why LLMOps is super critical, and how you scale it out is super critical.

[00:12:49]I'm not going to go deeper into the different components, because the planner agent, the retriever agent, and the executor agent are kind of self-explanatory. The critic agent acts as an LLM-as-a-judge, and you use it to test whether the response from a particular agent is right or not. And then finally we have the memory agent, which acts as a manager for the memory tools that you're using.
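The roles named above can be sketched as functions that pass a shared state object along a pipeline. Everything here is a hypothetical illustration: the `Blackboard` class stands in for the memory agent, the toy `DOCS` corpus for a retrieval backend, and the critic is a simple rule rather than an LLM-as-a-judge.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared state that every agent reads from and writes to."""
    task: str
    plan: list = field(default_factory=list)
    evidence: dict = field(default_factory=dict)
    draft: str = ""
    approved: bool = False


DOCS = {"latency": "measure end to end", "cost": "track tokens"}  # toy corpus


def planner(board: Blackboard) -> None:
    # Split the task into independent, non-overlapping subtasks.
    board.plan = [part.strip() for part in board.task.split(" and ")]


def retriever(board: Blackboard) -> None:
    for step in board.plan:
        board.evidence[step] = DOCS.get(step, "no document found")


def executor(board: Blackboard) -> None:
    board.draft = "; ".join(f"{s}: {board.evidence[s]}" for s in board.plan)


def critic(board: Blackboard) -> None:
    # Reject drafts that rest on a failed retrieval.
    board.approved = "no document found" not in board.draft


def run(task: str) -> Blackboard:
    board = Blackboard(task)
    for agent in (planner, retriever, executor, critic):
        agent(board)
    return board


good = run("latency and cost")
bad = run("latency and pricing")
print(good.draft, good.approved)
print(bad.draft, bad.approved)
```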

[00:13:15]Now, if you think about how these AI agents are being built in practice, truth be told, I am seeing them mostly being used as vertical agents. The reason is fairly simple. First, it's very hard to justify a return on investment for building these agents. Second, because of the nature of generative AI models, it is extremely hard to get a deterministic response from these systems. I was just having a conversation with one gentleman this morning about automating code reviews. It is such a complicated topic if you think about it, because it's not just about how much time automating that code review saves you as an engineering team or as an engineering leader. It is also about the level of risk you're ready to take. How do you evaluate it? Sure, out of 100 code reviews, the model gets 80 of them right, but what about the other 20? Those could lead to critical failures which a human reviewer could have caught, but because your AI agent did not catch them, they are out in production and probably breaking 10 other things. So quantifying that return on investment and building a deterministic system around it is really hard. Building that eval around it is really, really hard. And that's why I'm seeing more vertical agents, fine-tuned for very specific use cases, being built in industry.

[00:14:50]So, as I was saying, agents are being scaled across multiple different dimensions. We are not at a stage where we are using only one model in an agentic AI system. We are using multiple different models. We are customizing the models. We are fine-tuning the models. We are using a mix of small language models. We're using voice agents. It's a very complicated system. And coming back to what I was explaining initially, it's not about the intelligence of the model anymore. It is about how you build that system from a reliability perspective.

[00:15:26]And the way I would break down this entire concept of building a robust AI agent is that it's not just about how you measure the model's performance on benchmarks. It is about the end-to-end latency of the whole system. If you have multiple different points of communication, how do you measure the end-to-end latency?

[00:15:51]When I'm speaking with product teams, and we have been working with teams like Cursor (their Fast Apply is running on Fireworks), DoorDash, Uber, Samsung, and Upwork, for all of these companies, whether they are catering to customers or running agent-to-agent communication, latency becomes extremely important. It becomes a reflection of the quality of the product that you're delivering. Anybody here who used Operator when OpenAI launched it? You did? Okay.

[00:16:28]Wasn't it horribly slow?

[00:16:31]Right. And now, when I use agent mode on ChatGPT, I feel that it's really fast compared to what version one of Operator looked like. So at the end of the day, when you're trying to use any of these systems in production, it's important to think about latency and about the downstream application that it's being deployed in.

[00:16:51]The second thing is that as and when you're scaling, you need to think about the volume of requests it can take at any given point in time. Sure, it works great when you're sending it 10 requests, 100 requests, a thousand requests. But how does it scale when you're trying to run it across multiple regions, when you have users across different countries who are probably hitting your APIs at the same time? So these are, again, cloud basics.

[00:17:16]It's nothing specific to AI agents. These are cloud basics and how your operations pipeline should look.

[00:17:23]Then tailored infrastructure is obviously important, because you're trying to scale it. You want to make sure that it is reliable and available at almost every point in time, and that it doesn't run into failure modes.

[00:17:34]One of the important things that I feel is very very ignored is user perception and trust.

[00:17:41]How many of you really understand how your data is being used when you interact with any of these chat tools?

[00:17:51]You do understand? Okay. Anybody else who understands exactly how your data is being used?

[00:17:57]Now, as users, you might be thinking, okay, I'm not putting in any personal information. You might think that. I'm not getting into whether you're doing that or not. But imagine how this scales up when you start building this in an enterprise. You have so many users using these tools at the same time, and so many points of failure where PII or confidential information can pass through.

[00:18:24]Enterprises are obviously very strict with their policies around how their data leaves their environment, and that's something we take very seriously at Fireworks. We have built a zero data retention policy, where none of the data coming through our API endpoints for the large language models is stored with us. We are not using it to fine-tune the models or do anything else with it. I think that's very important when you're building these toolkits at scale.

[00:18:52]Now let's think about the journey of building these agentic AI systems at scale. We start with experimentation. Then we go into customizing these models. Once you have a reasonably good benchmark, you think about composability. How do you include more toolkits?

[00:19:10]How do you include more modalities of input and output into your model? And then finally, think about scaling. And at the very end I'll also tell you why this is not how you should be thinking about it.

[00:19:23]So from the build perspective, you're thinking about, hey, what should my life cycle of experimenting with these toolkits be? As I said, every single week we have new open-source models come out. And one of the common questions that I get from customers all the time is: how do I keep up with it? When do I know that I need to switch out one model for another? All of these models have hundreds of billions of parameters. Should I be changing them often or not? How do I test whether they're good enough? And one of the challenges that comes with the generative nature of these models is that they're very hard to quantify.

[00:19:59]I will give you a solution for that also at the very end. But you should evaluate your models not just on how they do on the benchmarks that the company or the frontier labs are producing.

[00:20:11]Comparing that with how a model actually works for your use case, and how the users interacting with your tool feel about its quality, is completely different. That's why you need experimentation loops that help you build eval tests and test continuously, not just when you're developing but also when the system is online and live, and you keep monitoring it in production.
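A minimal sketch of such an experimentation loop: score the current model and a candidate on your own test cases, and switch only if the candidate wins by a margin. The two model functions, the test cases, and the 0.05 margin are hypothetical, and exact-match scoring is a simplifying assumption that real evals usually replace with richer graders.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def evaluate(model: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases answered correctly (case-insensitive exact match)."""
    hits = sum(1 for prompt, expected in cases
               if model(prompt).strip().lower() == expected.lower())
    return hits / len(cases)


def should_switch(current_score: float, candidate_score: float,
                  margin: float = 0.05) -> bool:
    """Switch only when the candidate beats the current model by a margin."""
    return candidate_score - current_score >= margin


# Hypothetical use-case-specific test set.
EVAL_CASES: List[Case] = [
    ("Refund window for annual plans?", "30 days"),
    ("Default API timeout?", "60 seconds"),
    ("Supported export format?", "CSV"),
]


def current_model(prompt: str) -> str:
    canned = {
        "Refund window for annual plans?": "30 days",
        "Default API timeout?": "30 seconds",  # wrong on purpose
        "Supported export format?": "csv",
    }
    return canned[prompt]


def candidate_model(prompt: str) -> str:
    return dict(EVAL_CASES)[prompt]


current_score = evaluate(current_model, EVAL_CASES)
candidate_score = evaluate(candidate_model, EVAL_CASES)
switch = should_switch(current_score, candidate_score)
print(current_score, candidate_score, switch)
```

The same `evaluate` call can run on a schedule against live traffic samples, which covers the "keep monitoring in production" part.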

[00:20:37]Now the second thing is also super critical, and I'll tell you a fun story. With a lot of the customers I'm working with, I see that after fine-tuning the out-of-the-box open-source models with the dataset that they have (and not all of them are even using the actual dataset; some use a synthetically generated one), the performance they have seen with the fine-tuned open-source models is way superior to the benchmarks that you see for proprietary models. That's why you see Notion AI running their models on us, because they have been able to customize them. You see Cursor running their Fast Apply, their code completion, on us because they are able to customize it. There are two different ways of customization. I don't want to get into the details, but I will share a blog at the very end that talks about how customization works from both a supervised fine-tuning and a reinforcement fine-tuning perspective. Now the third part is composability. How do you include more modalities, from both an input and an output perspective? Because we as humans, or enterprises, don't just communicate in text formats. There are documents, PDFs, screenshots, images, GIFs, so many different modalities, and having all of them work together is super critical.

[00:21:59]Now a fun metric that I want to share with you is right now we are processing approximately 10 trillion tokens a day.

[00:22:08]That's the amount of requests that we are getting on the platform where we are hosting our large language models.

[00:22:15]And the way that we are scaling it out, and that's what I was getting at, is that we are not dependent on one cloud provider. We are in fact working with eight different cloud providers, including GCP, Amazon, and a bunch of neoclouds, and it's scaled across 18 different regions. That's why you get a very reliable system when you are running the models on Fireworks.

[00:22:41]Another fun story: the way we look at optimizing inference for these large language models comes with this terminology called the 3D optimizer, which optimizes these models across speed, quality, and cost. Because at any given point in time, not everybody's looking for all three. In certain use cases, you're like, okay, speed could be my P1, but it could be something I can slightly compromise on; I want to stay within this budget, and the quality is non-negotiable. On the other hand, in some cases speed is the non-negotiable, where you're like, okay, I'm okay with an increased cost, but my speed has to be this.

[00:23:23]And that is something our team is expert at building, because our team comes from Meta, where they were the core builders and maintainers of PyTorch and scaled this system, and also from Google, where I come from and where I met the co-founder of the company. That is how we are able to scale these large language models across different infrastructure, and scale and customize according to the weights of each particular model. This is also a blog that's available on our website if you want to read more about it.
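The speed, quality, and cost trade-off described above can be illustrated as a constrained choice: state the non-negotiables, then pick the cheapest option that meets them. This is only an illustration of the trade-off, not how the 3D optimizer works; the `Deployment` options and all of their numbers are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    name: str
    tokens_per_second: float
    quality_score: float            # 0..1, from your own evals
    cost_per_million_tokens: float  # USD


# Hypothetical options with made-up numbers.
CANDIDATES = [
    Deployment("small-quantized", 300.0, 0.78, 0.20),
    Deployment("medium", 150.0, 0.86, 0.60),
    Deployment("large", 60.0, 0.93, 2.50),
]


def choose(candidates: List[Deployment],
           min_speed: float = 0.0,
           min_quality: float = 0.0,
           max_cost: float = float("inf")) -> Optional[Deployment]:
    """Return the cheapest deployment meeting every hard constraint."""
    feasible = [c for c in candidates
                if c.tokens_per_second >= min_speed
                and c.quality_score >= min_quality
                and c.cost_per_million_tokens <= max_cost]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.cost_per_million_tokens)


# Quality is non-negotiable and there is a budget; speed can give.
quality_first = choose(CANDIDATES, min_quality=0.85, max_cost=1.0)
# Speed is non-negotiable.
speed_first = choose(CANDIDATES, min_speed=200.0)
# Constraints nobody can satisfy.
impossible = choose(CANDIDATES, min_quality=0.95)
print(quality_first, speed_first, impossible)
```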

[00:23:58]Now, I also want to mention this, in case you're not familiar with it. When you talk about model latency, it's much more than just one metric. There are more critical components to it.

[00:24:12]It's not just the output speed that you're looking at. You're looking at the time to first token, the time to last token, and the overall variance of that output speed. These are some of the metrics that are on a third-party website called Artificial Analysis, and they have been ranking us the top fastest open-source model provider. Plus, we were also ranked the number one inference provider on Hugging Face.
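A minimal sketch of how those latency components can be computed from the arrival times of streamed tokens. The `latency_metrics` function and the timestamps are hypothetical, and the definitions used here (for example, measuring output speed over the decode phase only) are one reasonable choice rather than the exact methodology of any benchmarking site.

```python
from statistics import pvariance
from typing import Dict, List


def latency_metrics(request_sent_at: float,
                    token_times: List[float]) -> Dict[str, float]:
    """Latency breakdown from token arrival timestamps (in seconds)."""
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    decode_time = token_times[-1] - token_times[0]
    return {
        "time_to_first_token": token_times[0] - request_sent_at,
        "time_to_last_token": token_times[-1] - request_sent_at,
        # Tokens generated after the first one, per second of decoding.
        "output_tokens_per_second": (
            (len(token_times) - 1) / decode_time
            if decode_time > 0 else float("inf")),
        # How uneven the streaming is; 0.0 means perfectly steady.
        "inter_token_gap_variance": pvariance(gaps) if gaps else 0.0,
    }


# Five tokens arriving at a steady 0.25 s interval after a 0.5 s wait.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
for name, value in metrics.items():
    print(name, value)
```

Two deployments with the same output speed can feel very different to a user if one has a long time to first token or a high gap variance.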

[00:24:38]So as I mentioned, we have been working with a ton of top enterprises as well as the GenAI startups out there, and we have been helping them optimize their agentic AI stack and build it in a scalable manner, for all of the reasons that I mentioned. Because these are open-source models, we have the ability to customize them to the very use case that you're looking for. We have a team of AI engineers who work with you to customize it for your use cases, build synthetic data if your company has policies against sharing data, and make sure that it is optimized at the inference level for the particular needs that you have.

[00:25:18]Now, the very last thing: you can take a picture and scan the QR code later. There were a lot of things I wanted to discuss, so I thought it's easier to just share some interesting blogs. Eval Protocol is our open-source protocol that we have shared to help you build your evaluation sets and evaluate these models. It comes with a beautiful dashboard so you can track everything and see how the model is performing at every step. The second is around model customizations.

[00:25:46]We recently also launched our vision fine-tuning. So, you can fine-tune your vision models and that blog will give you more details around it. And the last thing is a very interesting tool that we recently launched. It's called chat app.

[00:25:58]It's chat.fireworks.ai.

[00:26:01]You choose two models, enter a query, and see in real time how the responses compare. So it's a very qualitative way of judging how different model responses compare. It's a cool thing you can play around with. And if you're interested in trying out the different models that are available, you can fill this out and you'll get a $50 credit, so you can go test out the models. $50 is a lot of credits for testing out the models.

[00:26:31]Our cost is just under a dollar for a million tokens. So, depending on the model that you choose, you should be able to run a lot of inference on it. And if you are a startup, we recently launched our startup program, where we are offering credits to startups. If you want to be part of the program and be considered for it, you can go and apply here.

[00:26:56]And as I wrap up, I'm sure we don't have time for questions, so you can reach out to me and I'm happy to answer any questions that you have. I will just leave you with a thought: scalability is not an afterthought. It should be something you are thinking about from day one if you're planning on building an agent system. That's all I had. Thank you.
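As a rough worked example of what that credit buys, assuming a flat price of $1.00 per million tokens as an upper bound on "just under a dollar" (actual pricing varies by model, so this is a lower bound on the token count rather than a quote):

```python
def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> int:
    """How many tokens a budget buys at a flat per-million-token price."""
    return int(budget_usd / price_per_million_usd * 1_000_000)


# Assumed flat price: $1.00 per million tokens.
tokens = tokens_for_budget(50.0, 1.00)
print(f"{tokens:,} tokens")
```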
