Observability for AI agents rests on three pillars (logs, traces, and metrics) that enable accountability, debugging, and continuous improvement. Logs capture events like tool calls and reasoning, traces provide end-to-end records of agent runs, and metrics track system health and performance over time. Evaluation involves creating datasets with inputs and ground-truth outputs, then running scorers (both deterministic scorers and LLM-based judges) to measure agent quality. This creates a feedback loop where developers can identify failure modes, make targeted improvements to system prompts or agent design, and systematically improve agent behavior over time.
Deep Dive
Monitor, Debug, and Evaluate Agents with Mastra Observability and Studio
Hello everyone and welcome to this workshop in which you will learn how to monitor, debug, and evaluate your Mastra agents using a combination of Mastra Observability and Mastra Studio.
If you've been to a Mastra workshop before, then welcome back. We're glad to see you. And if you're new here, it's fantastic that you're joining us. We host these sessions every week to help you build more capable agents. We do that by showing you the latest features in Mastra that you can utilize, as well as sharing our expertise as the team behind the framework, oftentimes helping our customers and users build advanced agents. We're in this kind of unique vantage point where we get to see the patterns that work and then teach them to you in these workshops. And with so much happening in the world of agents and agentic development, it can feel hard to keep up, like things are a bit all over the place. That's why we like to host these workshops regularly, like clockwork, at the same time and the same place every Thursday. You can see a list of upcoming workshops at mastra.ai/workshops.
I hope you register for future ones and then next time I can welcome you back.
In this workshop, you're going to learn four fundamental things. We're going to first touch on why observability is essential. I'm going to make the case that it's not plausible to launch a production agent without observability.
Then we're going to look at how to monitor and debug your Mastra agents using the Mastra framework's observability system, as well as Mastra Observability, which is part of the platform. And if that distinction sounds a bit unclear, we're going to make that crisp by the end of this workshop as well. We're also going to touch on how to score your production traces using Studio. Another word for score might be eval. And then finally, we're going to bring everything together, use the framework in conjunction with Studio and observability, and show you what a real production feedback loop looks like.
The really cool thing about coming to a workshop like this that's live compared to reading an article or watching a YouTube video is that we get to interact with one another. So, if you haven't already, I hope you say hello in the chat.
Hey Thor, good to see you, and Michael from Vancouver. Welcome indeed. And let us know your questions throughout. We often find that we can give the best answers when the question is kind of close to the thing we're presenting, then maybe we can even steer the workshop and update our demo to show you exactly what you're trying to figure out.
Hey Nick, glad to see you.
I'm your host, Alex Burke. I look after developer experience and education at Mastra, and today I'm joined by Joel Smith. Hey Joel, welcome to the workshop. I think this is your first appearance in the Mastra universe of YouTube and workshops and things. It's great to have you.
Yeah, I think that's right. Thanks for having me, Alex. I'm excited to be here.
Joel is our head of platform, spearheading a lot of our new products that you might have seen, and has a wealth of experience across different products and developer tools over the last decade and more.
Joel, you also worked at Gatsby many moons ago, which I think is so interesting because Mastra was founded by the Gatsby team. Many of the original Gatsby team came to Mastra in the first year or two to help get it off the ground. And even still, it feels like more people from the Gatsby world have joined Mastra to help take this to the next level. I'm kind of curious, what was your perception of Mastra before you joined, and what made you excited to join the team and work on the platform?
Yeah, great question, Alex. My initial reflection there is that I feel pretty lucky to get a second run at something that I really enjoyed doing: working with the Gatsby team at such an interesting moment in the front-end space, when a lot of paradigms were being built. Server versus static, DX, preview URLs, all of that stuff was really in that moment, and we got to participate in that and push the envelope on a lot of interesting stuff. So my perception from the outside was, what a great team, bringing the band back together to play some music again, if you're into that sort of thing.
And my perception was that there's a lot of similarity between providing a framework that gives a ton of value to an open source community and then working with that community to figure out what additional services and products you need to be successful when you're using that framework. So it's really exciting to be able to do that again, at another paradigm moment in the software development world where every month, maybe every week, maybe every day, we're seeing a new pattern emerge: chatbots to agents to harnesses, loops and heartbeats, and all of the interesting stuff that's come out in even the last 6 to 9 months.
So it's cool to be building in that frontier space where we're all really figuring out how to build these things and make production agents that work, do interesting things, and are increasingly autonomous and powerful. The sky really feels like the limit right now.
Even though it's a whole new space, that does feel like a similarity, right?
Saying that Gatsby and Mastra are both frameworks.
And I honestly never thought about it this way until Sam made this point, which is that in a brand new world where things are always moving and you don't even necessarily know what's at your disposal, a framework is kind of an education tool, right? Because it brings in all the core building blocks you need and then gives you a white space to fill in with your own logic. And I think that's the way in which Gatsby and Mastra feel most similar. And then, of course, once you've built the thing, you also need a way to deploy and operate and improve the thing. And while Mastra has historically been focused on the framework part, we're now venturing into building a platform alongside the framework. And that's a very exciting idea.
Yeah, absolutely. That's exactly right, and it's part of what we'll talk about later with observability in the framework and observability on the platform. We really try to take the same approach that the framework takes, where you've got all these really powerful composable pieces. So you can run just the agent part of Mastra inside another framework or another server paradigm.
You can take any piece of it, the Studio, any of that, and run it independently. Similarly with observability, you can run that on its own, put it in whatever stack you want, hook it up to whatever OTel database you want, and it just works. We think about the platform in a similar way, which is that you can really pick off the pieces you need: gateway, memory in gateway, observability, server, Studio. Whatever piece you need that is not part of your core stack, we'll handle that for you. So it's really like a buffet of options there.
Well, let's get into some demos. I'm excited to show people and bring this to life a bit. By the way, Michael wrote that his personal website still runs on Gatsby. Thanks for joining us in the nostalgia.
I'm not sure if I would >> [laughter] >> I suppose it might need an update. I'm not sure.
It likely does.
Okay, so observability. I told you I was going to make the case that it's essential.
And I want you to look at these questions and think, suppose you're working on an agent and you ship it into production and a user or a stakeholder or maybe your boss asks any of these questions, how do you answer them?
The key is observability.
And I think the key word here is accountability because when you put an agent out into the world, you can't necessarily outsource accountability to the model. You assume responsibility for everything the agent, with all its agentic reasoning capabilities, does.
Observability is the key infrastructure that helps you see what happened, diagnose what went wrong, and fix it.
The other aspect of observability that's worth noting off the bat is that in traditional systems, users would often take a fairly straight path through the product, right? Fill in this form, use this interface, click this button. With agentic applications and all their reasoning capabilities, users will use agents in surprising ways that you can't possibly anticipate. There's no wrong way to use an agent. And this is a little bit scary, right? Because what if they push against the upper limits of what's possible and maybe trigger the agent to do something it shouldn't, something for which you're accountable? But it's also an opportunity, because it might be that users realize that if they use this combination of tools and leverage the context your agent has, they can solve a use case you hadn't really anticipated, and that could potentially be something that you want to double down on both in your product and in your product marketing. And so, collecting this type of data, which we'll get more specific about in a moment, is key.
And so, there are three pillars to observability traditionally. Some people might argue there are a few more these days. We'll focus on the three fundamental ones, which are logs, traces, and metrics. Logs, I imagine, will be quite familiar because they come from the old world of software engineering. You might log things like tool call results, some internal reasoning, prompts if they were generated dynamically, and perhaps any kind of logging around your server.
There are also traces, which are end-to-end records of how a request travels through a distributed service.
This is the way you look at a particular agent run and dig into exactly what happened. Because remember, unlike an HTTP request, which hits a server and comes back, an agent might call a tool, which then calls a workflow, which then might call a sub-agent, which might then fetch some context. And if you're trying to debug and understand a particular agent run, you don't just need this slew of data in the form of logs; you need to draw a tidy line between them and connect the dots to actually arrive at an answer. Because collecting logs and traces for the sake of it, that's not the objective of observability. The objective of observability is to make sense of that data to improve the agent over time.
I'll show you an example of a trace in just a moment.
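To make the idea of a trace concrete, here is a minimal TypeScript sketch of how a flat list of spans can be reassembled into the tree a trace viewer shows. The `Span` shape, the sample data, and the `renderTrace` helper are all hypothetical, invented for illustration; they are not Mastra's trace schema or API.

```typescript
// Hypothetical span shape for illustration; not Mastra's actual trace schema.
interface Span {
  id: string;
  parentId: string | null;
  label: string;
  durationMs: number;
}

// A flat list of spans, as a backend might store them for one agent run.
const spans: Span[] = [
  { id: "1", parentId: null, label: "agent run: weather-agent", durationMs: 1840 },
  { id: "2", parentId: "1", label: "llm call", durationMs: 910 },
  { id: "3", parentId: "1", label: "tool call: get-weather", durationMs: 320 },
  { id: "4", parentId: "3", label: "workflow step: fetch-weather", durationMs: 300 },
];

// Rebuild the parent/child tree and render it as indented lines.
function renderTrace(all: Span[], parentId: string | null = null, depth = 0): string[] {
  return all
    .filter((s) => s.parentId === parentId)
    .flatMap((s) => [
      `${"  ".repeat(depth)}${s.label} (${s.durationMs}ms)`,
      ...renderTrace(all, s.id, depth + 1),
    ]);
}

const traceLines = renderTrace(spans);
console.log(traceLines.join("\n"));
```

The parent pointer on each span is what turns a pile of events into "a tidy line": every tool call, workflow step, or sub-agent call can be placed under the run that caused it.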
The third and final pillar is metrics, which are effectively measurements of a system's health and performance over time. So, this could be token usage for the day or for the week, and of course the cost of inference for those tokens over a period of time as well. This can also veer into monitoring territory. Observability and monitoring are two separate ideas, but they certainly have some overlap. The idea is that if you monitor the latency of your agent, you might realize that performance is degrading. Another aspect of monitoring that's relevant to agents is online evals. This is where you run scorers. These can be deterministic scorers that produce a number that lets you know if the agent run was successful or if it unexpectedly failed. You can also use an LLM as judge to score traces, and we'll talk more about this. There are lots of use cases, but in monitoring territory, it can present a graph and a trend of how things are going, so that you can catch drift from the expected behavior before your users do.
And then what do you do? You go into the traces to figure out what happened and take it from there.
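As a rough illustration of metrics and a deterministic scorer, the sketch below aggregates a few recorded runs. Everything in it is an assumption made for the example: the `RunRecord` fields, the `mentionsTemperature` scorer, and the sample data are hypothetical and do not reflect Mastra's scorer API.

```typescript
// Hypothetical record of one agent run; field names are assumptions.
interface RunRecord {
  output: string;
  latencyMs: number;
  totalTokens: number;
}

// Deterministic scorer: 1 if the answer contains a temperature such as "18°C", else 0.
function mentionsTemperature(run: RunRecord): number {
  return /\d+\s?°[CF]/.test(run.output) ? 1 : 0;
}

// Aggregate per-run values into the kind of numbers a metrics dashboard plots.
function summarize(runs: RunRecord[]) {
  const n = runs.length;
  const avgScore = runs.reduce((sum, r) => sum + mentionsTemperature(r), 0) / n;
  const avgLatencyMs = runs.reduce((sum, r) => sum + r.latencyMs, 0) / n;
  const totalTokens = runs.reduce((sum, r) => sum + r.totalTokens, 0);
  return { avgScore, avgLatencyMs, totalTokens };
}

const runs: RunRecord[] = [
  { output: "It is 18°C and cloudy in London.", latencyMs: 1200, totalTokens: 450 },
  { output: "Sorry, I could not fetch the weather.", latencyMs: 3000, totalTokens: 150 },
];

const summary = summarize(runs);
console.log(summary);
```

Tracked over time, a falling average score or a rising average latency is the signal that sends you back into individual traces.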
I don't want to spend too much time on slides in this workshop. I want to get straight into the demo. I just want to give you the minimum information necessary so we have a shared language and an orientation.
The way observability works in Mastra has two answers.
The first is that there is the Mastra open-source TypeScript framework, the one you likely use today.
It has a very powerful observability system built-in. Naturally, as Mastra is the layer that is interfacing with the inference providers and orchestrating your workflows and managing the agent loop, it's the most natural insertion point for things like traces and spans and logs. And of course, we have an observability system built-in with the hooks necessary so that anything you need, you can access.
But ideally, you don't have to, right?
Because we present in, for example, Studio a UI where you can dig in and see these things. But it is your data and you can do with it as you need to.
This data includes traces, logs, and metrics. And apart from instrumenting your Mastra agent, which by and large happens automatically, you then need to export that telemetry data somewhere.
The nature of any production system at scale is that you're talking about an enormous amount of data. I would argue that for an agent, that's even greater just due to the nature of how agents work. The game has kind of changed, right? Agents can run on a long time horizon. Agents might call other agents. You don't know.
It all takes a different path, and so you end up with a lot of data, and you need somewhere to export it that is built for high volume, but also has the UI and the tools necessary for you to make sense of that data and fulfill the objective of observability, which is to actually use that data for something.
We have exporters for a bunch of providers you likely already know, and I'll show you a couple of examples throughout this workshop. We've also recently launched Mastra Observability, which is a destination for your Mastra framework's telemetry data. You can add a Mastra Platform exporter, since observability is part of Platform, and then all of these traces, logs, and metrics will be exported to Mastra Observability. We use ClickHouse under the hood, so it's built for huge scale.
And the really powerful thing here is that you can actually connect those traces to Studio. We'll get into this and show you some examples to run scorers and create data sets. And then you can also share access to this with your team. Of course, if you put stuff in a database, it can live in the cloud, you can share access, but you might not expect someone who isn't familiar with databases to poke through that data.
Because all of this, as you will see, is available inside of Mastra Platform, you can invite teammates and bring them in on the journey.
So, what I'd like to do, and I'll just take a quick moment to look at the chat as well in case you have any questions, is give you a quick example of the Mastra framework's observability system in isolation, and then we'll connect it to Mastra Observability to close the loop.
What do you think, Joel? How does that sound?
All right, sounds great.
Okay.
So, I've already created a brand new Mastra application.
Let me just put on dark mode before I get canceled. I've created it using the create-mastra CLI, so this will likely feel familiar to you if you've started a Mastra agent before. It's the default weather agent.
And just as a quick recap, an agent in Mastra is essentially a model with some tools and a system instruction that guides the model to use those tools and achieve a specific open-ended outcome.
And then we can run the agent inside of studio using studio as a local development tool.
Here we can see all of the agents on our Mastra instance. Here's that weather agent and we can start to interact with it.
Before I run the agent, however, I'm just going to make this a little bit smaller to focus on the code. I want to show you the Mastra instance. This is where you define your Mastra project and it's where you configure things like observability.
And as you can see, our Mastra instance already has an observability config. At the moment, it just has one exporter, which is a Mastra storage exporter.
Once you enable the observability config, the Mastra project will automatically instrument things like your agent and workflow runs. You don't have to do it manually.
The framework will take care of that for you. There are also lots of places where you can attach additional metadata to spans, for example.
And in this case, rather than exporting it somewhere for production, we're in development, right? And so we export it to the local observability domain, which uses DuckDB. DuckDB is fantastic in local development and we set it up with your new Mastra project by default. A good mental model for this is that it's very similar in principle to LibSQL in that you can run it without having a background service. The local file lives alongside your project. And just to touch on this as well, a very common question we get is why we use DuckDB or ClickHouse and other columnar databases for this. It's simply because relational databases like LibSQL or Postgres aren't very well suited to the nature of traces and spans. And we can maybe talk about that a little bit more later.
So, since we have this observability set up, when we go to the agent and ask it a question, like what is the weather in London, apart from the agent running, you're also going to see the run in the traces view here.
Let's give it a second to run, actually.
Hey Mino, welcome back.
So, yeah, we do have the agent run like so. And now, when we click into the traces, we can see the trace for the agent run that happened under the hood.
And as I mentioned earlier, I wanted to show you what a trace looks like and how you can interact with it. Well, here's the basic idea. At the top level, we have the agent run for the weather agent. Mastra as a framework has some features built in, like input processors, that you might want visibility into while debugging an agent run. And so, you can see here a message history input processor run. We won't belabor that too much. I think the really interesting part to show here is that when you click on the LLM call, you can scroll down and see exactly what was sent to the LLM.
Here's the content of that initial message and here's the outputs.
This is not too exciting, right? It's a fairly simple instruction. But what this effectively represents is your context window. So, if you're doing some kind of context engineering and you want to see exactly what was sent to the LLM in order to debug something, well, you can do that as well as see the outputs. And we can dig into all the other parts as well, such as the tool call results. And so, it might be that a tool is failing and to establish a pattern, you can dig in like so.
Mastra also supports logs. There are no logs at the moment.
But the way you would do this is by calling a logger function just like you would in any traditional project. You can see that we have the logger set up here using a Pino logger.
And maybe what I'll do is go to the weather workflow. A workflow is a different way to accomplish a similar result in this case. While, to get the weather or make a plan, you could talk to an open-ended agent, if you want more control over each step, you can define a workflow. So, the first step here is to fetch the weather. That just uses deterministic code in this case instead of a tool call. And then it will call a step to plan activities. That calls into a sub-agent to then generate the activities. But suppose I wanted to add some logging here. Well, because we defined that logger over on the Mastra index, we can go to any step in a workflow, pass in mastra, and get a reference to the logger like this.
And then we can call logger.info or logger.error, whatever makes sense. Here I'll just say fetch weather.
But you would likely add some additional data here, like some particular context, for example.
And then when we run the workflow, let's just type in London again.
You're going to see that in the terminal down here, we see the log for fetch weather. This is handy in development, right? But additionally, and I realize I actually need to change one thing.
By default, Mastra observability doesn't store the logs. You need to enable it like so. And you specify a minimum level, because maybe info logs could get a bit noisy. Maybe you only want to send error logs. Remember, every kind of event on an observability platform has some kind of cost associated with it, so you might want to be selective about what you send. But for this demo, we'll just set the info level. So now if I run the workflow, I'll type in London again.
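The effect of a minimum log level can be sketched in a few lines of TypeScript. This is an illustrative stand-in, not Mastra's configuration or logger API: the `LogLevel` ordering, the `LogRecord` shape, and `shouldStore` are hypothetical names chosen for the example.

```typescript
// Hypothetical severity ordering; real loggers define their own level values.
type LogLevel = "debug" | "info" | "warn" | "error";
const severity: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 };

interface LogRecord {
  level: LogLevel;
  message: string;
}

// A record is stored only if it is at least as severe as the configured minimum.
function shouldStore(record: LogRecord, minLevel: LogLevel): boolean {
  return severity[record.level] >= severity[minLevel];
}

const records: LogRecord[] = [
  { level: "debug", message: "resolved city to coordinates" },
  { level: "info", message: "fetch weather" },
  { level: "error", message: "weather API timed out" },
];

const storedAtInfo = records.filter((r) => shouldStore(r, "info"));
const storedAtError = records.filter((r) => shouldStore(r, "error"));
console.log(storedAtInfo.length, storedAtError.length);
```

With the minimum set to info, the debug record is dropped; raising it to error keeps only the failure, which is the cheaper setting the talk suggests for production.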
Uh you saw already that it logs to the console.
What I'm going to do is go to the logs tab here under observability. But if my suspicion is correct, we won't see anything just yet.
My suspicion is not correct. The point I wanted to make is that sometimes with logs and traces, they don't necessarily happen in real time when looking at an observability view because they are flushed periodically in batches. In this case, it seemed to run the fetch weather message.
And the one thing I really want to point out here is that logs on their own, yes, they're valuable, but what is tremendously valuable for debugging and getting an insight into how people are actually using your agents is to take a log and connect it to the trace or the agent run that produced that log.
This would be so handy in the case of debugging an error, right? Because then you can start to dig around and better understand what has happened.
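One common way to connect a log to the run that produced it is a shared trace identifier. The sketch below assumes that approach; the `LogEntry` and `TraceRecord` shapes and the sample data are hypothetical, and the source does not say how Mastra links the two internally.

```typescript
// Hypothetical shapes; real telemetry records carry many more fields.
interface LogEntry {
  traceId: string;
  level: "info" | "error";
  message: string;
}
interface TraceRecord {
  traceId: string;
  rootSpan: string;
}

const traces: TraceRecord[] = [
  { traceId: "t-1", rootSpan: "workflow run: weather-workflow (London)" },
  { traceId: "t-2", rootSpan: "workflow run: weather-workflow (Atlantis)" },
];
const logs: LogEntry[] = [
  { traceId: "t-1", level: "info", message: "fetch weather" },
  { traceId: "t-2", level: "info", message: "fetch weather" },
  { traceId: "t-2", level: "error", message: "geocoding returned no results" },
];

// Index traces by id so each log can be resolved to the run that produced it.
const traceById = new Map(traces.map((t) => [t.traceId, t] as const));

function traceForLog(log: LogEntry): TraceRecord | undefined {
  return traceById.get(log.traceId);
}

// Start from the error log and jump to the trace to see the full run around it.
const errorLog = logs.find((l) => l.level === "error");
const failingTrace = errorLog ? traceForLog(errorLog) : undefined;
console.log(failingTrace?.rootSpan);
```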
Okay. So, this is an example of the Mastra observability system inside of the open-source framework. Nothing about this has anything to do with platform so far.
However, we have a challenge which is that Okay, we're in local development right now. We're chilling. We're sending a few weather workflow calls.
But in production, you'll be talking about an enormous volume of data. And not only that, but you'd like to export it somewhere where you can share it with your team and make use of it, right? For example, adding traces to data sets to produce a golden data set that defines what good looks like, or maybe inviting teammates to help figure out why this went wrong, or did it even go wrong?
Maybe the answer's ambiguous, so you'd require a domain expert's input. And that's where this idea of exporters comes into play.
And so, Mastra supports a bunch of different exporters.
For example, you could come here and set an exporter for Arize, for Braintrust, for Langfuse. You just import the necessary packages. We also have a bridge for OTel sinks as well.
However, I'd like to show you Mastra platform in action today.
And so, let's go to the Mastra website real quick and look at observability.
Metrics, logs, and traces for every agent run in production. This is a part of platform, fundamentally. And so, if you'd like to use this, what you need to do is log in or sign up.
And then, you can create an organization for yourself or for your team, and then we're going to create a project. I'm just going to call this my fantastic agent.
And then, what I'm going to do is take the environment variables and add them to my project. This is how you connect your local Mastra framework project to the Mastra platform, and in this case, observability.
Rather than leak my API keys today, I'm just going to append them to my .env file from the clipboard, like so.
Then, we'll want to enable the Mastra platform exporter.
So, here's the cool thing about the exporter configuration: it's actually an array. You can export traces to multiple destinations.
In this case, I'm going to export them to the local DuckDB I showed you earlier, and I'm going to export them to Mastra platform. You could, if you want to, also export them to your current observability system or another observability system if you're comparing multiple, because every platform, every product has a different way of presenting and treating this data. So, you might be inclined to run multiple at the same time, and because this happens asynchronously in batches, it won't impact your performance at all. And then, you can get a sense of how these things work with your real data.
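The fan-out to several destinations in batches can be sketched as follows. This is a simplified, synchronous model written for illustration only: `TelemetryEvent`, `Exporter`, `createMemoryExporter`, and `createPipeline` are hypothetical names, an explicit `flush()` stands in for the timer-driven background flush a real exporter would use, and none of it is Mastra's exporter interface.

```typescript
// Hypothetical event and exporter shapes for this sketch.
interface TelemetryEvent {
  kind: "span" | "log" | "metric";
  label: string;
}
interface Exporter {
  destination: string;
  exportBatch(batch: TelemetryEvent[]): void;
}

// In-memory destination that records what it receives, standing in for a real backend.
function createMemoryExporter(destination: string) {
  const received: TelemetryEvent[] = [];
  const exporter: Exporter = {
    destination,
    exportBatch(batch) {
      received.push(...batch);
    },
  };
  return { exporter, received };
}

// Buffers events and sends each full batch to every configured exporter.
function createPipeline(exporters: Exporter[], maxBatchSize: number) {
  let buffer: TelemetryEvent[] = [];
  function flush(): void {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    for (const exporter of exporters) {
      try {
        exporter.exportBatch(batch);
      } catch {
        // One failing destination should not block the others.
      }
    }
  }
  function record(event: TelemetryEvent): void {
    buffer.push(event);
    if (buffer.length >= maxBatchSize) flush();
  }
  return { record, flush };
}

const local = createMemoryExporter("local-duckdb");
const cloud = createMemoryExporter("platform");
const pipeline = createPipeline([local.exporter, cloud.exporter], 2);

pipeline.record({ kind: "span", label: "agent run" });
pipeline.record({ kind: "span", label: "llm call" }); // batch of 2 is sent here
pipeline.record({ kind: "log", label: "fetch weather" }); // stays buffered
const receivedBeforeFlush = cloud.received.length;
pipeline.flush();
console.log(receivedBeforeFlush, cloud.received.length);
```

Buffering also explains why telemetry can appear in a UI with a short delay: an event is only visible at the destination once its batch has been flushed.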
So, having set the new environment variable and configured the Mastra platform exporter, I can now go back to Studio running locally and interface with the weather agent again.
This time I'll ask about the weather in New York City, just to distinguish it.
And what we're going to see is that if we head back to platform and our project, my fantastic agent, and I just reload this page, the onboarding page has gone, because it knows we're onboarded; it's received some observability events. You can see that our Mastra project in platform has a few potential components.
Like Joel was saying, you can pick and choose the right products for what it is you're trying to accomplish. We've not deployed studio or server yet, but we'll show you deploying studio a bit later.
We have deployed observability such that now if we click in here, we can see our metrics dashboard. This is what I was talking about earlier, where you see trends over time, like agent usage, cost, token usage, latency, all that good stuff. We're looking at the last 24 hours here, but we can filter by time as well. Imagine this glowing and coming to life with all your real data. I think that's really exciting. And then we can also see in the traces view the traces produced by the agent locally. In this case, we're asking about the weather in New York City. And the same kind of logs are output here, although in production, I might set a higher log level so as to only get error logs or something like that.
I totally get it. If you're looking at this and comparing it to what you saw before, you might not be totally clear what's different. But whereas before we were looking at Studio on localhost, this is on projects.mastra.ai.
You can send an enormous volume here and the logs and the traces will just flow in. And then because this is part of platform, you can actually go to your organization settings, here's org settings and your team tab, and you can invite different team members to come and look at these metrics and potentially dig into traces as well.
We'll show you the ways in which that's useful when I pass it over to Joel in a minute here.
So, we've covered the Mastra framework's observability system, and I hope it's really clear how that connects into Mastra Observability and the advantages that it offers.
I just want to touch on a couple of other little features of observability that I think are worth noting.
The first is that the observability configuration supports output processors. And what I highly recommend you enable is a sensitive data filter.
This will remove things like passwords, tokens, and keys before they go to your observability export destination. This is very, very good hygiene, right? You just don't want that headache, but it's also essential for compliance, right?
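The idea behind a sensitive data filter can be shown with a small recursive redaction function. This is a sketch under assumptions, not Mastra's filter: the `SENSITIVE_KEY` pattern, the `[REDACTED]` placeholder, and the `redact` helper are hypothetical choices for the example, and a production filter would likely also inspect values, not just key names.

```typescript
// Hypothetical list of key names treated as sensitive in this sketch.
const SENSITIVE_KEY = /pass(word)?|token|secret|api[-_]?key|authorization/i;

// Walk any JSON-like value and replace the values of sensitive keys.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

// Example span attributes before they leave the process.
const spanAttributes = {
  user: "ada",
  apiKey: "sk-123",
  nested: { password: "hunter2", city: "London" },
};

const safeAttributes = redact(spanAttributes);
console.log(JSON.stringify(safeAttributes));
```

Running the filter before export means the secret never reaches the observability destination, which is the property that matters for compliance.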
The other thing I would like to show you, although I won't demo it right now, is sampling. As I mentioned previously, you pretty much pay for every event that you send to some kind of platform, right? It's often worth it, but you don't want to be excessive either and spend money you don't have to, or overwhelm the system and create a bunch of noise, so that you have to filter through more data. That's where sampling comes into the mix. There are a few different sampling strategies. One sampling strategy is to just send everything, by the way, but another is to send things based on a ratio. And so in this case, we set the sampling type to ratio and the probability to 10%. And this means that only 10% of traces will be sent to the observability export destination.
You might think, well, I don't know if that sounds like it's not enough or like what if I miss something important? But trust me, when you're looking at an enormous amount of data, while you want the ability to dig in to individual agent runs, what you're mostly going to be interested in are trends. And if your agent has some kind of failure mode, well, 10% is likely enough to surface that failure mode such that you can dig into it and understand what's happening.
If you feel like it's not enough, because it's proportionate, of course, to the number of users you have, maybe you have just a few hundred users, in that case, maybe you send everything.
Well, you can always dial in and adjust this number, as well.
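Ratio sampling itself is a small decision made once per trace. The sketch below mirrors the "ratio" type and 10% probability from the talk, but the `SamplingStrategy` type and `shouldSample` function are hypothetical illustrations rather than Mastra's implementation. The random source is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical strategy type: send everything, or send a fixed fraction of traces.
type SamplingStrategy =
  | { type: "always" }
  | { type: "ratio"; probability: number };

// Decide once per trace whether it is exported.
function shouldSample(
  strategy: SamplingStrategy,
  random: () => number = Math.random,
): boolean {
  if (strategy.type === "always") return true;
  return random() < strategy.probability;
}

const tenPercent: SamplingStrategy = { type: "ratio", probability: 0.1 };

// Simulate 1000 traces with evenly spaced "random" values to count how many are kept.
let kept = 0;
for (let i = 0; i < 1000; i++) {
  if (shouldSample(tenPercent, () => i / 1000)) kept++;
}
console.log(kept);
```

Because the decision is made per trace rather than per span, a sampled run is exported whole, so the traces you do keep can still be inspected end to end.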
And so this is important because it allows us to monitor our agent with the metrics dashboard.
It also allows us to observe what's happening and if a new user comes to us with like an intermittent bug report or anybody asks any of those questions in our first slide, we now have the tool to give them the answer.
But observability doesn't just end there.
It's not just a debugging tool. It's not just a monitoring tool. Observability and in particular traces are so valuable because they give us an insight into how users are actually using the agent. And you might not just want to read this data. You might also want to do something with these traces. And that's exactly where studio comes into the story.
And so Joel, I thought I could hand it over to you to share a little bit about what happens next. You've got this data, you have observability set up. What happens next, and what more can you do?
Yeah, sounds great. Let me give a minute for anybody who might have questions so far. There might have been a few that I missed in the chat there, and I'll get my screen set up for us here.
Are you not hearing me when I talk?
You sound You sound great.
Okay. Well, it's the second time Okay, some people are having trouble. Nick says he can hear. All right. Uh let's keep going then. Haha.
All right. Uh let me share my screen here.
I'm just going to go ahead and share an entire screen.
Okay. So, like Alex mentioned, there's observability, which is great. You can collect a lot of data. You can see what's actually happening um in your system, which is awesome. But that's just observation, and observation without taking any action or improving of anything, it it's it's not really it's not really actually making your product better. It's not making the agentic experience for your users any better. So, what do what do we do with that? Um let's take uh want to kind of talk a little bit about the kind of like full suite of what's available in the platform. So, what Alec showed, and I've got kind of a sample agent who whether I'm running the server and studio locally here, and maybe I deploy that to any deployment target of my choice. Maybe my team's running everything in an EC2 instance in AWS, or I've got some serverless Cloudflare or Vercel, whatever that might be.
And I just kind of need the observability store. I don't want to set all that up. Managing a database sounds hard. So, I've got that project here. But I might actually want to have a quick Studio URL that I can share and start collaborating on. So, in the age of agents and CLIs, we are CLI-first with our deployment strategy for now. Stay tuned on some other options coming soon. But it's as easy as deploying with the CLI. So, let me show a quick demo on that, and then we'll get to... Okay, great. I've got all this great observability data for my agent Lewis here, named after Daniel Day-Lewis, greatest actor of all time.
This is a film recommendation engine. So, I have some metrics, a little bit of metrics. I've sent some traces, and I've also sent some logs.
Yeah, I can try and make the screen a little bit bigger here for both of these.
If that is better.
So, no logs, but I do have some traces in here that I'm sending from my local instance, because I've hooked it up to the cloud back end. So, I'm sending my traces there. My team can collaborate on that. That's great. This is all good, but I can't really work with my team to improve this, look at traces, and run evals or anything. So let's go ahead and do that.
I'm going to do that with the CLI here, and really quick I think I'm just going to check my auth org list.
And see org... I think I just want mastra org.
Okay, that's not the right command.
Let's go for deploy here. I won't fuss around with that. All right, mastra studio deploy, and we're going to push up a Studio for... Great, that's not the org I want, so let's try mastra auth orgs, I believe, list.
Nope, that's not it.
>> You can also copy it from the slug in the URL.
Uh Yeah, okay. We'll see.
Let's go ahead, we'll skip this step, and let's use my Clover agent that's already deployed. I should have... actually, let's just do a little help. mastra help, and see what we can get here.
Why is that being fussy? Let's do pnpm mastra help.
I think something happened with my global install.
Oh no. Oh, do you need to do pnpx maybe?
Let me try that. mastra... I don't think that was quite it, but let's see.
I think it's not liking my help ask there. There we go.
There we go.
Auth, okay. Let's do pnpm mastra auth... Okay, mastra orgs switch.
There we go.
I knew it was in there. So this is kind of an internal feature where we can be in multiple orgs for testing, etc. So most of you are not going to have this.
But all right, great. We've gotten into my organization, which is what we're looking at over on the screen. Thank you for the one clap I got in the emoji reactions there. All right, so let's go ahead and run that studio deploy. Now, because I have an environment variable set here, this should be able to pick out the correct org and project that I want to deploy to. Let's check that real quick. One thing I'm going to look at here is, do I have... I don't have a .mastra-project file. So that's kind of our project configuration file that we use when you're deploying via the CLI. I don't see one there, which means it's going to look in my environment variables. So let's give that a shot. mastra studio deploy, and this should have the right... Let's see. Does that UUID... A2? Yep, that looks correct. So let's go ahead and deploy with those settings.
And this is going to grab my local environment variable file, and it's going to zip it and ship it to our platform, and we'll run a Studio instance there with my Mastra instance all set up as part of it. It won't be... ooh, excellent.
Okay.
Let's go ahead. This is a good check.
I'm trying to ship a local DB and it's not quite working. We have a check on that, because if you're just using a local LibSQL file and you try to ship that into production, you're not going to get any of your memory stored.
It's going to be looking for a file that doesn't quite exist there.
So we'll just skip that for now, just for demo purposes.
Going to skip the preflight.
I thought I had that set up correctly.
Let's maybe bail on this. I think we're getting a little long in the tooth on the deploy example here. I'll figure that out, but we do have one over here, Clover, that I did set up with LibSQL in Turso and that is all set for us. So, let's go check out Studio here to talk through some of the Studio pieces that we did want to focus on.
Okay, so in my example, I was running studio locally, but what we're doing here is we're taking that same studio and deploying it to platform so we can access it behind a URL and share it with the team.
>> That's correct. But to that point, I do need to have a database there. So, if I want to make changes to my agent or I want to do anything interesting, I do need to have that backed by a database, which didn't get configured correctly in my example here.
So, we'll sort that out another time.
But here is one... >> To use Turso, like your project does. That's a good aspect.
>> Yeah, exactly. So, if we go back to my Clover project here and look at it, you can see that it's telling me I do have Turso correctly configured.
I do have a server deployed for this.
So, I have a production-grade API that I could build any kind of agentic interface against, and then I've got Studio, which is our spot for collaboration.
Um plug adapter for observability.
Logs will be absorbed by Splunk.
Say more on that. Are you hoping to send the logs to the Mastra platform, or are you still hoping to send them to Splunk? I think it should work either way, but I'm just curious what the use case is.
Follow up on that. Okay. So, where we started this conversation was, well, what do I do once I've got observability data, right? So, for my Clover, my gardening agent here: my wife and I are kind of newish to the gardening world. We planted a bunch of fruit bushes and trees last year, and we're trying to do vegetables this year. So I thought, hey, I need some help on my watering schedule, I need help on cross planting, what's good in terms of soil quality, all that. So I built a little agent that will answer some of those questions for me, and gave it an MCP server to go check weather and that sort of thing. And like Alex said, I've got all of these traces which tell me what non-deterministic things happened in those agent interactions, and I can click in to see any of those. But if I want to improve the quality of my agent, I want to do things like continually improve it. I want to run evals against it. So let's talk about how we do evals in the Mastra world.
The core concept behind evals, and Alex touched a little bit on this, is the idea of a score.
So a score allows you to take live conversations, so you can run scores against live traces that come in from your users. And again, you can do that on a sampling rate. So you could say, let me just pull up a code example here, on any given agent... I think this is for my Clover agent here. So I've got a couple of different scores, just as examples. One is a traditional completeness score, i.e. how many of the words in the initial prompt were used in the response. So was there a complete response to the initial question?
And I can run that live. I have that set to run at a very high rate, which is probably not recommended, but that allows me to see over time, as a general rule of quality, how that is working. And this is a code score, meaning there's no LLM-as-judge here. It literally is counting words. So, it's pretty cheap to run in terms of compute.
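A word-overlap completeness score like the one described here can be sketched in a few lines. This is a minimal, framework-free illustration under stated assumptions, not Mastra's built-in scorer: the names `distinctWords` and `completenessScore`, the tokenization rule, and the empty-prompt convention are all hypothetical.

```typescript
// Hypothetical sketch of a deterministic "completeness" code score.
// No LLM and no Mastra API is involved; all names are illustrative.

/** Lowercase the text and return its distinct words. */
function distinctWords(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  return words.filter((word, index) => words.indexOf(word) === index);
}

/**
 * Fraction (0 to 1) of distinct prompt words that also appear in the response.
 * An empty prompt scores 1 by convention (an assumption of this sketch).
 */
function completenessScore(prompt: string, response: string): number {
  const promptWords = distinctWords(prompt);
  if (promptWords.length === 0) return 1;
  const responseWords = distinctWords(response);
  const covered = promptWords.filter((word) => responseWords.indexOf(word) !== -1);
  return covered.length / promptWords.length;
}

// Only "have" is echoed back (1 of 6 prompt words), so the score is low even
// though the answer itself may be perfectly fine.
const exampleScore = completenessScore(
  "What plants do I currently have?",
  "You have blueberries and raspberries.",
);
console.log(exampleScore.toFixed(2)); // prints "0.17"
```

Because the function only counts words, it costs no inference to run on every trace, but a low value is a signal to look closer rather than proof of a bad answer.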
I'm not racking up any kind of inference costs or anything. Now, a couple of the other ones I wanted to check, though, were about tool routing. So, it does have a kind of sophisticated set of tools. This agent's got a sandbox. It's got a bunch of MCP server calls that it can make. So, I want to know, is it correctly using those tools? And so, I've got a tool routing LLM-as-judge that says, "Hey, when a user asked a question, did you appropriately go and do that?" So, one example, and I'll show this in a minute, is checking, "Did you go to the sandbox and get the list of plants from that markdown file? Or did you just make it up? Or did you call it from memory?"
Right? So, what happened in that situation? That one, again, I'm running live here.
I'm sampling that at about a 30% clip.
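The idea of sampling live traces at roughly 30% can be illustrated with a small sketch. This is not how Mastra configures sampling; `SamplingConfig`, `shouldScore`, and `expectedScorerRuns` are hypothetical names used only to show the mechanism and its effect on cost.

```typescript
// Hypothetical sketch of sampling live traces for scoring.
// These names are illustrative and are not a Mastra API.

type SamplingConfig = { rate: number }; // 0 = never score, 1 = score every trace

/** Decide whether to score one trace. `random` is injectable for testing. */
function shouldScore(
  config: SamplingConfig,
  random: () => number = Math.random,
): boolean {
  if (config.rate < 0 || config.rate > 1) {
    throw new RangeError("rate must be between 0 and 1");
  }
  return random() < config.rate;
}

/** Expected number of scorer runs for a given trace volume. */
function expectedScorerRuns(traceCount: number, config: SamplingConfig): number {
  return traceCount * config.rate;
}

const toolRoutingSampling: SamplingConfig = { rate: 0.3 };
// On average, about 300 of every 1000 traces would reach the LLM judge.
console.log(Math.round(expectedScorerRuns(1000, toolRoutingSampling)));
```

A cheap code score can afford a rate near 1, while an LLM-as-judge score is usually worth sampling lower, since each run is an extra model call.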
And then registry faithfulness. This one is more specifically, "Did it actually pull in the plants that I listed in my garden? Or is it giving me advice about random stuff that it thinks might be adjacent?" So, a little bit of hallucination checking there. And again, that's sampled at that rate. So, scores can run live or, as I'll show, you can run them in a manual eval loop. Now, how do we do manual eval loops? We do those off of data sets. Think of a data set as a collection of inputs and ground truth outputs.
So, an input is a question that one of your users might ask your agent, or might send to a workflow to kick it off, or send directly to a score. You can experiment against that. So, for instance, for this item, the input is, "What plants do I currently have in my active list?" A good way to check if it's actually going to that sandbox and grabbing my list of plants or not. And then the ground truth is the expected answer back, right? So, that's under that registry category. So, a data set, you can think of it as inputs and outputs. If you're just getting started, one little hack I used when I was learning about data sets in Mastra... Actually, I don't know if that's going to be here. I'll save that for later if we come back to it. But we have added some agentic data set generation. It's also really simple to tell your coding agent, "Hey, can you make me a data set that does ABCD?" And it's all part of the Mastra core SDK. So, it can just go and do that, call all the items, and create it for you.
Pretty simple to do.
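To make "inputs and ground truth outputs" concrete, here is one possible shape for such a data set. It is a sketch only: the field names, the data set name, the second item, and the ground truth wording are placeholders and are not taken from the Mastra SDK or from the demo.

```typescript
// Hypothetical shape for a data set of eval items. Field names are
// illustrative and are not taken from the Mastra SDK.

interface EvalItem {
  input: string;       // what a user would ask the agent
  groundTruth: string; // the expected answer, used by scorers as a reference
  category: string;    // free-form grouping, e.g. "registry"
}

interface EvalDataset {
  name: string;
  items: EvalItem[];
}

const gardenDataset: EvalDataset = {
  name: "clover-garden-checks", // hypothetical name
  items: [
    {
      input: "What plants do I currently have in my active list?",
      // Placeholder text: a real ground truth would list the actual plants.
      groundTruth: "Lists exactly the plants recorded in the garden registry file.",
      category: "registry",
    },
    {
      // Invented second item, included only to show a different category.
      input: "Should I water the garden today?",
      groundTruth: "Checks the weather before recommending a watering plan.",
      category: "tool-routing",
    },
  ],
};

/** Return only the items in one category, e.g. to run a focused experiment. */
function itemsInCategory(dataset: EvalDataset, category: string): EvalItem[] {
  return dataset.items.filter((item) => item.category === category);
}

console.log(itemsInCategory(gardenDataset, "registry").length); // 1
```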
So, let's take this data set. I feel pretty good about it. Let's say it covers all my use cases. It's going to make me feel comfortable, and I can go in and run an experiment on this. I'll click run experiment, but we'll go look at a previous run because it does take a little time. There are a lot of calls it makes, etc. So, I won't bore you with watching a whole experiment run, but we can look at the outputs of it and see the results. Okay. So, again, we have created a bunch of scores. Now, we've got a data set to test against those scores, and we'll have past experiments that we can go back and look at. So, let's go ahead and click run. We're going to select the target.
We're going to do an agent. The specific agent we want is Clover. And then you can pick the scores you want. We'll just select them all for this case, and then let's run that.
All right. So, that is going to kick off that experiment. So, it's in progress.
Nothing there yet. But let's go over to experiments, and we'll look at some of the past ones that we have run.
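Conceptually, the experiment just started does three things: pick a target, feed it every data set item, and apply each selected score to the output. The sketch below shows that loop with a stub in place of the agent. Every name here is hypothetical, and a real agent call would be asynchronous and go through the framework rather than a plain function.

```typescript
// Hypothetical sketch of an experiment: run a target over every data set item
// and apply each selected scorer. A synchronous stub keeps this self-contained.

type ExperimentItem = { input: string; groundTruth: string };
type Target = (input: string) => string;
type ExperimentScorer = {
  name: string;
  score: (item: ExperimentItem, output: string) => number; // 0 to 1
};
type ExperimentRow = {
  input: string;
  output: string;
  scores: { [scorerName: string]: number };
};

function runExperiment(
  items: ExperimentItem[],
  target: Target,
  scorers: ExperimentScorer[],
): ExperimentRow[] {
  return items.map((item) => {
    const output = target(item.input);
    const scores: { [scorerName: string]: number } = {};
    for (const scorer of scorers) {
      scores[scorer.name] = scorer.score(item, output);
    }
    return { input: item.input, output, scores };
  });
}

/** Mean of one scorer across all rows, for comparing runs. */
function meanScore(rows: ExperimentRow[], scorerName: string): number {
  if (rows.length === 0) return 0;
  const total = rows.reduce((sum, row) => sum + (row.scores[scorerName] ?? 0), 0);
  return total / rows.length;
}

// Stub target and scorer, standing in for the agent and a real score.
const stubAgent: Target = (input) =>
  input.indexOf("plants") !== -1 ? "You have blueberries." : "I am not sure.";
const mentionsExpected: ExperimentScorer = {
  name: "mentions-expected",
  score: (item, output) =>
    output.toLowerCase().indexOf(item.groundTruth.toLowerCase()) !== -1 ? 1 : 0,
};

const experimentRows = runExperiment(
  [
    { input: "What plants do I have?", groundTruth: "blueberries" },
    { input: "When should I prune?", groundTruth: "late winter" },
  ],
  stubAgent,
  [mentionsExpected],
);
console.log(meanScore(experimentRows, "mentions-expected")); // 0.5
```

Keeping one row per item, with every score attached, is what makes it possible to click into a single item afterwards and see which score failed.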
So, here's one that I ran. Looks like... >> It's like watching one of those baking shows where they're like, I pull it out of the oven. It's already prepared. Magic.
Yeah, this is a little more Great British Baking... do you call it the Great British Bake Off? This is a little cultural crossover moment here, but, you know, a little more mess. I'm throwing the unset ice cream into the trash sort of thing.
All right, so let's look at the results here. So, we have this experiment that we ran. The target was the Clover Assistant, and you can see it ran against that data set that we looked at. So, let's go take a peek. So, if we take that input item that we looked at, the first one in the data set, what plants do I currently have?
You can see that we ran three different scores against it. LLM tool call, it did not do well at. So, let's go take a look at that.
Okay, so this is great. So, this is showing me, because this was an LLM-as-judge scorer...
Let me see if I can just keep up with the chat here.
Because this was the LLM-as-judge, an LLM actually looked at the input and output and said, "Hey, did you actually do what you said you were going to do?" And in this case it didn't go and get my registry list. It just kind of pulled it out of its hat. Maybe it pulled it from memory, so that's okay in this situation. But that does give us what the input was, and then the output as well. Okay, so what did we actually get out of that?
And then that was the prompt that it came up with for that. And the score prompt wasn't configured there. All right, let me click out of that, but just to show, this was my full evaluation loop on top of this.
And then here's another example with the registry faithfulness. It did a little bit better because the answers were correct there. And in the case where I've got a completeness score, I don't quite have... it just kind of gives me a score on that. It didn't list back. Maybe this is one where I don't quite care if it uses the word "currently" in its response. So, that's all right. I'll take a two out of 10 on that. That's A-okay. So, that's kind of the heart of the whole eval loop.
>> Yeah, exactly, Michael. I think that's more or less what you would want to do here, right? You'd look at it and say, "Hey, I need to go back to my system prompt and be stricter about how I'm telling it which tools to pick up and which ones to use." So, I would go back.
I would make a new version of my agent and I would try that again. So, I don't think I have it set up here on this one, but it's a good example for the editor, where if you add an editor class to your agent in Studio, anybody who's got access to your Studio and has the right permissions can come in and edit. So, I think I have a local example of that that might be interesting to show if we have time, but I know we're coming up on our hour here. So, do you want to leave some space for any more questions?
Yes, I actually think we're going to be talking about it in our newsletter today. So, check your inbox. But basically anything that you can do with observability or evals, data sets, etc., that's all going to be in the command line. So, you can run that all from your Mastra CLI, which means write a couple of skills, pull some skills from the interwebs, etc. But yeah, you can tell your agent how to run all of this and run it with CLI commands. So, totally doable in a local setting or, if you want to get fancy, you can probably do that in a remote setting, too.
Can we go back to the data set for a second? Yeah, absolutely. I'm learning about ground truth, and I kind of had this idea that a ground truth is the answer you expect from the agent, but of course an agent gives a different answer every time. So, a traditional type of test like that wouldn't really work. It looks like the ground truth there is encoding what you want the scorer to align with when it produces the score. Is that right? And how does the scorer read that ground truth and know what to do with it?
Yeah. That's a really good question, and I'll preface this all with a grain of salt, that I'm not the expert on evals or any of this. My way of thinking about ground truth is it's a rubric.
And so, you know, we were talking before we hopped on. I used to be a high school teacher, so I spent a lot of time writing rubrics. I taught writing, so it was a little bit more like the LLM-as-judge, where you're passing a rubric that requires some judgment, requires looking at inputs and saying, "All right, is this accurate? Is this correct? Is this a well-written sentence? Is this a sufficient argument?" etc. Versus a code-based or deterministic ground truth. So, for instance, I think one example we typically have is a translation agent.
So, an agent that would translate from one language to another. And in that case, that can be pretty deterministic, because you know if the agent translated it correctly. You can check that more mathematically, more truthy.
So, your scorer code, like the Clover scorer you built, that reads this ground truth basically.
Yeah. Yeah, it uses that and passes that as context to the LLM-as-judge, like, "Here's the question. Here's the expected answer. Now, what does it get back?" And then it's able to use that as part of the comparison. That's right.
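Passing the ground truth to the judge as context, as just described, might look roughly like the sketch below. The prompt wording, the JSON reply format, and the function names are assumptions made for illustration; the actual model call is left out, and this is not the prompt Mastra or the Clover scorer uses.

```typescript
// Hypothetical sketch: pass the ground truth to an LLM judge as a rubric.
// Prompt wording and reply format are assumptions; no model is called here.

type JudgeInput = { question: string; groundTruth: string; answer: string };
type JudgeVerdict = { score: number; reason: string };

function buildJudgePrompt(input: JudgeInput): string {
  return [
    "You are grading an AI agent's answer.",
    "Use the expected answer as a rubric, not as an exact string to match.",
    "Question: " + input.question,
    "Expected answer: " + input.groundTruth,
    "Agent answer: " + input.answer,
    'Reply with JSON only: {"score": <number 0 to 1>, "reason": "<one sentence>"}',
  ].join("\n");
}

/** Parse and validate the judge's reply; throws on malformed output. */
function parseJudgeReply(reply: string): JudgeVerdict {
  const parsed = JSON.parse(reply);
  if (
    typeof parsed !== "object" || parsed === null ||
    typeof parsed.score !== "number" ||
    typeof parsed.reason !== "string" ||
    parsed.score < 0 || parsed.score > 1
  ) {
    throw new Error("Judge reply did not match the expected format");
  }
  return { score: parsed.score, reason: parsed.reason };
}

const judgePrompt = buildJudgePrompt({
  question: "What plants do I currently have in my active list?",
  groundTruth: "Lists the plants recorded in the garden registry.", // placeholder
  answer: "Placeholder for the agent's answer.",
});
// A canned reply stands in for the model's response.
const verdict = parseJudgeReply('{"score": 0.5, "reason": "Partly matches."}');
console.log(judgePrompt.split("\n").length, verdict.score); // 6 0.5
```

Because the agent's wording changes from run to run, the judge is asked to compare against the rubric rather than test for string equality, which is the distinction drawn in the exchange above.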
I guess you probably want a balance of deterministic scorers that just use code, right? Like the completeness one.
It just uses an algorithm, old school, right?
>> Yep. Yep. And then LLM-as-judge, if you run this on every trace, I guess that gets kind of expensive kind of quickly.
>> That's right. That's right. Yeah, it can get expensive. So that's certainly why you want to limit that, or even just don't run that on live traces. Oh, the other thing I did want to show, because I think this is... you know, you could run scores against every trace or user interaction. The other thing is you can actually build data sets out of your traces. So let's say you had a really interesting interaction. So I'm going to go farther back here, let's add a custom range. Let's do just yesterday. Apply that. And so if I go, let's take a look... >> We're showing these filtering features, by the way. The ability to search and filter and narrow down things is really valuable, for sure.
Yeah. So let's say, for instance, I'm just picking one out of the hat here, but I look at this one and I say, hey, this was a really good trace. I want this as a ground truth good experience, or maybe a bad experience, like it's adversarial. I can just click save as a data set item. And then I can add it to the existing data set. I really need to add a create new data set button there. So, note to self.
When you work in product, all you ever see are opportunities. But I see that there. So we can add that to any existing data set. And now, anytime we run our eval, we're running it against that real-world trace that we pulled in. So that's pretty cool, because that can be a little bit more sophisticated and realistic than whatever you or your agent might make up.
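The "save as a data set item" step can be thought of as a small transformation from a recorded trace into an eval item. The sketch below is one way to model it, with hypothetical shapes and names throughout. Treating a "good" trace's output as the ground truth, and leaving it empty for an adversarial one, is a design assumption of this sketch rather than documented Studio behavior.

```typescript
// Hypothetical sketch of promoting a recorded trace into a data set item.
// The trace and item shapes are illustrative, not Mastra's storage format.

type RecordedTrace = { traceId: string; input: string; output: string };
type TraceLabel = "good" | "adversarial";
type TraceDatasetItem = {
  input: string;
  groundTruth: string | null; // null means a reviewer still has to write it
  label: TraceLabel;
  sourceTraceId: string;
};

function traceToDatasetItem(trace: RecordedTrace, label: TraceLabel): TraceDatasetItem {
  return {
    input: trace.input,
    // Assumption: only a trace judged "good" can serve as its own reference.
    groundTruth: label === "good" ? trace.output : null,
    label,
    sourceTraceId: trace.traceId,
  };
}

/** Add an item unless the same trace was already saved. Returns a new array. */
function addToDataset(items: TraceDatasetItem[], item: TraceDatasetItem): TraceDatasetItem[] {
  const alreadySaved = items.some((existing) => existing.sourceTraceId === item.sourceTraceId);
  return alreadySaved ? items : items.concat([item]);
}

const savedTrace: RecordedTrace = {
  traceId: "trace-123", // hypothetical id
  input: "What plants do I currently have in my active list?",
  output: "Placeholder for the agent's recorded answer.",
};
let traceDataset: TraceDatasetItem[] = [];
traceDataset = addToDataset(traceDataset, traceToDatasetItem(savedTrace, "good"));
traceDataset = addToDataset(traceDataset, traceToDatasetItem(savedTrace, "good"));
console.log(traceDataset.length); // 1, the duplicate is skipped
```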
We can certainly look at some questions before wrapping up here.
I think, Joel, one thing I've seen come up a lot when I've been researching and learning about evals is this idea of data labeling, right? Where maybe you work with a domain expert. Suppose you're building some accounting software or a medical agent, and you as a developer don't really know what is the right answer, what is wrong, what is ambiguous. So you invite a domain expert into the platform, in this case Studio. They can put a little data label next to things to say, oh, that's right, that's wrong. And you start encoding failure modes that you can then solve with your agent. Is that something that Studio supports today? And if not, is it on Mastra's road map?
>> We have it. I think making a really great story out of this is definitely part of our road map. We have this in the agent evaluation thing, so you can pick any of these interactions, for instance. So let's say this was actually a good one: even though the tool call wasn't correct, it does actually have my list of plants. So that's good. So I might want to send this to review.
Okay? So now I'm in my review tab. I can hand this off to my subject matter expert, the gardener. In this case that happens to be my wife. And I would say, hey, does this match our list of plants? And then I also maybe want an LLM expert, maybe that's more me in the relationship, to say, hey, did this use tools correctly? You can add any kind of tagging. Let's say, like... oh, cool... "missing tools", and I can create that tag. But then I might want to make a note, which is like, "Despite missing the tool call, it was correct.
Might be memory."
So now I've got that stored, and anybody can come and see that. So, we'll mark that one as complete. I've reviewed that. It's now out of our review queue.
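The review flow just shown (send to review, tag, add a note, mark complete) can be modeled as a small queue. This is only a sketch of the idea with hypothetical names and shapes; it says nothing about how Studio stores reviews.

```typescript
// Hypothetical sketch of a human review queue for traces.
// Names and shapes are illustrative, not Studio's data model.

type ReviewStatus = "pending" | "complete";
type ReviewEntry = {
  traceId: string;
  tags: string[];
  notes: string[];
  status: ReviewStatus;
};

function sendToReview(queue: ReviewEntry[], traceId: string): ReviewEntry[] {
  const entry: ReviewEntry = { traceId, tags: [], notes: [], status: "pending" };
  return queue.concat([entry]);
}

/** Attach a tag (once) and a free-text note to an entry. */
function annotate(entry: ReviewEntry, tag: string, note: string): ReviewEntry {
  const tags = entry.tags.indexOf(tag) === -1 ? entry.tags.concat([tag]) : entry.tags;
  return { traceId: entry.traceId, tags, notes: entry.notes.concat([note]), status: entry.status };
}

function markComplete(queue: ReviewEntry[], traceId: string): ReviewEntry[] {
  return queue.map((entry): ReviewEntry =>
    entry.traceId === traceId
      ? { traceId: entry.traceId, tags: entry.tags, notes: entry.notes, status: "complete" }
      : entry,
  );
}

function pendingEntries(queue: ReviewEntry[]): ReviewEntry[] {
  return queue.filter((entry) => entry.status === "pending");
}

let reviewQueue = sendToReview([], "trace-123"); // hypothetical trace id
reviewQueue = reviewQueue.map((entry) =>
  annotate(entry, "missing tools", "Despite missing the tool call, it was correct. Might be memory."),
);
reviewQueue = markComplete(reviewQueue, "trace-123");
console.log(pendingEntries(reviewQueue).length); // 0, the entry left the queue
```

Tags collected this way are what later become named failure modes, such as "missing tools", that a prompt or agent change can target.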
So, it's very agent-centric right now. We've built a lot of interesting capabilities around agents specifically, but we definitely want to expand that out to workflow calls and any kind of trace that might come in.
Fantastic. Thank you, Joel.
Yeah, absolutely.
If there are any questions, you can let us know in the chat.
But otherwise, I think we're pretty much at time here. I'm just going to let you know about an upcoming Mastra workshop.
Next week, I'm joined by Ward, who is the head of open source at Mastra. We're going to go deep into Mastra's streaming architecture. So, you have an agent, and of course it's great to iterate on it in Studio, but ultimately your goal is to plug that agent into a real application, right? And that's where you start to connect your client front end, be that a React app or a mobile app, to your Mastra agent.
Yes, we stream the agent tokens when you do a simple agent run, but there's so much more to it, right? For example, if your agent calls a sub-agent, maybe you want to show that sub-agent's outputs.
Mastra also has a workflow orchestration layer, and so you might want to stream the progress of a workflow.
And of course, you probably don't want to build every UI layer from scratch.
So, we'll show you how Mastra integrates with tools like AI SDK UI and CopilotKit, so you can build really nice user interfaces that put your agent in front of the people you want to. So, check your inbox after this workshop. I'm going to personally email everybody with a link to this recording, the slides, and a link to register for next week in case you want to join us at the same time, at the same place.
Thank you very much, and I hope you have a fantastic day. Joel, thanks for joining us. It's been great.
Thanks for having me, Alex. Ciao, everybody.
Ciao.