Agent observability differs fundamentally from traditional observability because agents are non-deterministic systems that generate voluminous, semi-structured traces full of unstructured text (an individual trace can exceed 1 GB). Handling that data requires specialized databases with write-ahead logs, analytical indexes, and full-text search. It also brings non-technical users such as clinicians and lawyers into quality assurance, through human annotation that becomes a training signal for automated scoring functions.
Deep Dive
How agent o11y differs from traditional o11y - Phil Hetzel, Braintrust
Thanks for joining me today, towards the end of the day here. I hope everyone has enough energy left for what may be your most exciting topic of the day. That remains to be seen. The topic is how traditional observability differs from agent observability. Quick agenda.
I'll do a quick intro about myself and the company that I work for. This is not going to be a very product-forward talk. It's going to be more theoretical.
So, I won't drown you in sales slides, I promise.
Then we'll get into how these two ideas differ, and also talk about what's next in the space.
My name is Phil Hetzel. I lead solutions engineering for Braintrust.
What that means effectively is that my team and I are the folks charged with making sure that our customers are getting the most value out of the platform as quickly as possible.
Prior to Braintrust, I spent 12 years in consulting and systems implementation.
I led the global Databricks practice for Slalom Consulting before I came here. I noticed that a lot of my customers were prolific at creating generative AI proofs of concept, but not nearly as good at bringing those proofs of concept to production. So, I started using Braintrust as a user first, I really liked it, I applied for a job, and I've been here for about a year.
I like to play chess outside of work. I like to spend time with my wife and dachshund. That's Pistol Pete right there. That's my dog.
Um he's a person in brown and not black.
Those of you who have been to my sessions before didn't laugh at that joke because you've heard it at least once already.
What is Braintrust? Braintrust is an agent quality platform. We mainly look at agent quality in two different ways.
Is your agent performing as well as you thought it would when it's in production?
That is, can you remain confident in your agent? On the other side of that, as you're experimenting with new versions of your agent, do you feel like you can become confident as you tweak it and change it over time?
These are a couple of the things that Braintrust does, obviously relevant to today's discussion because agent observability is a massive part of what we do. Has anyone heard of Braintrust? Show of hands. Before this week, had you heard of Braintrust?
Yeah, okay, a couple of folks. Well, welcome back to the folks that have heard of us before.
I'm going to go pretty quickly through the slides. Hopefully, we have enough time for questions as well. I don't have a ton of content.
Traditional observability is established. So, when folks come to us, they'll say, "Well, we already have open source tools like Grafana, as an example. Why wouldn't this be the same problem that we're already solving, perhaps with an implementation or a contract that we already have?" It's very established, and we know that these applications can operate at scale.
So, the case that I'll be making is that the scope of traditional observability is actually quite different from the scope of agent observability, and I'll explain why.
The scope of traditional observability is all about uptime and technical performance. Is the application up, and is the application giving the user experience that we would expect from a technical lens? So latency, duration of interactions, 400- and 500-level errors: these are all things that we're measuring with very established tools like Grafana or Datadog. I will even say that at Braintrust, though we are an agent observability platform, we're happy users of Datadog. It's great for this specific type of use case, for us to understand if people are running into 500- or 400-level errors on our website, as an example. Is the system operational? Are we up or are we down? That's what traditional observability is.
The building blocks of this are a couple of different things.
Metrics. These are the things that you're measuring. I gave a couple of examples before, but latency is the most obvious one. Error count is another. These are the things that you can aggregate and measure over time.
And then traces and spans. Is everyone here familiar with observability? Does everyone know what a trace is? Okay. I don't want to take that for granted. A trace is just a full interaction of some workflow.
And a span is just one step within that interaction. All of these things apply to agent observability as well. So, we have the same building blocks.
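To make the trace, span, and metric vocabulary concrete, here is a minimal sketch in Python. The class and field names (`Span`, `Trace`, `start_ms`, and so on) are illustrative assumptions, not the schema of any particular observability tool.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step within an interaction, e.g. a model call or a tool call."""
    name: str
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """A full interaction of some workflow, made up of spans."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def latency_ms(self) -> int:
        # End-to-end latency: earliest span start to latest span end.
        return max(s.end_ms for s in self.spans) - min(s.start_ms for s in self.spans)

    def error_count(self) -> int:
        # A simple metric that can be aggregated over many traces.
        return sum(1 for s in self.spans if s.error)


trace = Trace("t1", [
    Span("llm_call", start_ms=0, end_ms=120),
    Span("tool_call", start_ms=120, end_ms=200, error=True),
])
print(trace.latency_ms(), trace.error_count())
```

Latency and error count here are the kind of metrics that aggregate cleanly over time, which is why the same building blocks carry over to agents.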
Problem one for why agent observability is different: agents are non-deterministic, whereas applications are deterministic. The reason why we love LLMs so much is because they have high variety. They can do a lot of different things. They are abstracted.
Typical applications have very deterministic code paths, and it's on purpose that they do that: they're performing some type of known control flow.
Agent applications are very much non-deterministic. We're curious about why an agent might take one path versus the other. This also means that traditional observability can focus on very constrained and known metrics, whereas agent observability needs to be a little bit broader in terms of the things that it needs to measure. This is just an example of that. So, let's start at the bottom.
Agent observability can measure some of these more traditional metrics, albeit with more of an AI flair. Time to first token, total tokens, duration, latency. These are all things that you would think of as very traditional observability-level metrics.
But you might also want to understand more qualitative things about your application. So, it's not just how long did I take to start responding to my user, which is more traditional observability. I want to know: was the information that I gave grounded in the context that I gathered with my application? Did I use the tools that I would have expected as I was reasoning towards my response? Is the response aligned to the brand standard that I set for this agent in the system prompt?
These are all things that traditional observability tools can't really test, because if you think about it, the information in the trace that's necessary for us to compute these things up at the top is far larger than the volume that a traditional observability trace would handle.
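The contrast between the two kinds of measurement can be sketched in a few lines of Python. Both functions below are hypothetical illustrations: the first is a traditional-style metric that needs only timestamps, while the second needs the actual content of the trace (which tools were called).

```python
def time_to_first_token_ms(request_sent_ms: int, token_times_ms: list[int]) -> int:
    """Traditional-style metric: delay before the first streamed token."""
    return min(token_times_ms) - request_sent_ms


def expected_tools_score(expected: set[str], called: list[str]) -> float:
    """Qualitative-style check: fraction of expected tools the agent actually called."""
    if not expected:
        return 1.0
    return len(expected & set(called)) / len(expected)


ttft = time_to_first_token_ms(1000, [1350, 1400, 1480])
score = expected_tools_score({"search_orders", "refund"}, ["search_orders", "send_email"])
print(ttft, score)
```

Checks such as groundedness or brand alignment usually need the full prompt, retrieved context, and response text, which is why the trace payload grows so much larger than a timing record.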
That kind of goes to the next point here. Agent traces are really nasty.
They're nasty in a variety of different ways.
They're nasty because they're highly semi-structured. Even within that semi-structured data, there's a ton of unstructured text that we need to chew through.
They're voluminous: an agent trace can be over a gigabyte in size. We've seen that. Even with our own customers, an individual span can be 20 megabytes in size. So, it's just a far different systems problem that you have to solve in order to ingest, process, and most importantly use that type of data.
And also, it arrives just as fast as traditional observability data. Hopefully, the agent that you're putting in production gets product-market fit, and you have a ton of users and usage associated with it.
You, as the AI engineer or as the product manager for that agent, are going to want to see that observability in real time, in true real time. Trust me, we know that's the case because we always get the feedback.
Can you just make it faster?
We're always trying to make it faster.
People always want it to be faster.
That's tough to do when the agent traces look like this, basically. This is just an example of an agent trace in Braintrust. Not only does it have a bunch of spans here, encompassing the model calls and tool calls, but even within those spans, you saw the amount of unstructured text that's in there as well. Very different problem to solve.
A little bit more here.
Maybe I'll just dive into the read pattern piece specifically.
We need to do two things simultaneously.
We need to be able to perform the very fast ingest-and-read workflows that are common with observability. That is, if someone does an action with my agent, I need to be able to see that interaction basically instantaneously.
We also have to support read patterns where someone wants to use our CLI and fire off SQL commands to us, so that they can incorporate either observability or eval traces to improve their application automatically. There are just a lot of different mediums that people use now in order to query these very large trace shapes.
This is a completely new systems problem. At least at Braintrust, we designed a database from the ground up specifically for agent traces.
I'm not really going to go into depth about this. We have a blog post on our website, I think it was the last one that we published, if you're really interested in diving deep. But just very quickly, there are a lot of different components that we had to build into this database in order to make it work. For example, we need to immediately get data into a write-ahead log so that people can see these traces as soon as they expect to.
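The general write-ahead log idea can be sketched as follows. This is a toy illustration of the technique, not Braintrust's implementation: the `TinyWAL` class and its JSON-lines file format are assumptions made for the example.

```python
import json
import os
import tempfile


class TinyWAL:
    """Append-only log: a write is durable and readable as soon as append() returns."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._file = open(path, "a", encoding="utf-8")

    def append(self, record: dict) -> None:
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # force the bytes to disk before acknowledging

    def replay(self) -> list[dict]:
        # Readers (or crash recovery) rebuild state by replaying the log in order.
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def close(self) -> None:
        self._file.close()


wal_path = os.path.join(tempfile.mkdtemp(), "spans.wal")
wal = TinyWAL(wal_path)
wal.append({"trace_id": "t1", "span": "llm_call"})
wal.append({"trace_id": "t1", "span": "tool_call"})
records = wal.replay()
wal.close()
print(len(records))
```

The design point is that the cheap sequential append happens first, so a trace is visible right away, while heavier work such as building indexes can happen afterwards.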
We need to be able to perform indexing on this data so that whenever someone is performing a filtering or analytical query, it's fast. And we have this thing called a Tantivy index. Tantivy is an open-source framework that we forked. Anyone know what Tantivy does?
Any guesses?
Tantivy is how we perform text-style indexing.
So if you remember when I was showing this trace, it makes so much sense for someone to want to perform the workflow of: okay, I just want to know every trace that had the word Amazon in it.
Well, it turns out it's really hard to do that unless you perform a full-text index across your traces.
That's another reason why agent observability is far different from traditional observability. You really don't have to think about the text problems in traditional observability.
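The "every trace that had the word Amazon in it" workflow is what an inverted index makes cheap. The sketch below shows the core idea in plain Python. It does not use Tantivy's API, and the sample trace texts are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(traces: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of trace ids whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for trace_id, text in traces.items():
        for token in tokenize(text):
            index[token].add(trace_id)
    return index


traces = {
    "t1": "User asked to track an Amazon order.",
    "t2": "User asked about the refund policy.",
    "t3": "Tool returned: amazon.com delivery estimate.",
}
index = build_index(traces)
hits = sorted(index.get("amazon", set()))
print(hits)
```

Without the index, answering the same question means scanning the full text of every trace, which is expensive when individual traces are very large.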
Is that the same as open source?
Sorry? Tantivy is open source, and it's most similar to Apache Lucene, except it's written in Rust.
Yeah.
And then all of these things come together and have to be unified through SQL or a SQL-like language.
That's the route that we've gone at Braintrust.
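As a rough illustration of why a SQL-style interface is convenient here, the sketch below uses Python's built-in `sqlite3` module to combine a text filter with an analytical aggregate in one query. The `spans` table and its columns are assumptions for the example, and SQLite stands in for whatever query engine a real platform exposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (trace_id TEXT, name TEXT, duration_ms INTEGER, output TEXT)")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?)",
    [
        ("t1", "llm_call", 120, "Your Amazon order ships Friday."),
        ("t1", "tool_call", 80, "order_status=shipped"),
        ("t2", "llm_call", 300, "Refunds take five days."),
    ],
)

# One query mixes an analytical aggregate with a text filter.
rows = conn.execute(
    """
    SELECT trace_id, SUM(duration_ms) AS total_ms
    FROM spans
    WHERE trace_id IN (SELECT trace_id FROM spans WHERE output LIKE '%Amazon%')
    GROUP BY trace_id
    """
).fetchall()
print(rows)
```

A `LIKE` scan is fine for three rows. At real trace volumes, the text predicate would be served by a full-text index rather than a scan.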
Problem three: there's a very specific type of persona for traditional observability. It's a systems engineer, maybe it's a product engineer. It's probably not a subject matter expert, or if it's a medical application, it's not a clinician or a registered nurse. It's very technical people that align with traditional observability.
That could not be further from the truth for agent observability, if you're doing it well. We notice that the best teams building agents have both technical and non-technical people in the fold performing this work, because it's the non-technical people that are either, A, closest to the users or, B, have knowledge that is closest to the problem space.
And what can they do now with prompts?
They can write them in natural language.
So, they can add real value by participating in building agents.
We have folks that are clinicians or registered nurses or wealth advisors or lawyers. We have seen them operate in our platform, looking through traces and using that information to improve their agents.
That is a workflow that you simply don't see in traditional observability, where you're more worried about uptime.
I think in general people don't realize what it takes to perform observability, and also evals, well. We kind of think of observability and evals as the same problem. The only difference with evals is that you're running them in batch and you know the inputs ahead of time.
It's incredible, the depth that you end up going into when you create a platform like this. It looks like this because of the reasons that I've described: the nuances with the data, and the number and types of people that you have to bring into the fold.
Those are some of the reasons why it's so different to work in this space.
Where is this space going?
We've done a lot of work in this area. I think the natural question that we used to get asked was: if you, Braintrust, are collecting all of our agent traces, and all of our agent traces have all of this valuable data in them, can't you just tell me how people are using my agent? And it's the simple questions that usually need the most complex systems behind them.
This is something that we are starting to do. We just rolled it out, I think about a month ago, in our software-as-a-service offering. We see agent observability traces come in, and then we run a very lightweight LLM on top of them to perform embedding and then clustering on those traces, so that we can perform topic modeling. That lets you see, for example, how people are using your agent (their intent), how people are feeling about interacting with your agent (the sentiment), or, if they're running into issues, what those issues potentially are.
The whole idea there is to make the iteration loop between a problem that you're seeing in production and the fix that you perform experimentation on faster and a little bit more direct.
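The embed-then-cluster step can be sketched with a toy stand-in. In the example below, a bag-of-words count vector replaces the LLM embedding and a greedy similarity threshold replaces a real clustering algorithm. Both substitutions, along with the `0.4` threshold and the sample inputs, are assumptions chosen to keep the sketch self-contained.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(texts: list[str], threshold: float = 0.4) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters: list[tuple[Counter, list[str]]] = []
    for text in texts:
        vec = embed(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]


user_inputs = [
    "track my order status",
    "order status please",
    "cancel my subscription",
    "how do I cancel my subscription",
]
groups = cluster(user_inputs)
print(groups)
```

Each resulting group approximates one user intent. Summarizing and labeling the groups is the part where a language model would typically come in.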
I promised I would go through that really fast. I've got about 3 minutes for questions, if there are any, and I'd be really happy to answer them.
Anyone curious about this? Yes.
So, Braintrust is clearly about the functional observability of agents.
I would say technical as well.
Would you say it's also good for non-functional agent performance, or is traditional observability good for that?
That's a good question, yeah. Well, I think traditional observability can do that. Braintrust specifically does, and I like the way that you put that, functional observability: what's the quality of my agent, as I've defined it.
And then the technical observability just kind of comes on the house. When you trace the application, you automatically get prompts, duration, time to first token, cache hits, etc.
In your iceberg slide, you have human annotation at the waterline there. Could you explain what the human annotation part is in this product?
So, let's think about it this way. Actually, if I can, I'll go on a high-wire act here and just show and not tell.
So, let's say that you have a trace come in and you want your product manager to be able to opine on whether that agent did a good job or a bad job.
It's really valuable for you to have an expert come in and grade the agents, but then also justify why they're grading the agents the way that they are. Because eventually, you're going to take those justifications, you're probably going to run an LLM over them, and you're going to make more scalable scoring functions from those justifications. You're finding the failure modes that you can then implement in automated scores.
Yeah. Human annotation is a really key part of this process.
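The path from expert grades to automated scores can be sketched in Python. Everything below is hypothetical: the `Annotation` fields, the failure-mode labels (which an LLM pass over the written justifications might produce in practice), and the `cites_source` check are illustrative assumptions.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    trace_id: str
    good: bool
    failure_mode: Optional[str]  # label distilled from the expert's justification


annotations = [
    Annotation("t1", False, "missing_citation"),
    Annotation("t2", True, None),
    Annotation("t3", False, "missing_citation"),
    Annotation("t4", False, "wrong_tone"),
]

# Step 1: count which failure modes experts flag most often.
failure_counts = Counter(a.failure_mode for a in annotations if not a.good)
top_failure, _ = failure_counts.most_common(1)[0]


# Step 2: turn the most common failure mode into an automated check.
def cites_source(output: str) -> float:
    """Score 1.0 if the answer contains a bracketed citation like [1], else 0.0."""
    return 1.0 if re.search(r"\[\d+\]", output) else 0.0


print(top_failure, cites_source("See the policy [2]."), cites_source("Trust me."))
```

Once a check like this exists, it can run on every trace, while the human reviewers move on to finding the next failure mode.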
Yes, in the second row.
Yeah, I just have a question. I agree, you just focused on observability things today, but I'm actually interested in how you also integrate with other agentic frameworks to have a closed loop, like for offline optimization. And then also, I guess the main difference that I see in your database is that you're not using OLAP, because that's the...
We used to use ClickHouse, actually. Yeah, we moved away from it.
Yeah, I was just curious to learn why you built your own. Like, what is the efficiency of that?
Well, the funny answer there is that our founder is kind of an insane person. Only an insane person would build their own database, but he is cut from that cloth.
He was one of the first employees at SingleStore, so he's kind of used to doing that. But what he found was, I think it was, let's see, this slide. He found that when he was performing some of these workloads, he just needed more of the text-based indexes, which ClickHouse wasn't really able to do, at least at that time. So we built our own.
And then the first part of your question: observability and evals, to us, we solve with the same system. The only difference is that with evals, we know the inputs ahead of time and we're doing it in batch. With observability, we don't know what the inputs are ahead of time and we're doing it in real time.
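The "same system, different input timing" idea can be sketched as one scorer used in two ways. The `fake_agent` function and the trivial `non_empty` scorer are hypothetical stand-ins so the example runs on its own.

```python
from typing import Callable, Iterable, Iterator


def fake_agent(question: str) -> str:
    # Hypothetical stand-in for a real agent call.
    return f"Answer to: {question}"


def non_empty(output: str) -> float:
    return 1.0 if output.strip() else 0.0


def run_eval(dataset: list[str], scorer: Callable[[str], float]) -> float:
    """Offline eval: inputs are known ahead of time and scored as a batch."""
    scores = [scorer(fake_agent(q)) for q in dataset]
    return sum(scores) / len(scores)


def score_live(requests: Iterable[str], scorer: Callable[[str], float]) -> Iterator[tuple[str, float]]:
    """Online observability: inputs arrive one at a time and are scored as they come."""
    for q in requests:
        yield q, scorer(fake_agent(q))


batch_score = run_eval(["What is a span?", "What is a trace?"], non_empty)
live_scores = list(score_live(iter(["unseen user question"]), non_empty))
print(batch_score, live_scores)
```

Because the scorer is shared, a production trace that scores poorly can be added to the offline dataset and used in the next round of experiments.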
But I mean the experiment functionality. How easy is it to integrate with, like, I'm training on...
Oh yeah, it should be pretty easy. And I apologize that I'm making this about the product. But when you have a trace come in, once you've traced it, you just add it to an offline dataset, basically, so that you can experiment upon it.
I'm not sure if there's anyone after us in this room. Do we have to...
I'm here. I'm not sure.
Okay. Oh, is it? Okay. I'm happy to go on then. Great. Perfect.
Do you want to talk specifically about Braintrust, for that answer? Yeah.
So, there is the online scoring piece here, where it's a known unknown and you can very much put a score behind that.
But there are also ways, and these are not scores, this is the unknown unknowns piece, where we can derive insight from it in a more open-ended way.
Got it. Yeah, thanks.
Probably time for one question.
If not, great. I appreciate everyone's attention today. Thank you.