Generative AI agents should not be isolated to data scientists or machine learning engineers because the underlying models are already built by companies like Anthropic and OpenAI, shifting the focus to prompt engineering, context engineering, distributed systems, and human annotation—skills that require diverse expertise including product engineers, systems engineers, and domain experts rather than traditional ML engineering skills.
Deep Dive
Does GenAI "belong" to data scientists? - Phil Hetzel, Braintrust
Today we're going to talk about whether agents, or agentic development, really belong to data scientists or machine learning engineers. How many people here would describe themselves as either (a) a data scientist or (b) a machine learning engineer?
Okay. This is going to be awesome. I'm glad that no one has brought any rotten tomatoes, because the answer that I'm going to give is probably not going to be exactly to your liking, but give me a chance to justify why. What we're going to do today is talk through a couple of things. I'll introduce myself, introduce the company that I work for, and then we'll get into the topic.
I'll probably rip through the slides pretty quickly and hopefully leave some time for Q&A.
But before I do that, I'll introduce myself. My name is Phil Hetzel. I lead the solutions engineering team at Braintrust. I'll go over Braintrust in a second. (I think there's a seat right there. Yeah.)
The solutions engineering team is basically the people who help our customers get the most value out of the platform as quickly as possible.
Prior to Braintrust, I spent 12 years in consulting and systems implementation. My last role in consulting was leading the global Databricks business unit at a company called Slalom Consulting, and I noticed that a lot of my customers were really prolific at creating generative AI proofs of concept, but not nearly as good at bringing those proofs of concept to production. So I started using Braintrust as a user first, and I liked the product so much that I applied for a job. I've been here for about a year since then.
Outside of work, I like to play chess, but I'm not very good at it, and I like to spend time with my wife and my dog, Saint Pistol Pete. He's pictured over there. He's the one in brown, not the one in black. The one in black is me.
What is Braintrust? Braintrust is an agent quality platform. We approach agent quality through two pillars: evals and observability.
Evals are what you do in experimentation, as you're tweaking and building your agent, to become confident in your agent's execution once you push it to production. Agent observability, to us, means that once the agent is in production, you remain confident in its execution when it's confronted with real usage and real users. There are some other ancillary things that the platform does, but in general, that's what we do. I'm really not going to talk about the product today. If you're interested in the product, you can find me at the booth downstairs. Other than that, I'm very happy to get into the content.
Here are some observations from the last year of watching some of the top teams build agents across many different industries.
I think there are two different types of organizations that we work with: the traditional enterprise and the AI natives.
Traditional enterprises approach agentic development a little differently than AI natives.
In a traditional enterprise, a person in charge, a person of note, a CEO or CIO, will read something in the CIO or CEO monthly magazine that says they need to be building agents.
Then they'll tell their delegate, "You need to be building agents, because that's the thing that's going to take us to the AI promised land."
That then gets further delegated to an existing ML or data science platform team that already has a lot of the tooling in place. Since generative AI has "AI" in the name, it's a pretty natural fit to hand that capability over to an existing AI or data science team. Is anyone in that bucket today, where you were machine learning platform engineers previously and you just got handed generative AI because that seemed like the best fit?
Yeah, got it. There's no judgment or connotation here; it's just something that I've observed. There's a whole other set of companies that are more AI native and think less about what already exists, because for these companies nothing really existed before generative AI started to gain popularity. In fact, they started building their entire offering around agents. So rather than having an AI/ML platform team, they'll just have a small team of engineers who are agile enough to grow with the times. And rather than having very specific segments of things that they do, everyone is very much cross-functional across both product engineering and AI engineering.
The other thing that's interesting about these AI natives is that, since these are typically smaller companies, each person has more proximity to the problem. That is, they have a better understanding of what the end agent is actually meant to solve.
There are two differences between traditional ML and generative AI. First, the model's already built. So much of what data scientists and machine learning engineers do is going through that data pipeline of training a model. What do we do when the model's already built? The other interesting nuance is that if you want to add value to these models, you can do it not necessarily with feature engineering but with natural language, which could bring a different skill set into the conversation.
So, just to make this more clear: I know that it's so much more complicated than this, but abstracted to a certain level, this is kind of what data scientists and machine learning engineers do. It is a data pipeline of training and testing, making sure that you're not overfitting, and eventually deploying that model where it can be used by some downstream product team. That is what data scientists and machine learning engineers are used to doing.
This, though, has already been done.
Anthropic, OpenAI, and Mistral have already done the data process of grabbing that data, putting it through the pipeline, training the underlying LLM, and then deploying it through an endpoint so that their consumers can use it. The one nuance here is that while Anthropic, OpenAI, and Mistral will be doing testing of their own, we as AI teams still need to perform evals after we've implemented those APIs in our product. That needs to be very important to us. That's really the only nuance between these two images.
This goes back to the second of those two differences. How do we change these predictive applications, before and after? With traditional ML, you either add more data to retrain the model, or you perform feature engineering to adjust how the underlying model is performing, and then you perform a lot of A/B testing to understand whether your model changes have provided lift or not. With generative AI, since the model is already trained, and setting aside fine-tuning that model, which is pretty rare, the way that you change its behavior is just by changing the inputs: the prompts and the context that you're giving that model. So there are a lot of folks who will be performing context engineering on top of these models, and those folks could have a better understanding of how real users could be using that agent.
They'll have closer proximity to the problem.
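To make "changing the inputs" concrete, here is a minimal sketch under stated assumptions. The `build_prompt` helper, the instruction text, and the returns policy are all hypothetical and invented for illustration, and no real model API is called. The only point is that the change between the two versions is an edit to text that a domain expert could make, not a retraining run.

```python
def build_prompt(instructions: str, context: list[str], question: str) -> str:
    """Assemble the text sent to an already-trained model (hypothetical layout)."""
    context_block = "\n".join(f"- {item}" for item in context)
    return (
        f"{instructions}\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


# Version 1: a generic instruction with no domain context.
v1 = build_prompt(
    instructions="You are a helpful assistant.",
    context=[],
    question="Can I return this item?",
)

# Version 2: a domain expert edits the instruction and adds context.
# No model weights change; only the input text does.
# The policy sentence below is made-up example data.
v2 = build_prompt(
    instructions="You are a support agent. Cite the policy you rely on.",
    context=["Returns are accepted within 30 days with a receipt."],
    question="Can I return this item?",
)
```

Both versions would then be sent to the same model endpoint and compared with evals.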
So, I'm going to make the case for and against agents belonging to data scientists. Let's say I was debating the position that it really does belong to data scientists and traditional machine learning engineers.
Agents use models.
In our organization, models are governed by data scientists. So a data scientist will have a lot of underlying knowledge about how neural networks work, and thus how LLMs work. Because of that, they're going to have a far better appreciation of the risks inherent in using this very complex technology. The other thing here is that they'll have very rigorous processes to push models and model assets to production. They will understand some type of testing process that they can use to keep the company safe and make sure that end users are getting the experience that they need.
And then number three, very related: just a very rigorous mindset around testing.
The counterpoint to that is, again, the model's already built, so we don't necessarily need to do any training and testing.
It's an entirely different pipeline. We're not doing the whole cross-validation dance. And this is probably the biggest argument:
Does an ML engineer or a data scientist really know what they're testing for?
One of the things that I've noticed with some of these teams is that they will really lock on to the traditional ML engineering metrics like precision, recall, and F1, and they'll obsess over those metrics because that is what has gotten them to this point. But when you're analyzing agents, there is a far broader surface area that you need to be evaluating. You need to be evaluating the functional performance of that agent rather than just the technical performance that we're traditionally used to working with.
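As an illustration of that broader surface area, the sketch below scores a single agent run on several functional questions at once instead of one classification metric. It is a sketch under assumptions: the `Trace` record, the scorer names, and the example tool and facts are hypothetical and do not come from the talk or from any particular eval library.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """A hypothetical record of one agent run."""
    tools_called: list[str] = field(default_factory=list)
    steps: int = 0
    final_answer: str = ""


def used_expected_tool(trace: Trace, expected_tool: str) -> float:
    """1.0 if the agent called the tool a domain expert expected, else 0.0."""
    return 1.0 if expected_tool in trace.tools_called else 0.0


def within_step_budget(trace: Trace, max_steps: int) -> float:
    """1.0 if the agent finished within the allowed number of steps."""
    return 1.0 if trace.steps <= max_steps else 0.0


def mentions_required_facts(trace: Trace, facts: list[str]) -> float:
    """Fraction of required facts that appear in the final answer."""
    if not facts:
        return 1.0
    answer = trace.final_answer.lower()
    return sum(fact.lower() in answer for fact in facts) / len(facts)


def score_trace(
    trace: Trace, expected_tool: str, max_steps: int, facts: list[str]
) -> dict[str, float]:
    """Combine several functional checks into one score card for a run."""
    return {
        "tool": used_expected_tool(trace, expected_tool),
        "budget": within_step_budget(trace, max_steps),
        "facts": mentions_required_facts(trace, facts),
    }
```

The expected tool and the required facts are the kind of inputs a subject matter expert would supply, which is part of why the evaluation is not purely a data science task.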
So, the argument here: let's say that I was arguing that agents belong to non-data scientists, which could be both technical and non-technical experts.
We could make the case that LLMs are just APIs. Product engineers are very used to using APIs as they build applications. A massive part of what they do is reaching out, grabbing information from another system based upon some payload, and bringing that information back in a way that's useful to the end users. That is a thing that product engineers do.
The other thing about agents that is unique is that a very complex agent could be running across many different types of compute if it's a distributed agent; that is, you have a supervisor agent up here, and it's calling different child agents or sub-agents that might be running on different infrastructure. And as they're running on different infrastructure, they might be calling different systems as a result. That can be a very complex systems problem that might not be up the alley of someone with more of a statistics or math background.
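The supervisor and sub-agent shape can be sketched in a few lines, which also hints at where the systems problems come from (routing, timeouts, partial failure). This is a minimal sketch, not a real agent framework: the sub-agents are local stand-in coroutines, and the names `billing_agent`, `shipping_agent`, `SUB_AGENTS`, and `supervisor` are hypothetical. In a real deployment each sub-agent might be a network call to separate infrastructure.

```python
import asyncio


async def billing_agent(task: str) -> str:
    # Stand-in for a sub-agent that might run on separate infrastructure.
    await asyncio.sleep(0)
    return f"billing handled: {task}"


async def shipping_agent(task: str) -> str:
    # Another stand-in sub-agent, possibly backed by a different system.
    await asyncio.sleep(0)
    return f"shipping handled: {task}"


SUB_AGENTS = {"billing": billing_agent, "shipping": shipping_agent}


async def supervisor(tasks: list[tuple[str, str]], timeout_s: float = 5.0) -> list[str]:
    """Fan (route, task) pairs out to sub-agents concurrently, keeping input order."""

    async def run_one(route: str, task: str) -> str:
        agent = SUB_AGENTS.get(route)
        if agent is None:
            return f"no agent for route: {route}"
        try:
            # A slow or unreachable sub-agent should not stall the whole run.
            return await asyncio.wait_for(agent(task), timeout=timeout_s)
        except asyncio.TimeoutError:
            return f"{route} timed out"

    return await asyncio.gather(*(run_one(route, task) for route, task in tasks))


def run_supervisor(tasks: list[tuple[str, str]]) -> list[str]:
    """Synchronous entry point for scripts."""
    return asyncio.run(supervisor(tasks))
```

Calling `run_supervisor([("billing", "refund order 1"), ("shipping", "track order 2")])` returns one result string per task, in the same order as the input.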
And then finally, more on the non-technical side, it's really valuable to have subject matter experts or product managers be able to control the actual prompts that we're seeding the agent with. These people are the ones who have the most proximity to the problem that the agent is trying to solve. So there's a lot of lift in having a non-technical person have a lot of say in how the agent performs. Not only that, there can be a very large human annotation workflow that goes into making great agents. As you see these interactions, a non-technical person who has a lot of domain expertise in how the agent is supposed to perform can look into an agent trace and describe whether or not the agent is performing well and, most importantly, why that's the case.
So, where I'm landing with all this: to all of the people who raised their hand and said, "Yes, I'm proudly a data scientist or ML engineer," I'm not going to stand here and say, "Well, guess what? You need to completely refresh your skill set." That would be a very silly thing to say, because I figured there would be a lot of data scientists in the room, and I'm at least smarter than that. But it does make sense to have a very diverse team when you're building these platforms. It makes sense to bring both non-technical and different types of technical people into the fold.
How can data scientists add value to building agents? A couple of different ways. All of these ways are separate from actually helping to build the product itself; I think that's inherent. With the tools that we have available to us now, it's actually quite easy to add value to a product even if you are not coming from a product engineering background. But what I think is really valuable is that data scientists can, to use an overloaded term, add the guardrails to this process. A lot of people are very aggressive in how they implement LLMs. They don't understand how the underlying technology works. They don't come from a stats background. I think data scientists can be the adult in the room during those situations and say, "This is how the LLM is trained. It's just predicting token after token. It doesn't actually know anything, really. It's just a bunch of stats problems at the end of the day."
I also think that LLM-as-judge is a huge part of the eval process when you're building agentic applications. Again, people are very tempted to just believe LLM judges when they're performing evals.
They're just prompts and models at the end of the day, and it's very easy to create a labeled data set and compute the traditional recall, precision, and F1 style metrics on them, which data scientists will have expertise in. And the last one, which is the most technical one, of course: if you do need to fine-tune an open-source model very specifically to your use case, that's probably going to be the most fun and technical thing, where data scientists and machine learning engineers can add a ton of value.
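The check described above, scoring an LLM judge against a labeled data set, can be sketched as follows. This is an illustration under assumptions: the function name `judge_metrics` is hypothetical, labels are treated as booleans where `True` means "this response is acceptable", and the human label is taken as ground truth.

```python
def judge_metrics(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Precision, recall, and F1 of an LLM judge, with human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")

    # Count agreement and disagreement on the positive ("acceptable") class.
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example labels for six responses.
human_labels = [True, True, True, False, False, True]
judge_labels = [True, True, False, True, False, True]
metrics = judge_metrics(human_labels, judge_labels)
```

With these example labels the judge has three true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 0.75. A low precision would mean the judge is passing responses that humans reject, which is the failure mode to watch for before trusting it in evals.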
The ideal mix here: in addition to that top section, we want product, application, and systems engineers to implement into the product itself the requirements that the non-technical experts are giving them.
We want to make sure that the systems we're building around these agents, that is, where the agents are executing, are such that they lead to a great user experience. And then finally, and this is probably something that the data scientists can pitch in on as well, implement actual eval and observability pipelines so you have that feedback loop between what's happening in production and what's happening in experimentation.
For non-technical experts, we want them to be performing a ton of human annotation and a lot of prompt and context engineering. They have the closest proximity to the problem. You need to bring them into the fold if you want an agent that is very relevant to your use case.
So, what's next? The answer is always in the middle. I hope I didn't fully insult half the people in the room today. If I have, feel free to come to the Braintrust booth and give me an earful. That's completely fine. But the idea here is that there's a ton of value for data scientists. Just make sure that you're bringing more folks into the room as you're building agents. Two minutes for questions. I know that we're keeping a very tight timeline. Yes, sir.
Yeah, really good. I like the conclusion. I do have a question on the framing.
I view agents as a tool. So anybody in the organization could, in theory, build and own an agent.
And they may have more domain expertise on the data science side than the engineering side. Rather than thinking about who owns the tool, whether this is the machine learning team or that's the computing team,
how do you view thinking about it based on the problem that's being solved, and seeing agents as a tool to solve the problem rather than the agents as the...
I think you're thinking about it the exact same way. It's a product that a diverse team builds.
I think the mistake that I see a lot of typically traditional companies make is they say, "Oh, we're making another predictive model," and they isolate it to the ML engineers or data scientists and say, "Go build these agent things." I think we're actually thinking about it very similarly. Maybe time for one more question. Yes.
Yeah, I really echo the message that you're trying to convey, personally.
But I'm curious about the tooling that you are offering for closing the loop.
I think that, at least from my experience, the missing delta is the tooling to facilitate the inter-, like in traditional machine learning with inter-, like, variation and the like. Is that something that Braintrust is thinking about? Like how to make it easy for our domain expert to update the system?
Yeah, for sure. There are a lot of things that we do to lean into that domain expert persona. We do have a human labeling component as a part of our platform, and we do have an agent and prompt playground where people can experiment with their own prompts and send them to the underlying agents themselves.
Yeah. And what is the systematic way that you keep that evaluator up to date, and also the system up to date, and, I guess, do the error analysis to understand when the error is based on the evaluator?
The idea is that we gather data from production to continually add to the offline data set that we're evaluating on. And then, hopefully, we're gaining grounded data along the way, so we can check ourselves to understand whether our evals are starting to align more with human judgment or whether they're diverging.
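One simple way to watch for that alignment or divergence is to track, batch by batch, how often the automated evaluator agrees with human labels on the same production samples. The sketch below is an assumption-laden illustration, not a description of how Braintrust implements it: the function names, the boolean pass/fail labels, and the 0.05 tolerance are all hypothetical.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Share of samples where the evaluator's verdict matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def agreement_trend(batches: list[tuple[list[bool], list[bool]]]) -> list[float]:
    """Agreement rate for each (human_labels, judge_labels) batch, oldest first."""
    return [agreement_rate(human, judge) for human, judge in batches]


def is_diverging(trend: list[float], tolerance: float = 0.05) -> bool:
    """True if the latest batch agrees noticeably less than the first batch."""
    return len(trend) >= 2 and (trend[0] - trend[-1]) > tolerance


# Made-up labels for two batches of production samples.
batches = [
    ([True, True, False, False], [True, True, False, False]),
    ([True, False, True, False], [True, False, False, False]),
]
trend = agreement_trend(batches)
needs_review = is_diverging(trend)
```

Here the second batch agrees on only three of four samples, so the trend drops and `needs_review` is set. In practice the disagreeing samples are the ones worth sending to error analysis, to decide whether the agent or the evaluator is at fault.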
Yeah. Okay, everyone, that's my time. I really appreciate the attention today. If there are any more questions, find me downstairs.