Spec-driven validation is a testing methodology for AI agents that goes beyond traditional test datasets by explicitly defining agent specifications including rules (e.g., discount limits), domain ontologies, internal terminology, rights and roles, and robustness requirements (e.g., handling typos and rephrasing). This approach enables security testing by identifying where agents are most vulnerable based on their intended tasks, and ensures tests remain valid across infrastructure changes by being independent of implementation. The key insight is that larger models are not necessarily safer because they have more attack surface and can execute complex instructions that smaller models cannot understand, making explicit behavioral specifications essential for reliable agent deployment.
Deep dive
Spec-Driven Testing for Agents With A Brain the Size of A Planet - Steven Willmott, SafeIntelligence
>> So, nice to meet you. I'm Steve, the CEO of Safe Intelligence. We're a company that has been around for 3 years.
We go very, very deep into ML validation.
We use formal verification techniques. We started out on vision models, tabular data models, and a bunch of other types of models where we actually have the model available, and we look at whole regions of the input space to see whether the test points in them tip over and do the wrong thing under perturbations. That's where the company started out, and we have a whole bunch of products in that space. Yesterday we released a new product that does something analogous for language models.
Obviously we don't have the language model, so what we're trying to do instead is be very clever about how we generate edge cases and test cases.
I won't talk about that too much. We have a booth, so come and chat to us there. If you've seen these ducks around, these are ours. If you didn't get one, I have a whole box of them here, so feel free. They say "think harder" on the front, so you can put one on your desk to be reminded of what you should be doing.
Today I'm going to talk about something closely related to what Phil from Braintrust talked about just now. We like what Braintrust does a lot, and I think one of the inherent problems is: how do you actually specify what an agent is supposed to do?
I think people are familiar with spec-driven development. This is not going to be about developing code with specs. That's also very important, and we do a lot of it in the company for the products that we build. But this is about how you specify what an agent or an AI system is supposed to do. In ML you typically use a data set to do that: you run all those examples and you look at F1, accuracy, and things like that. And that's sort of telling you what you want the agent or the system to do.
But as we'll see, there's actually a lot more to it when you deploy things. So that's the focus of the talk: how do we actually specify what agents are supposed to do?
My key starting point is something that seems like an obvious question: a smarter agent is a better agent, right? If I have a smarter agent, and I'm using a bigger model, it's going to be better at doing the job that it's supposed to do. In general you'd expect that to be right, but I think most people have had the experience that it's not always true. So there are some problems. If you're familiar with the book The Hitchhiker's Guide to the Galaxy, there's a robot called Marvin who has a brain the size of a planet. He's normally asked to do things like make the tea, and he gets extremely bored and extremely depressed. So this depressed robot is kind of a theme in the book.
If you haven't read the book, by the way, you absolutely have to read the trilogy, which is a five-part trilogy. That will tell you something about the style of the humor.
In any case, as with Marvin, there are challenges with having massive models. Some of the jailbreaks actually work better on large models because they're smarter. If you encapsulate something in a poem and give it to a relatively low-end model, the low-end model doesn't even understand the poem, whereas a larger model will be like, "Oh, I can take this out and I can execute the bad instruction that's wrapped up in the poem." So it's not obvious that bigger is safer, and it's not obvious that bigger is better.
Another thing is that if you're building agents that have a very broad remit, they can do a lot of things. That creates a lot of surface area for someone to exploit, and it creates a lot of surface area to test if you want to be sure that the agent is actually doing the things you want it to do.
And obviously there's a cost issue, right? If you're using large models to do something relatively simple, like simple math, you're going to be paying for tokens and it's going to be slow compared with something that's very optimized. So in general, if you're building agents for deployment, especially for fully automated use, there's this trade-off between smart and capable in one direction and safe in the other.
So what you're really seeking is an agent built on a model that's good enough to perform but not capable of doing arbitrary harm. And there are kind of two parts to that arbitrary harm. One is what kind of instructions it can receive, and how flexible it is about how those are formulated, what the prompts look like. The other part is what tools and tasks it can carry out in your infrastructure. If it's able to wire millions of dollars to people, that's obviously a lot more risky than if it can just answer questions.
So this is the balance that you're basically looking for.
But how do you actually define what good looks like? I think it's pretty obvious that it's not just a data set of inputs and outputs that are pretty good, with the rest left to guesswork.
It's also sometimes hard to define what harm looks like, because sometimes an agent doesn't do the right thing but is kind of just failing at the task, and sometimes it's doing exactly the wrong thing because it was asked to do something bad.
So what is this idea of spec-driven validation, or spec-driven testing, you can call it that as well? It's basically: what are the things we would want to do if we were just designing the role or the task benchmark by itself, independent of the agent?
We already talked about data sets. The ground truth, having a bunch of examples of what good looks like, is one thing. So that's one component.
Often the customers we work with also have rules. If they've got a customer support agent, they want to say things like: don't ever give a discount of more than 10%, or we don't allow refunds if it's more than 30 days past the purchase. So there are these rules, right? Your alarm bells should be going off a little bit already, because how do you actually test for sure that a rule is never violated? It's pretty hard.
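As an illustration, the two example rules from the talk could be written down as checkable predicates. This is only a minimal sketch: `AgentAction`, `Rule`, and `violated_rules` are hypothetical names, and it assumes the agent's decisions can be captured as a structured record. Passing such checks on sampled conversations shows that a rule held on those samples; it does not prove that the rule is never violated.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional


@dataclass
class AgentAction:
    """Hypothetical structured record of what the agent decided in one turn."""
    discount_percent: float = 0.0
    refund_granted: bool = False
    purchase_date: Optional[date] = None


@dataclass
class Rule:
    """A named business rule; `holds` returns True when the action complies."""
    name: str
    holds: Callable[[AgentAction, date], bool]


RULES = [
    # Never give a discount of more than 10%.
    Rule("max_discount_10_percent", lambda a, today: a.discount_percent <= 10.0),
    # No refunds more than 30 days after the purchase.
    Rule(
        "no_refund_after_30_days",
        lambda a, today: not a.refund_granted
        or (a.purchase_date is not None and (today - a.purchase_date).days <= 30),
    ),
]


def violated_rules(action: AgentAction, today: date, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules that the given action breaks."""
    return [rule.name for rule in rules if not rule.holds(action, today)]


if __name__ == "__main__":
    bad = AgentAction(discount_percent=15.0, refund_granted=True,
                      purchase_date=date(2025, 5, 1))
    print(violated_rules(bad, today=date(2025, 6, 30)))
```

Keeping the rules as data, separate from the agent, means the same list can be reused by whatever harness generates the test conversations.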
Sometimes you also have ontologies or dictionaries that are relevant. An example would be if you're building an airline chatbot: that particular airline might only fly to certain destinations. So that's the relevant universe of things you need to think about.
You may have internal terminology in your company that applies to your policies and that no one else in the rest of the world knows about. That's also part of the spec, right? If you're going to build an agent, you will be building that into the agent. But if you're going to test it, you need to tell the testing system what these things are and what a valid substitution is.
There's also domain knowledge. You may have very specific scientific or finance agents, or others, that need to know which terms are substitutable. For example, take gross profit and gross sales: an LLM in general conversation might confuse those two terms, but in business they're very different things.
So this specific domain knowledge is relevant to testing as well.
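One way to hand this kind of domain knowledge to a test generator is as explicit data. The sketch below is an assumption-laden illustration: the destination list and the term groupings are placeholders rather than real airline or accounting data, and `in_universe` and `valid_variants` are hypothetical helper names. The point it shows is that rephrasings are generated only within a group of interchangeable terms, so "gross profit" is never swapped for "gross sales".

```python
import re
from typing import List, Set

# Placeholder ontology: the only destinations this hypothetical airline serves.
SERVED_DESTINATIONS: Set[str] = {"Lisbon", "Porto", "Madrid"}

# Placeholder groups of terms a domain expert has declared interchangeable.
# Terms in different groups must never be substituted for one another.
SUBSTITUTION_GROUPS: List[Set[str]] = [
    {"gross sales", "gross revenue"},
    {"gross profit", "gross income"},
]


def in_universe(destination: str) -> bool:
    """True if the destination belongs to the agent's relevant universe."""
    return destination in SERVED_DESTINATIONS


def valid_variants(prompt: str, groups: List[Set[str]] = SUBSTITUTION_GROUPS) -> List[str]:
    """Rephrase a prompt using only substitutions allowed by the spec."""
    variants = []
    for group in groups:
        for term in sorted(group):
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            if pattern.search(prompt):
                for other in sorted(group - {term}):
                    variants.append(pattern.sub(other, prompt))
    return variants


if __name__ == "__main__":
    print(valid_variants("What was gross profit in Q3?"))
    print(in_universe("Tokyo"))
```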
And then we might have rights and roles.
The agent may perform differently if you're logged in, if you're logged out, if you have certain rights and permissions, and things like that. And then the last one, which is pretty important, is robustness requirements. So one thing is: I've got my test set. That should work, right?
But it needs to work under stress. In vision, where we started out, it's things like: can I detect this runway for the plane to land on? Can I detect it at sunset, at sunrise, in fog? How much fog can there be? How much can the camera shake before the thing doesn't work? And it's actually similar with agents.
If you're building a customer-facing agent, could typos disrupt it? How many typos does it take to disrupt it?
How frustrated will people get?
Rephrasing: how stable under change are the results?
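The typo question can be turned into a number. The following is a rough sketch under stated assumptions: the agent is modeled as a plain function from prompt to reply, typos are simulated by swapping adjacent characters, and answers are compared by exact match (a real harness would likely need a semantic comparison). `add_typos` and `stability` are hypothetical names.

```python
import random
from typing import Callable


def add_typos(text: str, n_typos: int, rng: random.Random) -> str:
    """Simulate typos by swapping n_typos randomly chosen adjacent character pairs."""
    chars = list(text)
    for _ in range(n_typos):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def stability(agent: Callable[[str], str], prompt: str, n_typos: int,
              trials: int = 20, seed: int = 0) -> float:
    """Fraction of perturbed prompts whose reply equals the reply to the clean prompt."""
    rng = random.Random(seed)
    baseline = agent(prompt)
    same = sum(agent(add_typos(prompt, n_typos, rng)) == baseline
               for _ in range(trials))
    return same / trials


if __name__ == "__main__":
    # Toy stand-in for an agent: a keyword matcher that typos can easily break.
    def toy_agent(prompt: str) -> str:
        return "refund_flow" if "refund" in prompt.lower() else "fallback"

    for n in (0, 1, 3, 6):
        print(n, stability(toy_agent, "I would like a refund please", n))
```

Sweeping `n_typos` upward gives a crude answer to "how many typos disrupt it": the level at which the stability score starts to drop.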
So really the point here is that we need to go beyond the test set and have task- and role-specific benchmarks for the agent itself.
And what do you then do with that?
Oh, maybe I already talked about some of these examples. But these are just examples of the kind of things you'd have if you had a product support agent. We've worked with quite a few people doing this.
In LLM land, people have started to use "the eval" to mean, roughly, the test set.
Which sort of makes sense, but I think we have to think of going beyond the eval as well. There's this concept of an agent card, which comes from the A2A spec and has been around in other things as well, which describes what the agent does. It's also relevant here. And then obviously there's all the context around this. If you're a company deploying agents, you want your tests to look like something that has these various elements that are relevant. That's a fair eval. And you want to build more and more of these. They look like integration tests, if you're coming from an engineering perspective.
Often some of these things are implicit, but you want to make them explicit.
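To make the "implicit versus explicit" point concrete, the spec components listed in the talk (examples, rules, ontology, terminology, roles, robustness requirements) could be gathered into one structure. This is a hypothetical shape, not a format defined in the talk or by any standard; the field names and the `implicit_parts` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentSpec:
    """Hypothetical container for what an agent is supposed to do."""
    task: str
    examples: List[Dict[str, str]] = field(default_factory=list)   # ground-truth input/output pairs
    rules: List[str] = field(default_factory=list)                 # e.g. discount limits
    ontology: Dict[str, List[str]] = field(default_factory=dict)   # e.g. served destinations
    terminology: Dict[str, str] = field(default_factory=dict)      # internal term -> meaning
    roles: Dict[str, List[str]] = field(default_factory=dict)      # role -> permitted actions
    robustness: Dict[str, float] = field(default_factory=dict)     # e.g. tolerated typo rate

    def implicit_parts(self) -> List[str]:
        """Names of components that are still empty, i.e. not yet made explicit."""
        parts = ["examples", "rules", "ontology", "terminology", "roles", "robustness"]
        return [name for name in parts if not getattr(self, name)]


if __name__ == "__main__":
    spec = AgentSpec(
        task="product support",
        rules=["Never give a discount of more than 10%."],
        roles={"logged_out": ["answer_faq"], "logged_in": ["answer_faq", "open_ticket"]},
    )
    print(spec.implicit_parts())
```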
So what can you do with this?
In our platform, we do two things with it. We do security checks.
We actually pull the specs that an agent is supposed to fulfill into security testing. Why do we do that?
Generally, if you know what an agent is trying to do, you know the edges of where it's vulnerable, because it's going to be willing to talk about the domains it's supposed to act in, right? So that's where it's most likely to be vulnerable. Second, it will have more power to act in the infrastructure on the tasks it performs. If it's a banking agent or something like that, it will be able to work in that area. So that's a place where you can pull this spec information in.
And then there's the robustness side, which is about doing its job properly: can we vary the inputs and see how much of a range it has in terms of answering the questions properly?
So we built a product to do this, but my point here is not to shill the product. I think that if you're testing agents in any context, using any infrastructure, trying to be explicit about the various bits that are on this slide and bringing them together is a useful thing to do.
From an industry perspective, I think there are lots of things going on, but just calling out two: there are a lot of prompt management platforms that allow you to be fairly elaborate about why a test exists, and things like that. This is all useful when you actually want to generate variants of the test, because you want this context.
As I said, from the A2A spec you've got agent cards. They're quite long, but here's an example of a skill.
You would also realize that even if you have this, it doesn't give you enough to actually evaluate the agent.
You still want to know what range of change is valid: maybe, in this case, what kind of people the meeting could be booked for, and so on.
I can talk a lot more about how hard it is to create variations within the sort of envelopes that a spec might create, but in general my point here is: as you think about evaluating agents, start thinking about not just the eval data set or benchmark, but also the task and the context for the task, and how you capture that.
So hopefully we can make Marvin a little bit happier, because he has the specs and he kind of knows what he's supposed to do.
And then, yeah, specify the behavior of your agents. That's kind of the key thing to do here.
Stay independent of the implementation, because you may be building in LangSmith or Vertex agents or something else, but later on you may change to a different infrastructure. You want to keep those integration tests, all of the unit tests, and the penetration tests, and run them independently.
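One common way to keep tests independent of the agent framework is to have the suite depend only on a narrow interface and to wrap each backend in a small adapter. The sketch below assumes the narrowest possible interface, a function from prompt to reply; `run_suite`, `backend_a`, and `backend_b` are hypothetical stand-ins and do not use any real framework's API.

```python
from typing import Callable, Dict, List

# The only thing the test suite knows about an agent: prompt in, reply out.
Agent = Callable[[str], str]


def run_suite(agent: Agent, cases: List[Dict[str, str]]) -> List[str]:
    """Return the prompts whose reply does not contain the expected substring."""
    failures = []
    for case in cases:
        reply = agent(case["prompt"])
        if case["must_contain"].lower() not in reply.lower():
            failures.append(case["prompt"])
    return failures


# Stand-ins for two different infrastructures. In practice each would be a thin
# adapter around whatever framework currently hosts the agent.
def backend_a(prompt: str) -> str:
    return "Our refund window is 30 days from purchase."


def backend_b(prompt: str) -> str:
    return "Sorry, I can't help with that."


CASES = [
    {"prompt": "How long do I have to return an item?", "must_contain": "30 days"},
]

if __name__ == "__main__":
    for name, backend in (("backend_a", backend_a), ("backend_b", backend_b)):
        print(name, run_suite(backend, CASES))
```

With this shape, moving to a new infrastructure means writing one new adapter while the cases stay untouched.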
And this is also a way to close the loop. Part of our inspiration in thinking about what should go into a spec is: what would you need in order to run the agent automatically, get the results, and then start to iterate and try to fill the robustness gaps that have appeared? So it's like a backyard type of RL. It's not proper RL, because you're not doing it on the model, but you're kind of jury-rigging something around the outside.
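The run, inspect, patch, repeat cycle described here can be sketched as a small outer loop. Everything in this example is a toy assumption: the "agent" only answers phrasings listed in its configuration, and the "improvement" step simply adds the failing phrasings. In a real setup the patch might be a prompt change or a guardrail, and the model itself is never trained.

```python
from typing import Callable, Dict, List, Set, Tuple

Config = Dict[str, Set[str]]
Agent = Callable[[str], str]


def build_agent(config: Config) -> Agent:
    """Toy agent: answers correctly only for phrasings listed in its config."""
    known = config.get("known_phrasings", set())
    return lambda prompt: "30 days" if prompt in known else "I don't know"


def improve(config: Config, failures: List[str]) -> Config:
    """Toy fix step: extend the configuration with the phrasings that failed."""
    known = set(config.get("known_phrasings", set())) | set(failures)
    return {"known_phrasings": known}


def close_the_loop(variants: List[str], max_rounds: int = 5) -> Tuple[Config, int]:
    """Run the agent, collect failures, patch the config, and repeat until clean.

    Returns the final config and the number of patch rounds that were needed.
    """
    config: Config = {"known_phrasings": set()}
    for round_number in range(max_rounds):
        agent = build_agent(config)
        failures = [v for v in variants if agent(v) != "30 days"]
        if not failures:
            return config, round_number
        config = improve(config, failures)
    return config, max_rounds


if __name__ == "__main__":
    final_config, rounds = close_the_loop(["refund window?", "refnud window?"])
    print(rounds, sorted(final_config["known_phrasings"]))
```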
That's the key point. Where do we go from here? We're obviously building product around this, but I've been in computer science for a long time.
My last company did API infrastructure, so if you've used the OpenAPI spec, I apologize. It's partly my fault: I helped write that spec way back in the day. So, all about open.
So we're thinking about how you express these things in a way that you could just have in a GitHub repo, pull into whatever tool you want to use, pull all the different pieces together, and kind of just version the hell out of that stuff. If anyone's interested in that, I'd love to chat. That's my talk.
Come to our booth. We have a game you can play. If you play by four, you can win some of the LEGO prizes up there.
You need a bit of knowledge, to be fair, or you need to be insanely lucky.
But yeah, that's my talk. Thanks a lot.
>> [applause]











