Harness Engineering is an agent-first approach to software development where the key engineering task shifts from manually producing every implementation detail to designing "harnesses" (structured guardrails, context delivery systems, and verification loops) that enable AI coding agents to produce high-quality, aligned software reliably. This approach addresses three foundational limits: human time scarcity, attention constraints in both humans and LLMs, and context window limitations. Effective harness engineering involves making non-functional requirements legible to agents, surfacing context just in time, using LLMs as judges for review, and systematically capturing human feedback to continuously improve agent performance.
Deep Dive
Ryan Lopopolo - Harness Engineering: How to Build Software When Humans Steer and Agents Execute
Thank you kindly. We've only got a few sessions left this afternoon and then it's party time, I've been told. And for you at home watching on the live stream, you can have your own party, whatever kind of party you want to have. I'm not going to judge. So next up we've got Ri... Ryan Leolo [Lopopolo].
Close. Sorry.
I'm so sorry. Ryan, from OpenAI. I'll just get off stage. This guy, give him a big round of applause. Thank you very much. [applause]
>> Hello, AI Native DevCon. Woo. I'm excited to be in the home stretch here on this first day, which has been jam-packed and super fun. And I'm excited today to talk to you about harness engineering, which is near and dear to my heart, having kind of invented the term here. To me, the way that we go about working with these agents is something that is fundamentally brand new, and we don't really know all the good parts yet. But hopefully today I can walk you through some of what I believe the good parts of working with these agents are and how to be effective in your own codebases.
To give a little bit of context on why you should listen to me about this: back in June of last year, when we had just had the earliest reasoning models around o3 and the very earliest versions of Codex CLI, which is OpenAI's coding agent, I had an insane idea that I would try and get this tool to do my job. At the time, with less capable models, that wasn't feasible. I asked the agent to read my alerts channel in Slack and triage a page, and it would not do that. So I got myself into this operating mode of presenting myself as a tool to the model in order to empower it to solve problems and issues and write code on my behalf, and ended up in this very quickly accumulating snowball of effective use of this tool by giving it more and more powerful tools and more and more context around what it means to do the job. There's a bunch of patterns here that make that effective and stack really well for your teams, and I'm going to go through them today.
I know I'm preaching to the choir here.
Everybody's AI native. That's why we're at the con here. But the way we build software has changed pretty significantly in the last six months. I would say in December, with the introduction of GPT-5.2 and Opus 4.5, we really reached singularity levels of software engineering, with code production being something that these tools do insanely well. And this is a level of disruption that I think we have typically only seen once every decade. The last one that I can think of is probably the existence of the cloud as a tool to accelerate ourselves. With that sort of cadence of disruptive innovation, we have had a lot of time to internalize changes to our workflows and the way we go about building. But here the technology keeps changing so rapidly with every point release of these models that I find myself very often having to re-evaluate my priors of what is even possible to achieve with these tools. And I think if you're not in the habit of completely retooling your stack and the way you work with every point release of the model, you are in a way missing out on what you can achieve with these tools.
The reason that the way we build software has changed, and continues changing at an increasingly rapid pace, is that we have upended some of the core axioms of what it means to build software.
Right now I'm telling you that models are good enough to do significant parts of the software engineering life cycle: not just writing code but debugging, triaging, responding to customers, planning, scheduling work, all these other bits that are outside of what you might call the core production function of a software engineer.
A lot of the way we have tooled teams, organizations, and roadmaps has been built around this idea that the production of code is a very expensive thing that is going to dominate most of our headcount resources and is slow. In this world where we can give a prompt to a coding agent and get a PR or six out of it, that constraint is no longer true. And we have these teams who were doing the bulk of the production of software that need to figure out ways to increase their leverage by delegating increasing parts of that responsibility to these machines.
So for all the software engineers, engineering managers, product managers, and designers who are trying to incorporate this technology into your work, your goal is to figure out how to unblock your execution team, these coding agents, so they can make your ideas, your vision, and your products a reality.
So, having just told everybody here that the old core constraints on software engineering no longer apply, what are the core constraints now? We have a new set of problems to contend with when using agents to produce our software. And to me these three things are the foundational limits that remain in a world where we are, as a team of humans and agents, producing software.
Human time is the fundamentally scarce resource that we have. You know, I know I max out probably at three concurrent sessions on my laptop. If I want to be more parallel and have higher throughput, I must find ways to remove my own synchronous attention from the process.
Human and model attention are the second foundational limit. In the architecture of these LLMs, attention must sum to one. Thrashing the agents by having them do more and more work with conflicting and overbearing requirements in the course of a task is always going to degrade performance; less and less over time, but it is one of those core limits of the models. So we need to retool the way we work in order to be more parallel, fork off a bunch more tasks, and be willing to accept smaller or larger or many more PRs in order to let the agents explore what it means to do the job that we need them to do.
And finally, you all probably deeply live this one: the model context window. It's something that gets bigger over time, but it's still a scarce resource, something we need to protect. I will say in my own experience with the GPT series of models, autocompaction is fantastic. I never think about a context window anymore. I can let a task go for 6, 12, 36 hours and still get good results. But the context window being obliterated and rebuilt over the course of these autocompactions is something you must contend with. And there are ways that we structure the context we give the model, or continually resurface context to the model, to deal with this constraint that context windows are continually being emptied and filled.
Okay, so we've got these agents that we hope can produce more and more of our software, so that we can remove humans more and more from the loop in order to produce more code, more features, solve more user needs, and address more critical user journeys with higher quality and fidelity.
How do we make sure that we as a team with our agents do a good job? I think a newish thing here is that we have to actually articulate that. We have to write it down. It used to be the case that, you know, we'll have people, we'll have meetings, and through osmosis people will understand what it means for us as a team to write high-quality software and to work effectively together. Agents just do not have that capability. They don't have presence in our standup. They don't have this durable memory that accumulates context and battle scars over time. So we have to find ways to make all these non-functional requirements of writing good software legible to the agent. And as an LLM, the thing that it craves, the thing that drives it, is text. So figuring out ways to take the definition of what it means to do a good job and write it down is a net new function for a software engineering team in 2026.
But it's not enough to just write things down. We need to make sure that this text is a thing that the agent will look at, because it doesn't do much to say "you will write reliable network code by making sure that retries and timeouts are consistently applied" if that text never makes it to the agent. So the important thing is figuring out ways that we can not only write things down but also have them pulled into context at the right time, in ways that don't thrash the agent and still lead it to be creative and reason, which is the power of these models.
So let's take a step back and look at some in-the-small instances of context, and what it means to think systematically and close loops for the models and for your team.
If I were onboarding a new engineer to my team and I was reviewing some React code that they had written, and I knew that for this particular set of components we use Suspense because that leads to better performance in the front end, I would be able to give that feedback once to the human. They would incorporate it into their mental model of the codebase and what it means for these different screens to relate to each other well, and I would largely solve that problem going forward by empowering my teammate to know more about what it means to do a good job. But I can't really do that with an agent in the same way. So I have to step back from the feedback on an agent-produced PR and figure out a way to prevent these mistakes going forward. The agent was misaligned, so how do I figure out where I can write it down? What lints can I have fail? What tests need to exist? Can I empower a reviewer agent to look at all the proposed diffs through the lens of these guardrails, so that this feedback is actually durably encoded as a static guardrail that we apply to every PR going forward? It's not enough to do point-in-time fixes with these models.
We want to make every mistake something that is just not possible. I never want to give the same review feedback twice.
And this is really the core of what harness engineering is. Harness engineering is making context around what it means to do a good job legible, and then surfacing it just in time to the agent over the course of its trajectories, in order to steer and refine its output so that every PR we get adheres to the golden thread of what we consider to be acceptable, high-quality, aligned software.
There are things that I would normally consider good practice around DevOps, like shifting left as far as possible in order to make things cheaper earlier in the process. I don't do that at all when working with agents. In fact, I try to put my interventions as far right in the process as I can in order to minimize my own synchronous time having to engage with these issues.
For example, if I'm working on a PR and I realize I got a bad result, it might just be the case that I'll trash it, change my prompt, and probably get something good out of it. But that's not really a durable thing. It's not reliable. I don't socialize those improvements to my team. So the next level of shifting that left is to write it down. And if writing it down is not enough, writing it down and then empowering a review agent to judge every diff is another way I can shift that left. And then I can shift it left further into statically verifiable lints and guardrails and tests, and on and on, earlier in the process. We think about this as needing to surface to the agent all those sets of non-functional requirements.
It is not the case that these agents don't know how to write high-quality software. They absolutely do. But as artifacts of their training, they have seen every possible permutation of every possible choice that goes into producing software. And it's up to us to prune latent space to tell it which choices we want to make. Um, you know, if I am using these things to prototype a new data science model in a Jupyter notebook, I have a very different set of choices I make in the production of those diffs than I do if I am working on adding a new index type to a database.
They're just fundamentally different tasks. So it's up to us as owners of our codebases to make legible the sets of decisions that we make in order to produce our code: what it means for something to be a prototype versus a production feature that requires a staged rollout with A/B tests and feature flags. If we write this down and give the agent some tools to reason about what type of change is being made, and to find runbooks that are appropriate to refining its output over the course of its PRs and epics, we can give it bounds and context but still give it the space to reason, be creative, and cook.
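One way to picture "tools to reason about what type of change is being made" is a small lookup from changed file paths to the guardrail documents that apply. The sketch below is only an illustration: the path patterns, the `docs/guardrails/...` file names, and the `select_guardrails` helper are all hypothetical and do not come from the talk.

```python
from fnmatch import fnmatch

# Hypothetical mapping from path patterns to guardrail/runbook documents.
# Neither the patterns nor the file names come from the talk.
GUARDRAILS = [
    ("notebooks/*", "docs/guardrails/prototype.md"),
    ("src/db/*", "docs/guardrails/database.md"),
    ("src/ui/*", "docs/guardrails/frontend.md"),
]
DEFAULT_GUARDRAIL = "docs/guardrails/general.md"


def select_guardrails(changed_paths):
    """Return the guardrail docs relevant to the changed files, without duplicates."""
    selected = []
    for path in changed_paths:
        matches = [doc for pattern, doc in GUARDRAILS if fnmatch(path, pattern)]
        # Fall back to the general guardrails when no specific pattern applies.
        for doc in matches or [DEFAULT_GUARDRAIL]:
            if doc not in selected:
                selected.append(doc)
    return selected


if __name__ == "__main__":
    print(select_guardrails(["src/db/index_types.py", "notebooks/explore.ipynb"]))
```

A prototype notebook and a database change pull in different documents, so the agent only pages in the rules that match the kind of work it is doing.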
One maybe non-obvious thing is that because the agents crave text, every bit of text that we feed them is in some sense prompting. It's going to inform what tokens get predicted, which means it's going to inform the code and the diffs that we produce. This means all the code in the repository itself, outside of the documentation knowledge base, is also a prompt. So if we think about aligning the codebase, or unifying it all on the same patterns, we limit the amount of attention the model needs in order to do a good job. If I am able to standardize on OTel (OpenTelemetry) across my entire stack, for example, then when the model thinks "observability", it's able to translate context that it sees in one part of the repository over to something halfway across the codebase without any loss of quality or intelligence. But if I have six observability stacks in the codebase, the model's going to have to spend a lot more time figuring out: which one do I use here? Is this migrated or not? What is canonically good?
So over the course of the PR, there are three phases I think about when we're talking about context delivery.
And because we are curating the codebase in order to make it efficient to deliver context to these agents, we also want to encode that in the operating loop we give the model. To me, the most important thing that ends up in that AGENTS.md file is a numbered set of steps that we expect the model to go through over every rollout that we do, over every session. We first want it to ground itself in the documentation knowledge base and in the ticket that is proposed.
We want it to spider through our history of ADRs and design docs to figure out how this might impact other features of our codebase. We want it to look at the critical user journeys to inform itself around what screens and user surfaces are impacted, so it can keep the QA plan in mind over the course of its execution. We expect some amount of slowness during this process because we want to page in all the context around what it means for this feature to slot in globally. Then there's the messy middle part of the run, where the agent is writing code, running tests, and exploring the codebase. For that we exploit the fact that these agents are going to call a bunch of tools and run a bunch of tests, and we use those to just-in-time prompt-inject the agent to steer its output back to baseline.
The tests we write and the lints we write for agents are very different from the ones that we write for humans. They by default recognize that agents are going to truncate tool call outputs, and that they respond really well to descriptive error messages that point them to runbooks for remediation steps. And we are willing to have very many of these things that are kind of fiddly to write and that I wouldn't normally think about. To go back to the network code example: I am sure all of you have been paged at some point in your careers for an outage that boiled down to a missing timeout and retry on a cross-service network call, and the collective amount of engineering time that has been spent on this very common failure mode is astounding. But still today, there's no code that asserts that we pass retries around. There's no ESLint plugin that I can slot into my codebase that is going to do this for me.
But because the production of code is very very cheap now, we can absolutely vibe a set of guard rails into place with 100% code coverage and exhaustive table driven tests and migrate the codebase all in one go and just in time surface this failure to the model every time it writes another fetch call and never have to worry about this again.
And because we don't have to pollute context window upfront and we can exploit the fact that a tool call output is going to be given less weight during an autocompaction, we just in time correct the model and still let it go off and do the complex work that we wanted to in our original prompt.
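The network-code guardrail described above can be sketched as a small static check. The talk's example is a `fetch` call in a JavaScript codebase; the version below is a Python analogue, written under that assumption, that flags `requests` calls with no `timeout=` argument. It covers only the timeout half of the rule, and the runbook path and function name are hypothetical. The error message is deliberately one short line that names the fix and points at a runbook, so it still reads well if the agent truncates the tool output.

```python
import ast

# Hypothetical runbook path; point this at wherever the team documents the rule.
RUNBOOK = "docs/runbooks/network-calls.md"
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "request"}


def find_calls_missing_timeout(source, filename="<string>"):
    """Flag `requests.<method>(...)` calls that do not pass a `timeout=` keyword."""
    found = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
            and func.attr in HTTP_METHODS
        )
        if not is_requests_call:
            continue
        if any(kw.arg == "timeout" for kw in node.keywords):
            continue
        # Keep the remediation on one line so it survives truncated tool output.
        message = (
            f"{filename}:{node.lineno}: requests.{func.attr}() has no timeout. "
            f"Pass timeout=<seconds>. See {RUNBOOK} for the retry policy."
        )
        found.append((node.lineno, message))
    # ast.walk does not guarantee source order, so sort by line number.
    return [message for _, message in sorted(found)]


if __name__ == "__main__":
    sample = "import requests\nrequests.get('https://example.com')\n"
    for line in find_calls_missing_timeout(sample, "client.py"):
        print(line)
```

Run as a test or pre-merge check, this surfaces the rule only when the agent actually writes an offending call, rather than spending context on it up front.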
And then after the run, we have a much easier task of determining whether or not the code, the diff, the artifact is aligned, because it's a static thing and we have static sets of guardrails. We can use very many LLM-as-judge reviewers to look at the code and evaluate it against a set or three of static guardrails (this is what it means to write reliable code, this is what it means to write performant React) and make a determination: is this good or bad? And if it's bad, why is it bad? Because the LLMs crave text, these LLM-as-judge reviewers can collaborate with the implementation agent over that PR thread, give more text back to the implementation agent, and further realign the proposed diff back to baseline.
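The shape of that review loop can be sketched without committing to any particular model API. In the sketch below, a "judge" is just a callable that receives a persona name, its guardrails, and a diff, and returns a verdict. The persona contents, the `Verdict` type, and `keyword_judge` (a keyword-matching stand-in used so the example runs offline) are all assumptions for illustration; a real setup would call a model where `keyword_judge` is.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Verdict:
    persona: str
    ok: bool
    reason: str


# A judge takes (persona name, guardrails, diff) and returns a Verdict.
Judge = Callable[[str, List[str], str], Verdict]

# Hypothetical review personas: short bulleted guardrails per concern.
PERSONAS: Dict[str, List[str]] = {
    "reliability": ["Every cross-service call sets a timeout and a retry policy."],
    "frontend": ["Data-loading components use Suspense."],
}


def review_diff(
    diff: str, judge: Judge, personas: Optional[Dict[str, List[str]]] = None
) -> Tuple[bool, str]:
    """Run every persona over the diff; return (accepted, comment for the PR thread)."""
    personas = PERSONAS if personas is None else personas
    verdicts = [judge(name, rules, diff) for name, rules in personas.items()]
    failures = [v for v in verdicts if not v.ok]
    comment = "\n".join(f"[{v.persona}] {v.reason}" for v in failures)
    return (not failures), comment


def keyword_judge(persona: str, guardrails: List[str], diff: str) -> Verdict:
    """Stand-in for a model call so the sketch runs offline. NOT a real judge."""
    if persona == "reliability" and "fetch(" in diff and "timeout" not in diff:
        return Verdict(persona, False, "fetch call without a timeout. Rule: " + guardrails[0])
    return Verdict(persona, True, "no issues found")


if __name__ == "__main__":
    accepted, comment = review_diff("+ const r = await fetch(url)", keyword_judge)
    print(accepted)
    print(comment)
```

The comment string is plain text on purpose: it is what gets posted back to the PR thread for the implementation agent to act on.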
So we've got AGENTS.md files that say where the context is and during what types of work the model might want to look at that text, but that are otherwise not prescriptive around any of the guardrails. We don't want to jam a ton of rules in here because we're going to chop up latent space too much. We're going to make it difficult for the model to spider through the codebase with creativity.
I find it very useful to point from this AGENTS.md to a curated set of review personas that are essentially bulleted lists of guardrails. And I find this really neat from the perspective of interfacing with the other humans on the team, because it is so cheap as a team to have a Slack conversation in a thread around what it means to fix that performance regression, and then @-mention the agent in it to say "yoink all of this and put up a PR that adds it to our static set of guardrails."
It's so cheap to continually refine and improve the output of our agents in that way.
I also think it's really neat to take that same sort of pattern and apply it toward documenting what your product features are, what the critical user journeys are, why your apps even exist, and what user problems they solve. All this context that we can give the agent helps ground it in what we are trying to do, why, and how our team thinks about working, because all of this is going to produce more and more aligned output.
In that messy middle, we can use tests on the structure of the files on disk, really blunt hammers around file line counts or whether or not snapshot tests exist, these very coarse-grained tools, in order to make the model do what we know is good.
Just requiring that every React component in our codebase has a snapshot test that gives 100% branch coverage means that the model is naturally decomposing these things, making them pure where possible, not doing prop drilling, and putting hooks close to where the data is used, because that makes it easier for it to fulfill the requirement that there must be snapshot tests. And we can assert this because it's free to produce the code that spiders through disk and matches up the snapshot test to the underlying component.
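The "code that spiders through disk" can be quite small. The sketch below walks a directory, expects each `*.tsx` component to have a Jest-style snapshot file next to it in `__snapshots__/<Name>.test.tsx.snap`, and also applies a blunt line-count limit. The directory layout, the 300-line threshold, and the `check_components` name are assumptions; it checks only that a snapshot exists, not branch coverage.

```python
from pathlib import Path

MAX_LINES = 300  # Assumed threshold; the talk does not give a number.


def check_components(root):
    """Return one actionable message per component that breaks a structural rule."""
    problems = []
    for component in sorted(Path(root).rglob("*.tsx")):
        if component.name.endswith(".test.tsx"):
            continue  # Test files are not components.
        # Assumed layout: snapshots live beside the component under __snapshots__/.
        snapshot = component.parent / "__snapshots__" / f"{component.stem}.test.tsx.snap"
        if not snapshot.exists():
            problems.append(
                f"{component}: no snapshot at {snapshot}. "
                f"Add {component.stem}.test.tsx with a snapshot test for this component."
            )
        line_count = len(component.read_text(encoding="utf-8").splitlines())
        if line_count > MAX_LINES:
            problems.append(
                f"{component}: {line_count} lines exceeds the limit of {MAX_LINES}. "
                f"Split it into smaller components."
            )
    return problems


if __name__ == "__main__":
    for problem in check_components("src"):
        print(problem)
```

Each message says what is missing and what to do about it, which is the property the talk asks of agent-facing checks.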
Another failure mode that I hear folks talk about a bunch is that these agents are doing type-shape probing all the time: "I end up with these anys or unknowns all over the codebase." The way I've approached it is to just statically disallow any function that has a type of any or unknown unless it's parsing input in a route handler or from the database.
Other than that, with ESLint, we just ban the existence of that type. We require the codebase to be 100% typed, which means all this bad behavior and weird type probing just kind of falls out, because we require 100% code coverage. These functions cannot possibly be exercised because the unknown types can't exist. And we get more aligned code, code that I would consider acceptable, high quality, maintainable, and all these other sorts of properties.
And having these failing checks tell the agent why they've failed and what to do instead means that it's able to self-heal.
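In a real TypeScript codebase this rule would normally live in ESLint, for example via typescript-eslint's `no-explicit-any` rule, as the talk indicates. Purely to illustrate the two ideas of "banned everywhere except at the parsing boundary" and "the failure says what to do instead", here is a deliberately crude regex-based sketch. The allowed directories and the `check_loose_types` helper are hypothetical, and a regex will miss or misreport cases that a real linter handles.

```python
import re

# Assumed locations where untyped input is legitimately parsed.
ALLOWED_DIRS = ("src/routes", "src/db")
# Matches simple annotations such as `x: any` or `body: unknown`.
LOOSE_TYPE = re.compile(r":\s*(any|unknown)\b")


def check_loose_types(files):
    """files maps a path to TypeScript source text; returns actionable messages."""
    problems = []
    for path, source in sorted(files.items()):
        if any(path.startswith(allowed + "/") for allowed in ALLOWED_DIRS):
            continue  # Boundary code may accept untyped input and parse it.
        for lineno, line in enumerate(source.splitlines(), start=1):
            match = LOOSE_TYPE.search(line)
            if match:
                problems.append(
                    f"{path}:{lineno}: type '{match.group(1)}' is not allowed here. "
                    f"Parse the value in a route handler or the database layer "
                    f"and pass a concrete type inward."
                )
    return problems


if __name__ == "__main__":
    sample = {"src/core/math.ts": "function add(a: any, b: number) {}"}
    for problem in check_loose_types(sample):
        print(problem)
```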
Ultimately, as we move into that third phase of review and merge, we want to treat the model as if it's another member of the team, and it needs to convince me to merge its code. I'm not shoulder surfing any of my teammates in VS Code or Vim, right? When they put up a PR and they attest that they tested the code, I take their word. And if I am unsure, I'll ask them to show me the logs from the staging deploy or to post a screenshot of them exercising the feature in the app. We can require these agents to do the same thing. This is a lot easier these days now that we have things like computer use and browser use. The Codex app is fantastic, highly recommend it. But even without that, vibing yourself up an XC connected headless display in a Docker container and wiring up FFmpeg to that stream to record a reproduction video is within reach, because I don't care about how gross this code is, and Codex is able to sling ffmpeg better than anybody in this room, probably.
On the back half of things, where we are looking for ways to accept the diff, I'm treating it again like I would my human teammates: benefit of the doubt, bias toward merge. What are the P2-and-above things that would be necessary for me to accept this code? Use the reviewer agents to surface those, get the coding agent to pick them up and implement them, get the reviewers to be happy, and off we go. And through this process I'm observing along the way which review feedback is regularly getting surfaced. Why is it making it to this part of the pipeline? Maybe I need to use that as a signal that I need to shift some of these guardrails left. Then I can spend my time, and the reviewer agents can spend their time, on more bespoke or more complex changes that we need them to look at.
Another thing that you should be thinking about doing as a team is how to systematize capturing all of this human feedback. Every review comment, every time you have had to interrupt the agent, every agentic intervention, every failed build, every exception in production: all of these in some sense are signals that context was missing for the implementation agent, that it did not consider the full end-to-end consequences of the code that it wrote and whether or not it would be successfully deployed.
And what we are trying to do, which I expect you'll learn about in the next talk, is slurp all this data up and dream over it every night, pointing a bunch of subagents at it, trying to distill whether there's anything that humans can do better in their prompting, whether there are missing guardrails that should exist in the codebase to disallow this behavior, and how we can get to a world where we're more and more headless, less dependent on human interrupts, and able to trust the agent to do more and more complex things.
I think vibe coding is a big part of what it takes to be successful here, because there's a ton of guardrails that only affect my local development process. This code can be gross, but it brings into possibility this idea that I don't need to care about parts of the software production function. This lets me operate like a group tech lead or an org lead, where I don't have visibility into every single engineer's activity on the keyboard. The things I care about are invariants, interfaces, and whether or not the components that they're producing do what they say on the tin with high reliability.
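The talk does not describe how that nightly pass is built. As a much simpler illustration of the underlying idea, that recurring feedback is a signal a guardrail is missing, the sketch below counts captured signals by category and reports the ones that keep coming back. The `Signal` type, the category labels, and the threshold are hypothetical, and labelling each signal with a category is assumed to happen elsewhere (by a human or a model).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str    # e.g. "review_comment", "interrupt", "failed_build", "prod_exception"
    category: str  # Hypothetical label such as "missing-timeout".


def recurring_categories(signals, min_count=2):
    """Return (category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(signal.category for signal in signals)
    return [(category, n) for category, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    captured = [
        Signal("review_comment", "missing-timeout"),
        Signal("failed_build", "missing-timeout"),
        Signal("interrupt", "any-type"),
        Signal("prod_exception", "missing-timeout"),
    ]
    for category, n in recurring_categories(captured):
        print(f"{category}: seen {n} times; candidate for a lint, test, or review persona")
```

Categories that cross the threshold are the candidates for shifting left into a lint, a test, or a review persona, so the same feedback is not given twice.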
And with that, I'll just leave it with: y'all can go build things. These tools are fantastic. Go get after it. I'll take some questions now. [applause]
>> Thank you so much.
>> Thank you.
>> Pow, that hand went up really fast. Hold on one sec. Want to grab that one?
>> Hello. Great talk. You mentioned earlier in your talk that you find you don't need to shift left as much as before; you stay more right. I'm curious about that, because isn't it better for agents to see something in a lint rule rather than as review feedback, for example? What do you mean by staying more right and not shifting left?
>> So I think once you put these structures in place to surface these requirements to the models at the right time, it becomes pretty easy to rely on them for the most part to autodiscover this stuff. It is very often the case that our AGENTS.md paints a picture of which guardrail files are relevant for which categories of changes (backend, working on the design system, these sorts of things), and the models will just naturally page those sets of persona-oriented guardrails into context, which means I very often just don't see patterns of misbehavior in that way. Only if, for example, guardrails are commonly required over tasks that span 15 context windows, and by then the context in those guardrail files has been autocompacted away, is that the sort of thing I would use as a signal that, okay, this is the thing that I need to shift left further on. But I do recognize that it is predicated on making sure that those autodiscovery functions are reliable.
>> Probably got time for one more here.
>> Oh yes.
>> Is there a practical implementation of the harness you've just mentioned, some end-to-end implementation of those capabilities before, during, and after?
>> I have started to bring some of these techniques to my open source work. I used to, long ago, build a Ruby interpreter in Rust called Artichoke.
There's a bunch of crates out of that work that I still actively maintain. Probably the most interesting one for you to take a peek at is rand_mt, artichoke/rand_mt. It's a Mersenne Twister implementation. I've been doing a lot of fun stuff exploiting automations in the Codex app to basically take my hands off the wheel for a ton of the maintenance tasks of this OSS work. I haven't quite gotten to putting those review agents in place yet, but it's coming.
>> Any final questions?
>> Nope. Okay. Uh, big round of applause to Ryan. Thank you so much.
>> Thank you everyone.