Codex goal mode lets autonomous agents work continuously on complex tasks by giving them a clear endpoint (a master PRD). Reliable overnight code generation, however, requires a comprehensive guardrail system that prevents agents from bypassing quality checks: strict TypeScript, ESLint with custom architectural rules, Biome for complexity limits, knip for dead code removal, lefthook pre-commit hooks, and a three-layer testing philosophy (mocked unit tests, integration tests, and end-to-end tests).
Deep Dive
Codex /goal Mode Ran 11 Hours While I Slept
I wanted to show you guys the new Codex goal mode that came out and how I'm using it, because it is the most fun I've had with Codex so far. I have a goal. You see this: pursuing goal, 11 hours and 26 minutes. I have had a run going overnight for 11 hours and 26 minutes.
For context on what I'm currently working on: it's something called agent runtime kernel, which lets me run Codex, Claude Code, Pi, and basically any agent harness in a secure micro virtual machine, which ensures that these agents cannot infect my main machine.
They can't bypass or breach it and become a security risk. I can safely run a lot of agent sandboxes on my Mac mini or my MacBook without worrying about the agents doing crazy things with my actual data.
So, keeping them isolated and secure. I built this whole thing with Codex. I'm really testing the capabilities of 5.5 extra high.
And my impression of 5.5 in Codex with goal mode so far is mind-blowing, dude. I'm actually finally working less because my agents just run for such a long time. My agents get to run in a reliable Ralph loop. Specifically, Codex gets to run. If you guys are familiar with the Ralph loop, goal mode is like a much more reliable version of that.
And I'm going to show you guys a quick workflow on how I'm using it, what I'm using it for, and some insights I've learned along the way, because this is really good. So first off, I currently have a goal to implement these two PRDs fully, right? So if we go into agent runtime kernel and open the master PRD it's currently working on.
You can see that, first off, look how long this PRD is.
This is 1,500... this is not a baby PRD.
Like, I don't play with my PRDs, bro. So you can see that I'm working with a PRD that is almost 1,500 lines.
For those of you who don't know what a PRD is, it's a product requirements document. Typically, product managers use it to give engineers a task. In this case, as the product manager for my agents, I'm giving them a very specific task. To produce these PRDs, I'll talk to Codex and say, "Hey, we need this goal." The point being that I have a specific feature that I need integrated into my agent runtime kernel. We'll call it ARC for short.
I have a very specific feature, or a list of features, that I need implemented in my project, right? So I have Codex research the code base and also do web search via Exa. And basically, for my request, I discuss it and work with Codex to generate a very detailed PRD on what we need.
Right? Basically, goals and non-goals, everything to keep it focused on the task at hand. And then, once I have the master PRD, which is a very detailed file, I'll go into Codex and make a goal, and the goal is "implement this PRD fully, please."
That's the full goal. What goal mode does is this: every time Codex compacts, goal mode will keep going, keep pinging, and keep urging the Codex agent to stay focused on this task and literally not stop until the task is fully complete.
So, by having a master PRD that's very detailed and a goal saying to implement the PRD, every single time Codex compacts, it can just go and read the PRD and get fully up to date on what it's doing. And Codex is really good at compaction and summarization. I don't know how they do it so well. I've never actually read the compaction, but just the way it's able to continuously keep itself on track, the compaction by itself is already really, really good, right? Basically, goal mode summed up is: have a feature, have a goal, and more importantly have a solution. The whole point of a goal is that there's an end to it. So, if you don't have an end to your goal, then your agents are just going to go on and on forever.
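To make the "a goal needs an end" idea concrete, here is a minimal sketch of a loop that re-reads the PRD on every turn and stops once every acceptance criterion is checked off. This is not how Codex implements goal mode internally; the checkbox format, `parseCriteria`, `isGoalComplete`, and `runGoal` are all assumptions made up for illustration.

```typescript
// Hypothetical sketch: none of these names come from Codex itself.
interface Criterion {
  text: string;
  done: boolean;
}

// Assumes the PRD tracks acceptance criteria as markdown checkboxes.
function parseCriteria(prd: string): Criterion[] {
  const criteria: Criterion[] = [];
  for (const line of prd.split("\n")) {
    const match = /^\s*[-*] \[( |x|X)\] (.+)$/.exec(line);
    if (match) {
      const mark = match[1] ?? " ";
      const text = match[2] ?? "";
      criteria.push({ text: text.trim(), done: mark !== " " });
    }
  }
  return criteria;
}

// A goal needs an end: here, "every criterion is checked off".
function isGoalComplete(criteria: Criterion[]): boolean {
  return criteria.length > 0 && criteria.every((c) => c.done);
}

// Re-read the PRD on every turn, the way an agent would after a compaction,
// and stop as soon as the end condition holds or the turn budget runs out.
function runGoal(
  readPrd: () => string,
  workOn: (next: Criterion) => void,
  maxTurns: number,
): { complete: boolean; turns: number } {
  for (let turn = 0; turn < maxTurns; turn++) {
    const criteria = parseCriteria(readPrd());
    if (isGoalComplete(criteria)) return { complete: true, turns: turn };
    const next = criteria.find((c) => !c.done);
    if (!next) return { complete: false, turns: turn }; // no criteria: no defined end
    workOn(next);
  }
  return { complete: isGoalComplete(parseCriteria(readPrd())), turns: maxTurns };
}
```

A PRD with no checkable criteria never counts as complete here, which mirrors the point that a goal without an end just runs forever (or, in this sketch, until the turn budget is spent).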
But in this case, what is the end of the goal? The end of the goal is that this PRD is fully implemented and working in my code base, right? Now, in terms of how the agent can reliably code and not write [ __ ] code: the agent is not writing slop, and it can actually test the code, run it reliably, and prove that the PRD is implemented. You can see that everything related to the agent is in individual folders, right? Like I have my runtime bundle and my render, my policies, my network, the kernel, the harnesses, the core, the CLI integration, the authorization, the authentication, the different adapters we have for Claude Code, Codex, Pi. Just note, number one, how well organized and well structured the core foundation of the code base is.
That's very important when it comes to agents, because agents go through your code base by grepping and searching, right? So, if you have everything nicely organized and structured, with a clear separation of functionality in your code base, it's really easy for agents, especially Codex, to understand and navigate it. So, that's number one. The most important thing is to have a very good structure for your code base in general.
And now, going a bit deeper into it, every single one of my code bases will have ESLint. So I want to talk about the safeguards and restrictions that are programmatically in my code base, which prevent agents from bypassing them and writing bad code. So, number one, I like using TypeScript in strict mode because it makes the types strict. I don't allow the code to even build if any of the types are weird or wonky, and agents do really well at making sure that types are actually respected and deeply checked. I also integrate Biome into my project, which is a very fast formatter and linter. Think of it as ESLint but for the basics. It basically ensures that we have nicely structured, well-written code.
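As a small illustration of what strict mode buys you, the sketch below shows two patterns the compiler enforces when `"strict": true` is set in `tsconfig.json`. The `SandboxConfig` and `RunResult` types and the 512 MB default are hypothetical, invented for this example rather than taken from the project.

```typescript
// Hypothetical types for illustration; not taken from the project.
interface SandboxConfig {
  name: string;
  memoryMb?: number; // optional, so its type is number | undefined
}

// Under "strict": true (which enables strictNullChecks), returning
// config.memoryMb directly would be a compile error because it may be undefined.
function memoryLimitMb(config: SandboxConfig): number {
  return config.memoryMb ?? 512; // hypothetical default
}

// Strict mode also rejects implicit `any`, so parameters need real types.
// A discriminated union lets the compiler check that every case is handled.
type RunResult =
  | { status: "ok"; exitCode: 0 }
  | { status: "failed"; exitCode: number; stderr: string };

function describeResult(result: RunResult): string {
  switch (result.status) {
    case "ok":
      return "sandbox exited cleanly";
    case "failed":
      return `sandbox failed (${result.exitCode}): ${result.stderr}`;
  }
}
```

If an agent adds a third `status` variant and forgets to handle it, `describeResult` stops compiling, which is the kind of build-time refusal described above.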
Custom ESLint plugins. What I do with ESLint specifically is have my agents code custom architectural guardrails that enforce the architectural decisions I make at the start of the code base. So when I'm greenfielding, I make sure to set up the code base with custom ESLint, Biome, and a very strict testing philosophy. In this case, I made the design, and I showed you guys the architecture earlier, right? And how it's set up. If you look closely, you'll notice something: all these files are pretty small.
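One common kind of architectural guardrail is an import-boundary check. The sketch below is an assumption about what such a rule might look like, not the rule used in the video: the folder names echo ones mentioned earlier, but the `LAYER_RANK` ordering and the `src/` layout are invented. In a real custom ESLint rule, logic like `findLayerViolation` would typically run inside an `ImportDeclaration` visitor and report through `context.report`.

```typescript
// Hypothetical layer order: lower-ranked folders may not import from higher ones.
// Folder names echo the ones mentioned in the video; the ranking is invented.
const LAYER_RANK: Record<string, number> = {
  core: 0,
  kernel: 1,
  adapters: 2,
  cli: 3,
};

// Return the top-level folder under src/, e.g. "src/kernel/vm.ts" -> "kernel".
function layerOf(path: string): string | undefined {
  const match = /(?:^|\/)src\/([^/]+)\//.exec(path);
  return match ? match[1] : undefined;
}

// Return a message when `fromFile` imports from a higher layer, else null.
function findLayerViolation(fromFile: string, importedFile: string): string | null {
  const from = layerOf(fromFile);
  const to = layerOf(importedFile);
  if (from === undefined || to === undefined) return null;
  const fromRank = LAYER_RANK[from];
  const toRank = LAYER_RANK[to];
  if (fromRank === undefined || toRank === undefined) return null;
  return toRank > fromRank ? `"${from}" must not import from "${to}"` : null;
}
```

Files outside the known layers are ignored rather than flagged, so the check only fails on boundaries that were explicitly declared.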
So, none of the files are over 500 lines of code. Everything is independent, and this architecture, this design choice, is actually enforced via code. Agents are not allowed to complete their goal or complete the code unless it 100% follows the strict guidelines and guardrails I've set on my code base. There are a couple of things I could go over, but I like having a centralized logger, which lets agents run the code and get the best output from it. I have a custom harness that makes sure agents can't skip a test. You know, agents are kind of sneaky. They'll try to skip a test in order to get past the pre-commit hook, but in this case I programmatically deny them from even skipping a test.
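Both of those checks (the 500-line ceiling and the ban on skipped tests) are simple enough to sketch as a pure function. This is an illustration under assumptions, not the custom harness from the video: `checkFiles` and `Violation` are hypothetical names, and the patterns cover the skip and focus markers used by common JavaScript test runners such as Jest and Vitest.

```typescript
const MAX_LINES = 500; // the file-size ceiling mentioned in the video

// Patterns that silently disable or narrow tests in common JS test runners.
const SKIP_PATTERNS: RegExp[] = [
  /\b(?:describe|it|test)\.(?:skip|only|todo)\b/,
  /\b(?:xit|xtest|xdescribe)\s*\(/,
];

interface Violation {
  file: string;
  reason: string;
}

// `files` maps a path to its source text, so the check stays pure and testable.
function checkFiles(files: Record<string, string>): Violation[] {
  const violations: Violation[] = [];
  for (const [file, source] of Object.entries(files)) {
    const lines = source.split("\n");
    if (lines.length > MAX_LINES) {
      violations.push({ file, reason: `${lines.length} lines exceeds ${MAX_LINES}` });
    }
    lines.forEach((line, index) => {
      if (SKIP_PATTERNS.some((pattern) => pattern.test(line))) {
        violations.push({ file, reason: `skipped or focused test on line ${index + 1}` });
      }
    });
  }
  return violations;
}
```

A script that fails the commit whenever `checkFiles` returns anything is enough to turn these conventions into hard rules.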
So, in Biome, I use rules such as noExcessiveCognitiveComplexity, which means you can't have these god functions.
Everything has to have split and separated functionality.
So enforcing this in Biome, like I just showed, prevents agents from even committing the work unless they break it down further and really make sense of how to write proper functions in the code base. And as you start establishing these patterns at the foundation, it's easy for agents to build on top of them.
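Here is a rough before-and-after of the kind of split that a cognitive-complexity limit pushes toward. The `NetworkRequest` shape and the allow-list policy are hypothetical, chosen only to give the functions something to decide.

```typescript
// Hypothetical example; the request shape is invented for illustration.
interface NetworkRequest {
  host: string;
  port: number;
  sandboxed: boolean;
}

// Before: nested branches pile up cognitive complexity in one function.
function allowNested(req: NetworkRequest, allowedHosts: string[]): boolean {
  if (req.sandboxed) {
    if (allowedHosts.includes(req.host)) {
      if (req.port === 443 || req.port === 80) {
        return true;
      } else {
        return false;
      }
    } else {
      return false;
    }
  } else {
    return false;
  }
}

// After: each rule is a small named function, and the policy composes them.
const isWebPort = (port: number): boolean => port === 443 || port === 80;
const isAllowedHost = (host: string, allowedHosts: string[]): boolean =>
  allowedHosts.includes(host);

function allow(req: NetworkRequest, allowedHosts: string[]): boolean {
  return req.sandboxed && isAllowedHost(req.host, allowedHosts) && isWebPort(req.port);
}
```

Both versions return the same answers; the second is simply easier for a human or an agent to grep, read, and extend.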
And then finally, I have knip. knip just deletes dead code, so our agents always run knip after builds to make sure there's no dead code, which could confuse agents when they're searching and grepping through the code base. So I make sure there's no dead code whatsoever.
And finally, we have the lefthook pre-commit hook. So, if Biome doesn't pass, if ESLint doesn't pass, or if the type checks don't pass, the commit is blocked. Sometimes agents will try to be sneaky and `--no-verify` it, but you can just disable that. So agents literally, programmatically have to follow the guardrails and architectural decisions I've set up and enforced in this code base before they can even come back to me and be like, "Hey, I completed it." No, "completed" means they actually completed it, right?
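A pre-commit gate like this boils down to "run every check, stop at the first failure." The sketch below shows that logic with the command runner injected so it can be exercised without the tools installed. The `CHECKS` list and `runGate` are assumptions: the commands are the usual CLI entry points for each tool, but the exact commands and ordering used in the video are not shown.

```typescript
// Commands assume the tools are installed locally; adjust to your project.
const CHECKS: string[] = [
  "npx tsc --noEmit",
  "npx biome check .",
  "npx eslint .",
  "npx knip",
];

// A runner executes one command and returns its exit code.
type Runner = (command: string) => number;

// Run every check in order and stop at the first failure.
function runGate(checks: string[], run: Runner): { ok: boolean; failed?: string } {
  for (const command of checks) {
    if (run(command) !== 0) return { ok: false, failed: command };
  }
  return { ok: true };
}
```

In a real hook, the runner could wrap `spawnSync` from `node:child_process` and return its `status`, and the script would exit non-zero when `ok` is false so that lefthook rejects the commit.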
And now the most important part is testing. So, the full test suite. I have unit tests, which are mock tests, super quick tests.
I have integration tests, which spin up a real SQLite database, actually integrate with it, and make real database calls while the code is running, so that we can verify the actual database functionality the code base is using. It's a great way to test migrations. And then I have real end-to-end tests, where agents spin up the actual code base, set it up, and set up the host in Docker.
And then they basically simulate a real production environment on my local MacBook for development, so that they can actually interact with the app itself. And Codex does great at manual end-to-end testing, interacting with the API, and creating tests for itself. Now, a big problem with early agents is that they would make really bad tests, right? Or they would just make these fuck-ass mock tests that literally don't do anything. It's like, "Oh, if 1 + 1 = 2, pass this test." Like, no. That's a useless test. So, I created a testing philosophy where we explain exactly what we're mocking, what we're not mocking, and what tests are supposed to look like.
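The difference between a useless test and a useful unit test can be shown in a few lines. In this sketch, `isPingDue` and the `Clock` interface are hypothetical stand-ins for real project code: the useful test fakes only the clock, which is the one genuinely external dependency, and lets the real logic run.

```typescript
// Hypothetical unit under test: decides whether a scheduled ping is due.
interface Clock {
  now(): number; // milliseconds since epoch
}

function isPingDue(lastPingMs: number, intervalMs: number, clock: Clock): boolean {
  return clock.now() - lastPingMs >= intervalMs;
}

// Useless test: it asserts arithmetic, not project behaviour.
function uselessTest(): boolean {
  return 1 + 1 === 2;
}

// Useful unit test: only the clock is faked; the real isPingDue logic runs,
// including its boundary condition.
function pingDueTest(): boolean {
  const fakeClock: Clock = { now: () => 10000 };
  const beforeInterval = isPingDue(9500, 1000, fakeClock); // 500 ms elapsed
  const atInterval = isPingDue(9000, 1000, fakeClock); // exactly 1000 ms elapsed
  return beforeInterval === false && atInterval === true;
}
```

The first test keeps passing no matter what happens to the code base; the second fails if someone changes `>=` to `>`, which is the property a testing philosophy should spell out.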
So, we have three layers, right? We have layer one, which is the unit tests. Then we have layer two, integration tests, which are actually hitting our database, actually spawning CLIs, actually doing operations. And then we have layer three, which is real end-to-end testing: actually run the code, test the code, talk to the agents themselves, make sure the tools work, make sure it compacts, make sure the schedule pings, make sure the agent responds to schedule pings, etc. So, when we write these tests as the code base grows and enforce that tests pass before the agent even gets to commit, your job gets a lot easier as a dev. And these are the kind of guardrails you really need if you want to start running autonomous engineering factories without worrying about your agents making [ __ ] code, because they literally can't.
So, it makes your job as a human, reviewing and doing PM, a lot easier.
When we combine all of this with something like goal mode in Codex, and we give it a master PRD, and we have all these guardrails in place, and we give it the goal to fully implement these guardrails, you can start to see how these all work in conjunction to... Oh my god, it just finished.
It just finished while I'm recording live.
So, it said that it updated both master PRDs. Oh, no, it didn't finish. It's just updating the PRDs with the research it's doing, because it did real end-to-end testing and found problems along the way as it was actually developing. So, that's another great example. It's not finished.
It's still pursuing the goal. All right.
So, this is insane. This is great. I'm happy I got this on video. This is another example of how to really utilize goal mode: it just updated its own plan. Like, this is insane. I have to leave my computer, right? I've got to move this to my mini so it can just run 24/7, because dude, I could go and just lay down at the beach and piss off while my agents just work.
But what you can see here is that the agent was doing real end-to-end testing, is writing its own reports, and is using sub-agents to do research. And when it does real end-to-end testing, it's finding problems that we couldn't anticipate. All right. Because a lot of engineering is pursuing something or designing something, actually implementing it, running into a problem that you didn't think about because it was pretty much impossible to think about, and then discovering a new solution, maybe going back.
But because we have all those guardrails, it's able to have this feedback loop where it's going back and forth. It's like auto research, if you guys know about that, but on steroids.
On steroids. Yeah.
So, this is how I'm running goal mode right now. Test results are going amazing. If you guys have any questions, ask in the replies to the video. And make sure you join my Discord, my AI engineering Discord, which is on my personal site at butoshi.ai.
So far, this is insane. It is saving me a ton of time and allowing me to pursue even more projects in parallel, because with the tests and the guardrails implemented, the agents in goal mode are running for hours, dude. So I'm getting a lot more time on my hands.
And in my life, just because the agents, or at least Codex specifically, will pursue the goal. And it won't stop pursuing the goal until it's done.
And in our case, done means done properly, because of our guardrails. So I hope this gives you guys a general idea of how I'm using goal mode and how you can implement it in your own code base. And yeah, this is cracked. This is disgusting. This is gross and bold.