AI coding agents like Codex require proper context, validation tools, and verification systems to maintain software quality; effective AI-assisted development involves providing agents with comprehensive context from multiple sources, implementing robust validation tools (linters, test suites, CI systems), and creating feedback loops where quality issues are systematically addressed to improve agent performance over time.
[Korean subtitles] OpenAI @ Replay 2026 | How is OpenAI changing the way it builds software with Codex?
[music] All right everyone, good morning. We are so excited to have Dom with us today. He works on Codex at OpenAI, focusing on developer experience, and he's here to share more about how Codex is changing the way they build over there. So without further ado, please join me in welcoming Dom to the stage. Awesome.
[applause] Thanks, everyone, for coming. I really only came up with a better title for this last night, so thank you for coming even though it was an arguably vague title. But I want to start off with a quick raise of hands.
How many of you still write at least 50% of your code by hand? All right, that's not a lot. 40? 30? 20? 10? All right. How many people don't write any code by hand anymore? Nice.
Um, awesome. So, why did my clicker stop now? See, there we go. It's back.
So, as you probably know, AI-enabled coding has changed a lot over the last couple of years, and we've really reached a point where code is no longer the bottleneck. You can write a lot of code, but how do we actually make sure that what is being built is good and that we're shipping quality software? Because the models have gotten increasingly better: not just more intelligent, but also more capable.
And we're getting to the point where the models are really creating a flywheel, where Codex plays a significant role in how we're actually building and training the models. It's contributing directly to its own development, basically. And so we're seeing a rapid rise, and we've moved... here we go, I don't know why it's flaky. AI is no longer just autocomplete, or even pair programming like last year.
We now have AI that can actually work on complicated tasks for hours or days, bringing them fully end to end to production.
And with that, the bottleneck has moved from code generation, or writing code, to making sure that your company sets up the right context, validation, and verification harnesses for the agent to actually be able to do real work. So I want to talk to you about a couple of learnings from how we, and especially I, use Codex at OpenAI, and how you can apply some of those same things to your own work. Because at OpenAI, 80% of all employees use Codex weekly. And that's not just engineers. That's sales, marketing, finance, comms.
Everyone is using Codex at least once a week for their work. And if you ask a lot of them, especially non-engineers, they can't give it up anymore. But we still want to make sure that we're actually shipping quality software while we're doing that. So we had to learn a lot of things and adjust things as AI was speeding up our development.
I'm just going to move closer here because this clicker is giving me.
There we go. So with that, how do we actually get the most out of coding agents like Codex so that we're shipping quality software? The first thing to think about is really starting from first principles. You wouldn't hire someone and just have them figure things out. We normally want to set someone up for success and equip them with the right things to be able to do the job. We need to do the same for coding agents. So we need to think about the context: what are the details that the agent actually needs to have to be able to do the job?
How does the agent verify that it's doing the job correctly? One of the big things that makes an agent different from a traditional chatbot is that it is able to actually perform tasks and verify its own work. And then how do you actually make sure that what the agent did works, and that it did the right thing? I want to start off with context, because I think arguably this is one of the most important things to get right. And I'm not talking about context engineering. I'm not talking about you meticulously generating a giant prompt, with 20 different tools putting things together, so that you copy-paste this giant message in. Because you wouldn't want to do the same thing with a colleague either, right?
You don't want to have to dictate to your colleague every little thing and where to look.
You're expecting them to learn and understand where to grab the right information, but then also, arguably, to work on their own.
And one of the things about Codex is that it has been great at context gathering pretty much since its inception. We really focused on Codex being good at navigating large codebases, since we have a large codebase of our own and we're trying to make sure that we can leverage Codex well. So when you give Codex a task, it normally goes in and starts gathering context first, before it dives into building anything. If you've ever used it before, you might have seen that a lot of time passes before the first line of code gets written. It's doing a lot of thinking and navigating files. And that's because it's doing the same thing that an engineer would do when diving into a new codebase. It tries to understand: what's the overall context that I'm working in?
What are the files that already exist? What are features that I can use?
How is the codebase structured overall, so that I can put the code in the right place and not just dump it into the first location that comes to mind?
So what you should think about is: if you had a new engineer join the team, would they actually be able to write code up to your standards purely by jumping into the codebase? Meaning, do you have the right context around what tools and libraries you're using? Are there the right pieces of information around how you structure your codebase, and so on?

But a lot of context doesn't just live in the codebase. When we build software as engineers, we normally spend more time outside of the code editor, especially these days, than inside it. And so at OpenAI, we really try to make sure that every piece of context that the agent could need to build a feature is accessible to it. In my case, that means I have Codex connected to Linear, to Sentry, and to our CI system, but also to non-coding-related systems like Figma, Gmail, my calendar, my Slack, and my Notion. Everywhere there's potential context about what we're building, it has access to it, and it can use that context not just for coding-related tasks.

In this case, let me actually show you the real version here. Last night I just asked Codex, "Give me a rundown of everything I need to know about tomorrow's event." I didn't specify what the event was. Instead, it was able to use my Gmail, my calendar, and my Slack to get context on what event I was talking about and what pieces of information it actually needed to know, like what the location is.
If we scroll down a bit here, it even figured out that my ticket was in my email.
It figured out that I had been talking to Stu from the Temporal team about the event, and that I had to bring certain things. So it has all of this context that we can leverage to continuously do what I call vague prompting. In this one, I was in a meeting with my colleague Katya, and she reminded me that I owed her a blog post. And I was like, "Okay, Katya tagged me somewhere in Slack about this dev blog post idea. I don't remember what it is. I have a vague idea. Make sure that you're connected to what I published on X, and also some of my recent work. And use my voice and my style." And because Codex has been working with me on all of these different things, it was actually able to figure all of this out. It found the right spreadsheet by looking through Slack. It figured out what the topic was. It looked up on my file system the different repositories that I've been working on. It knew my writing style from all of my X posts that I had put in. And so it was able to pull this task off and give me a great draft to work from.
But you can also bring Codex into the context. This is the next step, and it often helps me kick things off faster than ever before.
We made it a point to have Codex be available in our Slack, in our Linear, and in our GitHub. So I can go and tag Codex directly in a conversation. Here, for example, a colleague had called out an issue in our documentation. I just asked Codex to look into it and update it accordingly, and Codex figured it out. I didn't have to catch up on the context.
Everything was in there. And these days, I think some of my colleagues might be getting annoyed by this, but sometimes you're just going to get a Slack message from me like that, asking Codex to figure it out, and it normally has all the context directly in the Slack message anyway.
The third part of that equation, which we recently added, is memories, because Codex shouldn't have to find these pieces of information again every time. It should learn how you're actually working. So we released memories as an experimental feature a few weeks ago. If you enable that, Codex can learn how it had to solve things previously and where it had to find information. It can still use live resources like your actual CI system, your Slack, and your email to pull the latest context in, but it will know where to look, or how to debug that thing that it ran into previously.

On top of that, we recently added Chronicle as an additional, optional way for you to not just monitor what is in your actual Codex conversations, but also monitor what is happening on your screen and learn from that, so that Codex can learn what other tools you are using. In this case, for example, Codex knew that I was working on a docs draft in a Google Doc. So it knew where to find that, looked it up, and found the actual source on my Google Drive. It also knew which Roman I was referring to when I asked it to message Roman later on, and it knew that it had to use Slack for that.
Great. So now you've given Codex all of this context, and it knows what to build, but how does it actually make sure that it's building the right thing? If you've spent your time investing in great developer experience, this is finally going to pay off, because Codex will use the same tools that you're already using as an engineer to verify its work. It's been trained on verifying its own work. And so if you already have, come on, if you already have some of these tools and you invest in that validation, you're going to be able to have these agents work for hours, if not days, on hard problems, continuously trying to make sure that they're actually reaching the target of the problem that you gave them to solve.
There we go. Apologies for that. To give you an idea, the most baseline things are the compilers, formatters, test suites, and linters that you already have on your system. But that doesn't have to be limited to the traditional linters that you might have used in the past. A linter can really be anything that helps the agent verify, in a deterministic way, whether the work is done correctly. For example, for our developer documentation we have a linter called Vale that imposes our style guidelines onto the documentation. So if Codex is drafting new documentation, it can verify that the draft adheres to the style guidelines that we have defined.
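A minimal sketch of "a linter can be anything deterministic", assuming two made-up style rules. The rule names, messages, and the `lint` and `exit_code` helpers are hypothetical and are not the configuration the speaker's team uses. The point is only that the same input always produces the same findings and a clear pass or fail signal.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    line: int
    rule: str
    message: str


# Hypothetical style rules for illustration; a real style guide defines its own.
RULES = [
    ("no-simply", re.compile(r"\bsimply\b", re.IGNORECASE),
     "Avoid 'simply'; it understates difficulty for the reader."),
    ("no-double-space", re.compile(r"\S  +\S"),
     "Use a single space between words."),
]


def lint(text: str) -> list[Finding]:
    """Return every rule violation in line order. Same input, same output."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern, message in RULES:
            if pattern.search(line):
                findings.append(Finding(number, rule, message))
    return findings


def exit_code(findings: list[Finding]) -> int:
    """0 when clean, 1 otherwise, so an agent or a CI job can branch on it."""
    return 1 if findings else 0
```

An agent drafting documentation could run a check like this after every edit and keep iterating until `exit_code` returns 0.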
And with these tools, speed now matters more than before. As an engineer, investing time to move from one tool to another because it's slightly faster has often been seen as a nice-to-have. But if you have an agent that runs these tools regularly as it's going in a loop, the time accumulates and speed matters more. So in our case, for example, we moved a lot of tools to faster, more native alternatives, like moving from Prettier to Oxfmt, which is written in Rust, to gain additional speed, or moving from TypeScript 6 to the early version of TypeScript 7, which is written in Go, because these things ultimately matter more than before.
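A back-of-the-envelope sketch of why tool speed accumulates in an agent loop. All numbers here are hypothetical and were not given in the talk; `loop_overhead_seconds` is an illustrative helper, not part of any tool.

```python
def loop_overhead_seconds(tool_seconds: float, runs_per_task: int, parallel_agents: int) -> float:
    """Total tool time spent across all agents for one round of tasks."""
    return tool_seconds * runs_per_task * parallel_agents


# Hypothetical numbers: a 4 s formatter vs. a 0.5 s one,
# run 30 times per task by 6 agents working in parallel.
slow = loop_overhead_seconds(4.0, 30, 6)
fast = loop_overhead_seconds(0.5, 30, 6)
print(f"slow: {slow:.0f} s, fast: {fast:.0f} s, saved: {slow - fast:.0f} s")
```

Under these assumed numbers the slower tool costs 720 seconds of waiting per round against 90 seconds for the faster one, which is the kind of difference a single engineer running the tool a few times a day would rarely notice.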
Once you have these agents working on these tasks for longer and longer, though, you probably don't want to sit around and spin in your chair.
You're going to start parallelizing things, giving more agents more tasks. And so suddenly you have all of these agents that have to verify that what they're working on actually works. So you need to be able to spin up parallel environments and think about how your agent can run a full environment locally, or in some other instance if you have to deploy it, but basically be able to fully verify its own work while other agents might be working on other work trees or other checkouts of your codebase. Since we're at Temporal, that might mean, for example, agents being able to spin up their own workers and their own dev server, making sure that they're not running into each other and can verify their work.
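One way to picture isolated parallel environments is a small planner that hands each agent its own checkout, port, and worker queue. This is a sketch under stated assumptions: the `AgentEnv` type, the `agent/<id>` branch naming, the port scheme, and the `dev-<id>` task queue names are all hypothetical. The only real command referenced is `git worktree add -b <branch> <path>`, and the sketch builds the argument list without running it.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AgentEnv:
    agent_id: str
    worktree: Path
    branch: str
    dev_server_port: int
    task_queue: str

    def worktree_command(self) -> list[str]:
        """The git invocation that would create this agent's checkout."""
        return ["git", "worktree", "add", "-b", self.branch, str(self.worktree)]


def plan_environments(agent_ids: list[str], root: Path, base_port: int = 3000) -> list[AgentEnv]:
    """Give every agent its own checkout, dev server port, and worker task queue."""
    if len(set(agent_ids)) != len(agent_ids):
        raise ValueError("agent ids must be unique")
    return [
        AgentEnv(
            agent_id=agent_id,
            worktree=root / agent_id,
            branch=f"agent/{agent_id}",
            dev_server_port=base_port + offset,
            task_queue=f"dev-{agent_id}",
        )
        for offset, agent_id in enumerate(agent_ids)
    ]
```

Because nothing is shared between the planned environments, one agent restarting its dev server or worker should not disturb another agent's verification run.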
We also have more advanced things, especially if you're working on UI, like giving the agent the ability to take screenshots, use accessibility tools, or even use things like computer use to click through the UI and verify different flows. In fact, for Codex, we actually ended up adding a lot of these capabilities.
Come on. There we go. Adding a lot of these capabilities directly into Codex. And please don't tweet this slide for 10 more minutes, [laughter] because in addition to the in-app browser and computer use, which allows you to have Codex work in the background on any app on your Mac, we're also launching in 10 minutes a Chrome extension that will allow Codex to control your actual Chrome browser and spin up different tabs in the background. So you can continue to work in parallel and use your browser, but have Codex use that same context to test apps, spin up multiple tabs, do research, and verify that things are working.
The other part of validation, though, is that it's not just about giving errors and having Codex continue to run the tool until there are no errors.
This is actually a great opportunity for you to teach Codex, iterate, and have it fix its own problems. So if you have linters or other CI tools that are verifying work, make sure that they're actually providing helpful information back to the agent.
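To make "helpful information back to the agent" concrete, here is a small sketch of a failure message that says what broke, where, and what to do next. The `Diagnostic` fields and the output layout are assumptions chosen for illustration, not the format of any particular linter or CI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    path: str
    line: int
    rule: str
    problem: str
    fix_hint: str


def render(diagnostic: Diagnostic) -> str:
    """Format one failure so the reader learns what broke, where, and what to do next."""
    return (
        f"{diagnostic.path}:{diagnostic.line}: [{diagnostic.rule}] {diagnostic.problem}\n"
        f"  fix: {diagnostic.fix_hint}"
    )


def report(diagnostics: list[Diagnostic]) -> str:
    """Join all failures into one message; an empty list means the check passed."""
    if not diagnostics:
        return "ok: no issues found"
    return "\n".join(render(d) for d in diagnostics)
```

Compared with a bare "check failed", a message in this shape gives the agent a location and a remediation to act on, so it has less to guess on the next iteration.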
So once you have all of that done and you have the agent write more and more code, you're ultimately going to end up with more PRs. So how do we review these PRs and make sure that things are actually good, so that we don't continuously lower the bar and let slop get past it?
One of the big things for us, and this might sound ironic, is that everything starts with Codex actually reviewing the code. We're running 100% of our pull requests through Codex's review feature directly on GitHub. The general philosophy is that you're not going to ask your colleague to review a pull request until it's actually passing all of the review checks. And we made sure that Codex review is intentionally not noisy. You're not going to see it nitpick a million different things, because we wanted it to call out the important things. It starts with a blank slate. It has its own environment to verify hypotheses that it might have during review. And so it's going to highlight critical issues for you, so that you can still rely on it without it being too noisy.
If you do feel like Codex is not catching certain things, though, you can always set your own review bar. For example, with the developer documentation we noticed that things like typos weren't being flagged, because realistically, for a codebase itself, it's normally not such a big issue as long as the code compiles and the tests pass. In our case, though, for documentation we do care about it. So we called out that in anything that is actual content, it should flag spelling errors and grammar issues as actual P0s and P1s. And similarly, you want to encode that as a feedback loop in how you're building with Codex, so that if you're seeing things regularly getting past Codex code review, or Codex is making the same mistake again and again, you build these things back into either your review or even your validation tools, so that Codex will make sure these things are not getting past it again.

I also think CI flows are more important than before, and especially a complete CI flow. Where speed might have mattered more in the past, I think what matters now is having a really robust CI flow that tests a lot of situations, and that you continuously invest in, adding tests if things make it into prod that weren't supposed to. Because ultimately the speed part doesn't matter, since you can have Codex monitor its own PRs. What we often do is have babysitting skills: once Codex finishes a PR, we push it up and we ask Codex to keep an eye on it. It will see whether CI actually succeeded or not. If it runs into issues, it's going to go and fix them, and then keep going on that loop until everything passes. I've even had it keep an eye on code review comments and continue to fix those.

Come on. Here we go. One of the things that I didn't use to be as obsessed about, but that has become a non-negotiable for me, especially with any front-end web app that I'm building, is deploy previews. Being able to verify that something works without ever having to check out the system locally and test it there has become hugely important for us. It has drastically sped up development, to the level where now I can tag Codex in an issue that was found on Slack, have Codex fix it and push up a PR, review the PR, open the deploy preview, ask a colleague to review it, and feel confident about that merge, all from my phone, without ever having to grab my laptop.
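The babysitting loop described above (push a PR, watch CI, fix failures, repeat until green) can be sketched as a small polling function. This is an illustration under stated assumptions, not how Codex skills are implemented: the status strings and the injected `get_ci_status`, `attempt_fix`, and `wait` callables are hypothetical stand-ins for whatever a real CI system and agent would provide.

```python
from typing import Callable

# Hypothetical status strings; a real CI system has its own vocabulary.
PASSED, FAILED, PENDING = "passed", "failed", "pending"


def babysit(
    get_ci_status: Callable[[], str],
    attempt_fix: Callable[[], None],
    wait: Callable[[], None],
    max_checks: int = 20,
) -> bool:
    """Poll CI until it passes, pushing a fix after each failure. Return True on success."""
    for _ in range(max_checks):
        status = get_ci_status()
        if status == PASSED:
            return True
        if status == FAILED:
            attempt_fix()  # e.g. read the logs, patch the code, push a new commit
        wait()  # back off before the next check, whether pending or just fixed
    return False  # stop after max_checks and hand the PR back to a human
```

In practice `wait` could be something like `lambda: time.sleep(30)`. The `max_checks` cap is a design choice in this sketch so that a PR that never goes green is escalated instead of being retried forever.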
And even if you don't have deploy previews, thinking about ways that you can verify the work without having to run the branch yourself is becoming increasingly critical. For example, the Codex app is not really a thing that you're going to have a deploy preview for. So one of the things that has become standard is that if you're making any UI-related changes, you upload a video, or at least screenshots, of what has changed, so that other people can quickly review that and get a feel for what is changing. In fact, a lot of the time we have Codex record these as well, so that we don't even have to deal with that. We just have a skill that knows how to spin up the app, test it out, click around, and record a video while doing that, so we can review it and feel more confident about it. But we then still have subject matter experts actually review the code, using code reviewer gates to make sure that the folks who actually understand that app are reviewing it.
The last part: all of these things help you increase the quality of what your coding agent is creating, but it's also important to maintain that quality. You can build anything you want now, but with that you might quickly run into a place where you have bloat: PRs that are each solid but that still add to the overall bloat of the features you're building. And that's where we've really been seeing the hardest challenge, which is restraint. So we regularly build a lot of features first, test them internally, get a feeling for them, and might even try different approaches. Then, when we feel confident about something, we have a debate about whether this is truly the right way to put it in, and we ship it, but we're also not hesitant to remove things again. We've drastically changed things around. We're iterating faster than ever before. And in part that's because we can now not only ship features faster, we can also refactor things faster. After we launched the Codex app, which mind you is only 3 months old and has shipped every week, we've had several refactors. The benefit of this is that we can have one or two people work on quite drastic refactors of how state management works, or other things like that, while still developing other features, without having to do a major code freeze for two or three weeks, because we can quickly incorporate the changes back in and iterate on things.

The other part that we're doing, though, is to take the learnings that we have and continuously feed that flywheel, to make sure that Codex gets increasingly better at things. One way we do this is, as I mentioned before, that we take the quality issues that we see and bake them back into the context and the validation tools that we need.
But we also created things like skills to finish up a PR. For docs, we have a docs editor skill that will enforce the style guideline and a couple of other concerns that we have. And on the Codex app side, we have a finish skill that will go in and refactor the PR before we open it, to make sure it adheres to the best practices. So we often focus first on what the feature is that we're trying to build, and then we have Codex go in and clean it up as it gets closer to the actual shipping state, to make sure that it still adheres to our requirements.
So overall, you want to increasingly treat code as disposable. It's faster than ever before, and it's cheaper than ever before, to create code. Instead, really focus on constraints.
It doesn't necessarily matter what a specific line of code looks like. It matters whether you have the right constraints set up to make sure that the code works correctly, that you were able to verify it, and that you're continuously iterating on making sure that Codex can't make those mistakes again.

The question that I always get, and I know there's going to be a Q&A soon, so I wanted to take this one first, is: where are we going? I can't fully reveal the road map, or even talk about where we're going to be in a year, because things are changing so fast. But there are three trends that I think we're seeing a lot right now and that are going to keep ramping up.

The first one is proactivity. It's agents increasingly doing more things without you actively having to prompt them, whether that is doing something in the future, regularly checking something on a cadence, or even doing things based on certain events, like monitoring errors on Sentry and reacting to them by trying to open pull requests.

The other part is independence. I talked a bit about some of these things already, but between Chronicle, which can learn how you're using tools and how you work with things even beyond Codex, and computer use, which can control your computer and actually perform actions, we're going to see Codex perform tasks on a much more independent level, work for a longer time without you having to babysit it, and actually bring things to completion.
And the third one is parallelization, because, as I mentioned, you can increasingly run more things in parallel, but you're quickly going to run into the limits of what you can do on your local machine. So you're going to see more of a move towards cloud environments or dev boxes, so that you can parallelize your compute and actually run more things in parallel there. Before I jump into the takeaways, though, I wanted to show you a couple more of these examples, just to show you how we're really putting this into practice.
This is another one of these examples of what I call vague prompting. This has become the norm for how I prompt a lot these days. In this case, I knew I was tagged in Slack about this thing, and I didn't have the bandwidth to think about it in that moment. So when I got to it, I was like, "Hey, Codex, can you figure this out? Karen asked me this the other day." Mind you, we have more than one Karen at OpenAI.
And Codex went in and went through all of the things it had to. It found the right DM where I got that Google Doc. It read the Google Doc. It figured out what changes it had to make in the codebase. In this case, it was just an MDX change, but it can be more than that. It fixed that. It then pushed out a pull request. And if you're looking at the time here and thinking, "Oh, it worked for 23 minutes," that's not the model working for 23 minutes per se. At least six or seven minutes of that was just our CI running. Codex was keeping tabs on the CI, making sure that the pull request was making it through the CI into the deploy preview. It pulled that, and then it went back on Slack and actually responded to that thread for me. This is the KA example, by the way. But this really allows me to get more things done and often run five, six, seven agents in parallel, working on different things, by taking a lot of that context-switching energy away from me. I can focus on the feature at hand, take all of the distractions that often come along, and fire those off.
This is one of those examples of babysitting a PR.
This is a PR that I originally started on Codex Cloud, but then I wanted to bring it in locally. So I just asked, "Go fix this PR." I gave it the link. It went onto GitHub. It checked out the branch locally. It figured out what the issue was, and then it made sure that the CI passed. So it pushed it back onto GitHub, monitored the CI, and made sure that everything worked. It knew that in this case the Vercel part wasn't the issue. So it decided, "Hey, that's still running. I'm done here, but it's probably going to be fine."
But then in this case I'm just I asked it like keep an eye on it and like when the actual deploy preview is ready go and alert enory whether the work actually has been completed.
The last thing is slightly different, but I thought it was a fun little demo that I wanted to show you, and we're going to leave it running in the background in a second while we're wrapping up. I had Codex combine the GPT Image 2 model that we released, I guess, last week or two weeks ago, with GPT-5.5 to build a little Windows 3.1 browser clone. I had it write up a plan, and then it generated an image of what that should look like and actually built me that Windows 3.1 clone. And we can pull this up here in the in-app browser.
And you can see it's actually surprisingly complete. I can ask it to do different things. But one of the things it has is Minesweeper, and I love playing Minesweeper. So I thought this would be a fun way to showcase the browser use capabilities. In this case, we're going to have it interact with a fake in-browser operating system from the '90s. But this is really about you being able to verify any kind of app directly in the in-app browser, or now, as of I guess 5 minutes ago, since it hopefully launched (I don't have Twitter open right now), also with the Chrome extension, if you wanted to test things there. So this is going to run. You can already see that's not my cursor here. That's Codex's cursor. It's going to go and figure that out. We're going to have that run in the background for a second.

But basically, the takeaway that I want you to have, if you're doing anything right after this, is to think about the context, the validation, and the verification parts. One: how are you going to make sure that Codex has the actual context on what needs to be done? Is everything actually accessible to it, or how often do you have to paste something into a prompt for Codex to have the right context?
Two: how do you make sure that Codex can actually verify things, and does it get the right signals back when things go wrong? And three: how do you actually make sure that things are right, and do you have that flywheel of taking every review that goes wrong, or every issue that you're catching, and fixing the underlying context and validation tools to make sure that Codex can't make that mistake again? Awesome. Looks like Codex won, so that's good. Let me jump back. Yeah. Here's a slide if you want to take a picture. But yeah, think about these context, validation, and verification parts, because ultimately you should give Codex the same affordances that you give your team, and think about how you equip them to be most successful. And with that, thank you so much. I'm going to be around for a couple of questions.
[applause] Amazing.
Thank you. We have time for Q&A. So, if you're on the podium side of the room, raise your hand and I'll come over with a mic. If you're on the opposite side of the room, there is a mic stand at the front of the row, so you can go up and ask your question. And the gentleman there, you are first in line to ask.
>> Yeah. So, I was interested in the point you made about thinking of the code as disposable.
>> Yeah.
>> That certainly seems correct. But that begs the question: if the code is not the source of truth, the ultimate definition of the system, what is the source of truth?
>> I think, realistically, it still represents the source of truth in the sense that it is ultimately what is implemented. But in terms of what should be implemented, specs can continue to be a crucial part of that, and you can iterate on these specs as well. A good example of this is a repo we published recently called Symphony. Symphony was a tool that folks used internally to build, essentially, agents that take what I talked about but fully build out the entire team to be autonomous, so that Codex would pull up things in a Linear ticket, move it around, and reflect all of that state. The way we published this was that we technically published an Elixir implementation, but that wasn't actually the source of truth. What the team did is they took the internal implementation that we already had, which was written in I forgot which language, and had Codex distill that into a spec. Then we had Codex implement that spec in a couple of different programming languages. And the team continued to iterate on that until it got to a point where the spec was really that representation of the source of truth. But yeah, ultimately I think that between the validation tools that you have, whether it's the tests or the CI suite overall, and the specs that you're writing, those increasingly become the source of truth, and then you have the ultimate details on how things look.
>> So, for example, in the example you just gave, was the validation and verification driven from the spec, which also drove the code?
>> Yes.
>> Okay.
>> All right. Thanks.
>> You're welcome.
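As a rough illustration of a spec acting as the source of truth for more than one implementation, the sketch below checks two implementations against a single shared spec. Everything in it is hypothetical: the ticket states, the `SPEC` table, and both `can_move_*` functions are invented for illustration and are not taken from the repo mentioned in the answer above.

```python
from itertools import product

# Hypothetical spec: for each ticket state, the states it may move to.
SPEC = {
    "todo": {"in_progress"},
    "in_progress": {"in_review", "todo"},
    "in_review": {"done", "in_progress"},
    "done": set(),
}


def can_move_a(src, dst):
    """Implementation A: a flat set of allowed (source, target) pairs."""
    allowed = {
        ("todo", "in_progress"),
        ("in_progress", "in_review"),
        ("in_progress", "todo"),
        ("in_review", "done"),
        ("in_review", "in_progress"),
    }
    return (src, dst) in allowed


def can_move_b(src, dst):
    """Implementation B: hand-written branches with one deliberate gap."""
    if src == "todo":
        return dst == "in_progress"
    if src == "in_progress":
        return dst in ("in_review", "todo")
    if src == "in_review":
        return dst == "done"  # misses in_review -> in_progress on purpose
    return False


def conformance_failures(can_move):
    """Return every (source, target) pair where can_move disagrees with SPEC."""
    failures = []
    for src, dst in product(SPEC, repeat=2):
        expected = dst in SPEC[src]
        if can_move(src, dst) != expected:
            failures.append((src, dst))
    return failures


if __name__ == "__main__":
    print("A:", conformance_failures(can_move_a))
    print("B:", conformance_failures(can_move_b))
```

The two functions stand in for implementations that would normally live in different languages. The point is only that the check is derived from the spec, so either implementation can be thrown away and regenerated without losing the definition of correct behavior.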
>> I saw you using 5.5 for Codex.
>> Yeah.
>> From your experience using this, are there different models that do better for different tasks, or have you just been using one?
>> I would say with every generation of models there's a slightly different way that you communicate with them. And especially once you're moving between different model providers, you're going to see more of that difference. So it really depends. I don't think there's a hard and fast rule. Overall we're seeing models get better over time, but that might be reflected in different types of implementations. For example, 5.5 is incredibly good at paying attention to detail. So in combination with GPT Image, the model can much more faithfully build UIs using that as the baseline, but it might not necessarily be better than 5.4 at building UI from the ground up without a baseline.
>> And then just to follow up on the code disposable thing: when you were using or even building this, what is one thing that you've disposed of when you saw that it's not working out?
>> I think a good example is the automations page in the Codex app, which went through a couple of different iterations even before we launched it, and even after launch, of how we want to display information as the feature evolved. Again, the whole app is three months old since we launched it, and I think we went through two or three different iterations of how automations work. Or worktrees. I think worktrees is actually the bigger thing. The app has built-in worktree support, and I think even post-launch we went through two or three iterations where we completely threw away old code and brought in a whole new way of handling worktrees.
>> Thanks, Dominic.
>> You're welcome.
>> Awesome. And we'll take one question back here and then kick it back over to the mic stand.
>> Hi there. I'm wondering, as Codex is running longer-running, multi-step tasks, what are your new bottlenecks, and how do you manage your time and the context switching? You know, when you prompt something, you have 25 minutes to do something else, and then you have to come back and check in.
>> I'm going to start with the second one, how do I manage my time. I think this is one of the interesting things that might challenge some folks, because I think we're going to move increasingly to a world where you have to show more tech lead or engineering manager skills than you might have had to before as an IC engineer. You have to think ahead: how do I split up features, how do I structure something so I can run multiple things in parallel? And then, similar to how a manager goes and works on other things and doesn't hover over the person they asked to do something, you're going to start switching things up. Let me see if we can switch back to my screen briefly. I often have seven or eight. Let's see if it's going to happen. Can someone switch?
There we go. You can see there I often have eight or nine things pinned on the sidebar that I'm actively working on, and I will get back to them when I get back to them. Sometimes things might hang around there for a day or two, and sometimes I'm going to come in and immediately iterate when something is ready. And the reality is that might even be beneficial, because you have this idea and you're like, I'm going to work on this, and then a day later you're like, never mind, this is actually not that good of an idea. Because it's now easier to do these things, you're going to be more honest with the implementation, because you don't have that emotional investment of, I've worked on this for four weeks cranking out exactly how I wanted to structure things.
>> Oh, that helps.
Thank you.
>> My question is around something you mentioned. Let's say I have a certain protocol stack working in, let's say, Python, and then I want the model to migrate it to a more performant language. How do you manage that kind of task? It's doing great at the small things, but this is 30,000 or 40,000 lines of code migrating to a different language. I just want to say: this is the code, this is how it's working right now, migrate it to this language. How would it break that down into multiple smaller tasks, iterate on multiple architectures, and things like that?
>> That's actually a great reminder of a feature that I was going to show you and didn't. I think the first thing is you're going to get increasingly better at really articulating the goals that you're trying to achieve. In this case, for example, if you already have a codebase that has pretty heavy test coverage, that can become your source of truth for how you want things to be implemented. You can use plan mode inside Codex or write your own plans. Talk through things with Codex. Figure out what acceptance tests are not attached to a specific programming language. So for example, if you're building a web server, first generating acceptance tests that work for both languages, so you can point them at different endpoints, is going to be helpful.
The other part is still an experimental feature you have to turn on. It's called goals, and right now it's in the Codex CLI. We're going to add it to other parts, but you will be able to write /goal and give it a specific goal that you want it to work on, and from that point on Codex will work until it's done with that goal. So the more specific you are with that goal, the longer it will work on it. This was an example where I told Codex that, for our developer documentation, I wanted it to make the Vercel build at least 50% faster. This is a very clear goal that is measurable, and Codex in this case only needed two hours because it was able to find some big wins that I knew were somewhere there but didn't have time to think about. And Codex would go in and actually verify this. It goes in and deploys to Vercel, you can see that up here, it actually deployed and checked: is it faster? How much faster?
You can have these things work for 20 hours. I think I've seen some people currently running this for days. Someone straight up flew from home to San Francisco and came back, and Codex was still cranking three days later. So this is going to be helpful for these types of tasks, as long as you have a very clear, articulate goal of what you're trying to achieve.
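The language-agnostic acceptance tests described in this answer could look roughly like the sketch below: the checks only speak HTTP, so the same suite can be pointed at the existing Python server and at the migrated one. The endpoints, expected bodies, and function names are hypothetical and invented for illustration; only the standard library is used.

```python
import json
import sys
import urllib.error
import urllib.request

# Hypothetical contract: (path, expected status, expected JSON body).
CASES = [
    ("/health", 200, {"status": "ok"}),
    ("/items/1", 200, {"id": 1, "name": "widget"}),
    ("/items/999", 404, {"error": "not found"}),
]


def http_get(url):
    """Fetch url and return (status code, parsed JSON body)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Error responses still carry a body that the contract checks.
        return err.code, json.load(err)


def run_acceptance(base_url, get=http_get):
    """Run every case against base_url and return a list of failure messages."""
    failures = []
    for path, want_status, want_body in CASES:
        status, body = get(base_url.rstrip("/") + path)
        if (status, body) != (want_status, want_body):
            failures.append(
                f"{path}: expected {want_status} {want_body}, got {status} {body}"
            )
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass one base URL per implementation under test.
    for base in sys.argv[1:]:
        problems = run_acceptance(base)
        print(base, "OK" if not problems else problems)
```

Because the suite takes a base URL rather than importing the server code, an agent migrating the service can run it against both implementations and treat any difference as a regression. The `get` parameter exists so the checks can also be exercised without a network.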
Thank you.
>> You're welcome.
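A goal like "make the build at least 50% faster" only works as a stopping condition if it can be checked mechanically. The sketch below shows one way to do that, under the assumption that "50% faster" means the median wall-clock build time drops by at least half. The function names and timings are hypothetical, and this is not how the goals feature itself evaluates anything.

```python
from statistics import median


def time_reduction(baseline_runs, candidate_runs):
    """Fractional drop in median build time; 0.5 means half the time."""
    base = median(baseline_runs)
    if base <= 0:
        raise ValueError("baseline build time must be positive")
    return (base - median(candidate_runs)) / base


def goal_met(baseline_runs, candidate_runs, target=0.5):
    """True when the median build time fell by at least the target fraction."""
    return time_reduction(baseline_runs, candidate_runs) >= target


if __name__ == "__main__":
    # Hypothetical build times in seconds, several runs each to reduce noise.
    before = [600, 620, 610]
    after = [300, 290, 305]
    print(f"reduction: {time_reduction(before, after):.1%}")
    print("goal met:", goal_met(before, after))
```

Using the median of several runs rather than a single measurement keeps one slow or cached build from deciding whether the goal counts as done.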
>> Hi, thank you for the talk. I assume most people in this room have heard of token maxing, and I'm curious: when you think about your workflows and processes, does token count come into play, or does that sort of take care of itself as you optimize the speed or effectiveness of your workflows?
>> I'm going to be fully honest here: I have unlimited tokens, so that helps. But I still empathize with folks having to make the most out of things. I think this is one of those examples where things like memory are going to become crucially important, because ultimately you still want to focus on outcomes. And I think the biggest win is to leverage Codex to be introspective about how it had to solve things before, so that you can continuously improve. For example, in the Codex app codebase the AGENTS.md file is not incredibly long, but it's been driven by Codex based on past things where it ran into issues or we had to re-explain things. So I would ultimately still always focus on outcome and then see how Codex can help you get there faster next time, either by building better tools or providing more guidance. Ultimately I think that is the much more reasonable approach, and that helps you then focus your time on how you want it to be done: either by using something like low reasoning and sitting next to it and talking, or being like me, where I want to be as async as possible, so I typically run everything on extra high and just move on and do something else.
>> Cool. Thanks.
>> You're welcome.
>> All right, we have time for one more question back here.
>> Thanks for the talk.
I'm wondering, how do you think about your context flywheel? I think you mentioned how Codex can go into your CI, figure out what's happening, and come back to the session. How does that relate to when everyone is building something like Codex, when you have so much context coming in from different people? Could you talk more about how you guys work with that, if you do?
>> I'm trying to see if I understood your question correctly. Are you talking about how the context window gets managed, or
>> More like, there are obviously a lot of developers working on this at all times, and there's a lot of context from each developer on what they're working on, and there are probably a lot of spec documents that go along with it.
>> Yeah.
>> But there's also context that I'm guessing Codex will encounter as it's going through and debugging. How do you guys manage that context, like through bugs or its own reasoning?
>> Yeah. I mean, I think this goes back to what I was saying earlier. Codex for a while has been really focused on being able to discover context as it goes, and learn how to discern what is important and how to efficiently navigate a codebase. A good example of this is that from GPT-5.4 to GPT-5.5 the token usage has become drastically more efficient, and that's because Codex has increasingly learned how to deal with some of these problems, how to navigate codebases, how to pull additional context. So from that lens it's pretty good at navigating additional context, but we're trying not to overdo it, because the bigger issue is actually not the amount of context, it is conflicting context. This is especially a problem if you're moving from a different model. Codex is incredibly good at instruction following, but if your instructions have been contradicting each other in text, it's very easy for Codex to go off the rails. So we still try to minimize the amount of context. As I mentioned, our AGENTS.md file, for example, is not that long, and we also typically open Codex inside the project that we care about rather than in an overarching monorepo or something, so that Codex has a good starting point for where it might run into things.
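Since conflicting instructions are named here as a bigger problem than the volume of context, one cheap safeguard is to lint an instructions file for direct contradictions before an agent reads it. The sketch below is a deliberately naive heuristic and an assumption of this example, not a Codex feature: it only catches lines of the form "Always X" versus "Never X" (or "Do not X") with identical wording.

```python
import re

# Matches optional bullet, a polarity keyword, then the action text.
RULE = re.compile(
    r"^\s*[-*]?\s*(always|never|do not|don't)\s+(.+?)[\s.]*$",
    re.IGNORECASE,
)


def find_conflicts(text):
    """Return (first_line, second_line, action) for contradictory rules."""
    seen = {}  # normalized action -> (is_positive, line number)
    conflicts = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = RULE.match(line)
        if not match:
            continue
        is_positive = match.group(1).lower() == "always"
        action = " ".join(match.group(2).lower().split())
        if action in seen and seen[action][0] != is_positive:
            conflicts.append((seen[action][1], number, action))
        seen.setdefault(action, (is_positive, number))
    return conflicts


if __name__ == "__main__":
    # Hypothetical instructions file content.
    sample = (
        "- Always use pnpm.\n"
        "- Run the tests before committing.\n"
        "- Never use pnpm\n"
    )
    for first, second, action in find_conflicts(sample):
        print(f"lines {first} and {second} disagree about: {action}")
```

Real contradictions are usually phrased less symmetrically than this, so a check like this is at best a first filter. Keeping the instructions file short, as described above, is what makes manual review practical.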
>> Dom, thank you so much again for your time and coming here. Let's give it up for Dom again.
>> Thank you.