Cody highlights the shift from raw model power to "effort-based" reasoning, revealing that even the most advanced LLMs remain frustratingly non-deterministic. It’s a sobering reminder that in the age of agentic coding, managing context is often more critical than the model itself.
Deep Dive
I Didn’t Expect This: Opus 4.7 vs GPT 5.5
So, one of the things I've been meaning to do more on my channel is benchmark different models against each other using my own type of test scenario. For example, I have an actual application that I'm building out, and there's a teeny little bug that took me many, many prompts to get fixed in Claude Code with Opus 4.7. I wanted to verify again, with these two models neck and neck, which one can solve it at the lowest thinking level. Those are the parameters for this experiment. We have Opus 4.7 with the 1 million context window on the left, and GPT 5.5 low on the right.
Again, this is effort level low as well.
In terms of skills, we have none set up on either of these other than the default ones baked into the project. For the setup, I took an older version of my application, downloaded it as a zip file, put it in a directory, and told Claude to extract it into eight different directories. I'm going to run these different effort levels with Claude and Codex to see if either can solve one particular bug that took me many prompts to solve a couple of days ago.

In this personal application I've been working on, you can create a new Claude Code session. Inside this session is an embedded XTerm inside of an Electron app. The issue is that when I do a shift return, it just submits the prompt. A newline character should have been added to the terminal, and none is. So, that's the bug. It's a very simple bug, but these models actually had a lot of trouble solving it.

To keep the parameters of this scenario tight, we're going to pass the same prompt to Codex and Claude Code and see which one can solve it, and at which thinking level. Now, I will say the prompting you add can really make or break the solution. So, I'm just going to vibe code the prompt. I could try plan mode, but that kicks in the underlying internal prompting of these tools, and you don't really know how that works. It's kind of a black box. I just want to write a prompt and see how well these tools and the models themselves can take it, dive into my code base, and understand what this bug is and how it comes about.

Before I kick off the prompt, I want to say that I have an agentsystem.dev website where I have all my skills, my agents, and the things I use every day when I'm doing agentic coding, which have made me much more efficient at shipping. I have found this to be a huge value in my day-to-day coding. So, go check out agentsystem.dev and join the waitlist, and when this is ready, I will send you all an email.

All right, going back over here, I'm going to prompt this out: "There is a bug when I navigate to the project page and load up a new terminal session. The Claude Code terminal will load fine, but when I do a shift return, it just submits the prompt instead of actually adding a newline character. Can you please debug and fix this issue?" I'm trying to keep it very high-level, kind of like a vibe coder. We're going to copy this prompt and paste it over here into GPT 5.5 low as well. Let's see if each one is able to figure out what the bug is and fix it for us.

All right, both of the low-effort runs are done. I'm going to run my application with the Codex one first and see how it did. I'll make a new terminal window.
And then we will just type in some information and do a shift enter, and it still submits. So, Codex was not able to fix this in one shot with the low effort. We're going to go ahead and try it as well with the Claude solution.
We'll go over here. Let's go into Claude low and run pnpm dev. Type some stuff, shift enter, and again, this one did not fix it. So, neither of the low-effort runs was able to figure out what the issue was. Of course, you could take this test further and ask: if I were to keep on prompting these, at what point would one of them figure it out? Honestly, if it doesn't fix the bug on the first try, you should probably just switch to a higher effort. But again, this is more for experimentation. Now, let's switch over to the medium project, which this one's already in, and I'm going to make sure that the effort level is on medium.
Okay. And then over here, we have Codex on medium. I'm going to say model.
We're going to verify that we're on medium.
And now, we're going to enter in the exact same prompt in the medium effort, and then over here in Codex, same idea.
We're going to enter the same prompt for the medium effort of Codex. And hopefully, everything is set up. Just to double-check: we have the Claude medium directory and the Codex medium directory. And we're going to let this run.
We'll come back in a second. So, it does look like Codex is done. So, let's run the Codex version and test this out.
Now, this is the Codex medium solution. Type some information in, shift enter, and this one doesn't fix it either.
So, we'll go back to the code and wait for Claude Code to finish. All right, so now Claude Code is done. Let's just try to run the project over here and test it out.
Type in hello world, shift enter, and again, that still submits it. So, unfortunately, both the medium efforts of Claude Code and Codex cannot figure out how to solve this bug. So, we're going to go ahead and just try it on the next level of effort. We're going to go back to Claude high. And this one will be Codex high.
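The run matrix behind this experiment (two tools, and eight extracted copies of the project, which works out to four effort levels per tool) can be sketched as below. This is only an illustration: the directory names, the `experiments` base path, and the `xhigh` label are assumptions, since the video does not show the actual layout.

```typescript
// Hypothetical names throughout: the video does not show its directory layout.
type Tool = "claude" | "codex";
type Effort = "low" | "medium" | "high" | "xhigh";

interface Run {
  tool: Tool;
  effort: Effort;
  dir: string; // working copy extracted from the same zip
}

const TOOLS: readonly Tool[] = ["claude", "codex"];
const EFFORTS: readonly Effort[] = ["low", "medium", "high", "xhigh"];

// Build one run per (tool, effort) pair so every combination starts
// from an identical, untouched copy of the project.
function buildRunMatrix(baseDir: string): Run[] {
  const runs: Run[] = [];
  for (const tool of TOOLS) {
    for (const effort of EFFORTS) {
      runs.push({ tool, effort, dir: `${baseDir}/${tool}-${effort}` });
    }
  }
  return runs;
}

// Print the planned working directories, one per line.
console.log(buildRunMatrix("experiments").map((r) => r.dir).join("\n"));
```

Keeping one directory per combination means a failed attempt at one effort level never leaks edits into the next attempt.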
Now, I will say there is something called caching when it comes to these models, and if you enter the same input twice, there's a high chance it's going to give you the same results, or something biased towards the results it just gave you. I believe that if you change the effort level like I'm doing, like switching this to high, the caching is on a per-effort-level basis. A more proper experiment, which I forgot to do, is to prefix your prompts with something like "experiment three" and then paste in your actual prompt. But since I don't want to change what I'm doing midway through my experiment, I'm just going to paste in this right here. This is the exact same prompt we've been using, and on the high version of Codex we're also going to kick off that prompt and see which one is able to solve this. All right, looks like Codex is done. I will say that at matching effort levels, it seems like Codex always finishes a little bit faster, but I don't know.
I'm not sure if you want it to finish faster in this scenario, if you want it to actually figure out the bug. So, let's just go ahead and run this and see if Codex managed to fix it this time. It seems like the output looks very similar to what it was in the last one, so I don't think it's going to be able to fix it in this scenario. But we could be wrong. Let's just go ahead and open this up, and we're going to try testing out the prompt again.
And it just submits it. So, that did not fix it. Now, while Claude Code is still running, I do want to test something. I want to clear out this thing, and I want to put a prefix. Like I kind of mentioned, I'm going to say experiment Codex high three.
And I'm going to kick that off. Again, this is prefixing your prompt so that any caching that's going on will probably be bypassed because of how the models work, but we'll see. I'm not sure how true that is. I need to go research it and get more information, but that's just something I read online. It could also affect the outcome of what it's trying to do and the path it goes down for solving the bug.
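The prefixing idea can be sketched as a small helper that stamps each run with a unique label, so no two runs send a byte-identical prompt. The function name and label format are hypothetical, and whether a prefix actually changes any caching behaviour is unverified, as noted above.

```typescript
// Hypothetical helper: prepend a unique label to the shared prompt.
// Whether this affects provider-side caching is an unverified assumption.
function labelPrompt(
  prompt: string,
  tool: string,
  effort: string,
  attempt: number,
): string {
  return `experiment ${tool} ${effort} ${attempt}\n\n${prompt}`;
}

// The prompt text itself stays identical across runs; only the label changes.
const basePrompt = "Can you please debug and fix this issue?";
console.log(labelPrompt(basePrompt, "codex", "high", 3));
```

Because the label comes first, the rest of the prompt is unchanged, which keeps the comparison between runs as fair as possible.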
Now, I will say this is a bug that I've already fixed on main. I fixed it a couple of days ago, and the issue is related to XTerm, which I'm using in my project. Given that Claude is actually diving into the XTerm library to look at the source code, I have a feeling it's going to solve it this time. I noticed that Codex did not dive into the library I'm using to try to figure out what's going on. All right, after 10 minutes, looks like Claude is finally done. So, let's just run this one and test it out.
And the Codex second run is actually done as well. Um Okay, so Claude high. Let's just go ahead and open up a new terminal, and we're going to test this out again.
And it looks like it fixed it. Now, I do think the main reason it managed to find and fix this bug is that it decided to look up the XTerm library. And I think if I had actually included that in the prompt, we could probably get this fixed in the low effort mode. Again, the context you provide is very important, and if you can help guide the LLM to exactly what you think the issue is first, that's where your software engineering expertise comes into play. You're going to save more tokens, and you're going to save time instead of sitting here waiting 10 minutes for the LLM to turn around and hopefully get the answer. Now, let's try the Codex run. Again, new terminal.
And it still submits it. So, Codex was not able to fix it on high, even after the second run, which is crazy, because when I first fixed this bug a couple of days ago, Codex was the one that figured it out, and Claude couldn't. These models are so non-deterministic that it's very hard to use them. One day you wake up and it's doing great, the next day you prompt poorly and it's doing badly. So, to be fair, I will say Claude Code fixed it with a high effort level. It could maybe have done it faster with extra high. I will give Codex one more shot. I'm going to switch over to Codex extra high and run it with the experiment prefix, this time numbered four. Go ahead and copy this. Let's switch the model. Make sure we're on extra high. Okay, there we go. We're going to paste in the last one.
Change this to a four. So, let's just run this. I'm going to do a pnpm install over here, and then we're going to do the same thing with Claude, just to give it another chance on extra high and see if it does it faster.
Okay, extra high intelligence.
And then we're going to switch over to that prompt that we just did. I think I can just paste it in. And I will kick this off. We'll come back and check in just a little bit. All right, so it looks like both of these are done. Let's run Codex extra high and see if it fixed the issue. So, type some stuff in, shift enter, and it's still submitting. So, Codex was not able to solve this with any of the effort levels, unfortunately. But I do think this is a scenario where, if you can't solve the issue with any of these effort levels, you should probably re-evaluate your prompt. Also, sometimes you should reuse the context window and just say, "Hey, you've already tried solving it this way. Don't try it that way again." Again, the test I'm trying to do is very narrow. I just want to know: from the initial prompt, can you solve it? So far, Codex with GPT 5.5 was not able to solve it at any effort level. Now, let's go over here and test Claude extra high, because sometimes extra high can overthink and go down another path that doesn't solve the issue. So, we were lucky that it solved it with high, and we will make a new terminal.
I'll type something in, and we are able to shift enter here as well.
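The video does not show the code change that fixed the bug, so the following is not the author's fix. It is only a minimal sketch of one common way an embedded xterm.js terminal can treat Shift+Enter differently from Enter: a key handler in the shape accepted by xterm.js's `attachCustomKeyEventHandler`, where returning `false` stops xterm.js from processing the key. The `makeShiftEnterHandler` name, the `send` callback, and the `newlineSequence` value are all assumptions, and the right sequence depends on what the program running in the terminal expects.

```typescript
// Minimal shape of the keyboard event fields this sketch reads.
// A browser KeyboardEvent satisfies it.
interface KeyLike {
  type: string; // "keydown" | "keypress" | "keyup"
  key: string; // e.g. "Enter"
  shiftKey: boolean;
}

// Hypothetical factory. The returned function has the form xterm.js accepts
// for attachCustomKeyEventHandler: false means "do not process this event",
// true means "handle it as usual".
function makeShiftEnterHandler(
  send: (data: string) => void, // assumption: writes to the underlying PTY
  newlineSequence: string, // assumption: what the program treats as "new line"
): (event: KeyLike) => boolean {
  return (event) => {
    const isShiftEnter = event.key === "Enter" && event.shiftKey;
    if (!isShiftEnter) return true;
    // The handler may fire for keydown, keypress and keyup. Send once on
    // keydown and swallow the rest so a plain carriage return (submit)
    // is never emitted for Shift+Enter.
    if (event.type === "keydown") send(newlineSequence);
    return false;
  };
}
```

In an Electron renderer this might be wired up as `term.attachCustomKeyEventHandler(makeShiftEnterHandler((d) => pty.write(d), seq))`, where `pty.write` and `seq` stand in for whatever the application actually uses.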
Now, that's all I want to do for this experiment. I know there's a ton more stuff I could do. One thing that comes to mind is to re-prompt it on low, but give it more specific instructions, like, "Hey, go look at the XTerm library version that I'm using to see what the issue could be, and whether I'm maybe using XTerm incorrectly." Add that to both, and maybe low effort would actually get it right. But the power of high-effort reasoning is that it does all the thinking and the edge-case handling for you, so you don't have to remember to say, "Hey, go look up the library version to see if you can fix it." That's a very technical thing that a lot of vibe coders probably wouldn't know to do. Another test could be planning.
We could have tried plan mode and kicked that off to see how these tools do. But I do feel like when you use plan mode, that's just kicking in a custom MD file that's built into Claude Code or a custom MD file that's built into Codex. Now you're testing not only the model, but also how good their prompting is for debugging. But let me know if there's anything else you'd like to see me try in this experiment. Again, I want to keep it as narrowly focused as possible.

I do have a repo that I'm working on to automate this. Instead of me having to manually clone this repo and do it four times, I have something that does it for me and kicks off eight concurrent agents with the same prompt, so that later I can come back, run them, and test to see which one actually solved it. We'll probably be using that in the future on this channel if I do more experiments like this.

But yeah, that's about it. Go check out agentsystem.dev if you want to join my waitlist, so that when this is ready you can start using it. I also have an entire course that talks about agentic coding and how to use Claude Code and Cursor, with over 13 hours of content teaching you how to interact with these models to be successful.
Other than that, have a good day and happy coding.