This simple prompt change effectively bridges the gap between AI's raw potential and actual engineering reliability. It highlights that even the best models still need human-imposed rigor to stop guessing and start verifying.
Deep Dive
I Found One Prompt Change For Better AI Code
Hello guys. So, I've been testing LLMs on my channel for quite a while now.
This is the benchmark leaderboard. And I realized that with one change in the prompts, on almost any project, I could kind of force the models to score 20 out of 20, but with an important caveat. Side note: I'm still traveling, so that's why I'm recording with this microphone in a hotel with weird lighting, but I still want to shoot this video because I think I discovered something interesting. So this realization is what I want to show in this video, based on two models, DeepSeek V4 Flash and Pro, which I was retesting to understand why Flash scored better than Pro. So let me show you what I found.
So you need to understand how I evaluate those 20 points. There are four projects, and I make five attempts with the same prompt on each of those projects. Then there are evaluation tests, which look something like this, which I copy in from elsewhere, so the model doesn't know about them. Those tests determine how many times out of five attempts the result failed. But what I noticed often happens with some models is that they say they delivered the task, but they didn't actually verify it automatically with their own tests. So in this case, DeepSeek Flash says that N+1 prevention is in place, but there was no automated test actually checking that assumption. And then when I run my evaluation tests from elsewhere, they fail with, for example, this error when I simulate that N+1 query scenario. So this is the problem.
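The scoring scheme described above (four projects, five attempts each, one point per attempt that passes the hidden evaluation tests) can be sketched in a few lines of Python. This is only an illustration: the project names and pass/fail values are made up, and the real benchmark harness is not shown in the video.

```python
from typing import Dict, List

ATTEMPTS_PER_PROJECT = 5  # from the video: five attempts per project


def score(results: Dict[str, List[bool]]) -> int:
    """Count attempts whose hidden evaluation tests all passed."""
    for project, attempts in results.items():
        if len(attempts) != ATTEMPTS_PER_PROJECT:
            raise ValueError(f"{project}: expected {ATTEMPTS_PER_PROJECT} attempts")
    return sum(sum(attempts) for attempts in results.values())


# Hypothetical results: True means the attempt passed every evaluation test.
example = {
    "project_a": [True, True, True, True, True],
    "project_b": [True, True, False, True, False],  # two failed attempts
    "project_c": [True, True, True, True, True],
    "project_d": [True, True, True, True, True],
}
total = score(example)
print(f"{total} out of {len(example) * ATTEMPTS_PER_PROJECT}")
```

With these made-up values, two failed attempts on one project bring the score down from 20 to 18.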
Some models do not really write their own tests, and I do not enforce that in the prompt, relying on the model being, well, self-sufficient enough to do that. But what if I do enforce writing and running automated tests related to the actual task and prompt? That was the idea and assumption I tried to test. What if I add something like this to the prompt, one line: create tests and then ensure they pass. And, as AI-hyped people like to say, that changed everything.
So this is the Excel sheet for that specific project. I decided to relaunch five attempts, as I mentioned, with DeepSeek Pro, which had made two mistakes on this project at the time, and Flash, which had made three. And look at the timing and the price; that will be important. It was 4 minutes and 10 cents per prompt for the Pro model, and about one and a half minutes and 1 cent per prompt for the Flash model, with some mistakes. Now, if I enforce the tests, two questions: will each model score five out of five, and what will be the cost and the time?
Let's start with DeepSeek Pro. This was the fastest result out of five attempts, and it was 4 minutes and 47 seconds. I remind you, the average without that additional test line in the prompt was under 4 minutes. This was the fastest, and the price was 17 cents, which was also the lowest out of five attempts. So enforcing the tests naturally adds additional work for the LLM to process those tests and make them work. And this is where the fun stuff actually happens.
So let me show you the other attempts. This was another attempt: 6 minutes and 46 seconds and 18 cents per prompt. Then there was another one: 35 cents and almost 16 minutes. And then another one was 32 cents and 12 and a half minutes. And then one attempt out of five was really special. This is the last screenshot from that: 77 cents on one prompt versus the average of 10 cents without the test requirement, and OpenCode didn't even show the actual time, but I measured it. It was 34 minutes on one prompt.
And if you want to understand why: the tests failed on the first attempt, and then the LLM was running in circles. This task was specifically testing that. It used an unknown package, and the LLM should have read the readme, analyzed the tests, and then applied the right syntax to avoid the N+1 query. When the tests started failing for the LLMs (I will show you a few screenshots), I started seeing these: thought for 25 seconds, thought for 12 or 14 seconds. Basically, DeepSeek didn't understand why the tests were failing, so it was trying to fix this, fix that, retest, then it failed again, and the circle goes on. So this is an example of a failed test that was created to ensure that the task was done.
It reminds me of a trendy thing these days called loop, looping, or goal. Basically, the LLM is looping until some success criterion is met. And in this case, the success criterion is deterministic: the test should pass, meaning the N+1 query problem is not present.
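The "loop until a deterministic success criterion is met" idea can be sketched roughly as follows. Everything here is hypothetical: `attempt_fix` and `tests_pass` stand in for whatever the coding agent and the test runner actually do, and the iteration and budget caps are an addition for illustration, since the runs in the video had no such limit (which is how one prompt reached 34 minutes).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    passed: bool
    iterations: int
    cost_cents: float


def loop_until_tests_pass(
    attempt_fix: Callable[[], float],  # hypothetical: one agent step, returns its cost in cents
    tests_pass: Callable[[], bool],    # deterministic check, e.g. running the test suite
    max_iterations: int = 10,          # assumed cap, not part of the original experiment
    budget_cents: float = 50.0,        # assumed cap, not part of the original experiment
) -> LoopResult:
    spent = 0.0
    for i in range(1, max_iterations + 1):
        spent += attempt_fix()
        if tests_pass():
            return LoopResult(True, i, spent)
        if spent >= budget_cents:
            # Stop and report failure instead of looping forever.
            return LoopResult(False, i, spent)
    return LoopResult(False, max_iterations, spent)


# Simulated agent: each step costs 7 cents and the tests pass on the third try.
state = {"tries": 0}


def fake_fix() -> float:
    state["tries"] += 1
    return 7.0


def fake_tests() -> bool:
    return state["tries"] >= 3


result = loop_until_tests_pass(fake_fix, fake_tests)
print(result)
```

Returning an explicit `passed=False` when the budget runs out is one way to make the loop report that it gave up rather than claim success.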
So DeepSeek was looping and running in circles until success. And it was even fun to watch. Maybe fun is not the right word, because it was pretty expensive.
But I watched how it tried to look at vendor files, tried to read the docs, tried various attempts, then was thinking for 29 seconds, with all sorts of errors and issues along the way related to the tests that it created. It had to fix the issues in the tests themselves so they would even work in the first place. So yeah, the road to the result was much longer and much more expensive. But did it succeed?
The answer is yes. Five out of five. All tests pass. No failures in my evaluation tests. Was it the same for DeepSeek Flash? Yes and no. Let me show you. These were the five attempts for DeepSeek Flash, and this is the test that it generated, with the method that was needed. 3 cents was spent and 5 minutes in this case. The next attempt: 4 cents and 8 minutes. Then 2 cents and 3 minutes; that was probably the fastest.
Then 2 cents, and I restarted that one, but it was roughly 4 minutes. And then 5 cents and 9 minutes. That was the Flash model, which previously, without enforcing the tests, did the same job in 1 minute and 39 seconds on average for 1 cent. So again, percentage-wise, the difference in time and price is very significant. But in the end it also delivered five out of five successfully, only with different code inside.
So let me also show you the difference between the Pro and Flash code. This is example code by DeepSeek Flash. It had to implement the array of rules with validation, successfully avoiding the N+1 query problem, and then it generated an automated test, which looks like this. One of the methods, probably the last one, was related to the N+1 query. This one builds the scenario, then calls that form request validation class, and then expects the query count to be one. DeepSeek Pro did the same thing, but differently. This is the validation class by DeepSeek Pro. In addition to the rules, which are okay, it created more methods inside the same form request class: a few properties here for counting the queries, then prepare for validation, measurement, and passed validation. And here is another example, perhaps even more verbose and telling.
So this is another form request, and in addition to the rules array we have a measurement function, create validator, and a lot of logic that basically lives in the form request class itself and not in the test method. That means you can build multiple test methods, test assertions, and test scenarios reusing the same form request class, which is a better architectural decision. And that happened in all five attempts, which is basically the difference between DeepSeek Pro and DeepSeek Flash.
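The query-count assertion described above (build the scenario, run the validation, expect exactly one query) can be illustrated with a small Python analogue. This is not the code from the video, which is a PHP form request class that is only shown on screen; the `CountingDB` wrapper, the `products` table, and both validation functions are hypothetical stand-ins.

```python
import sqlite3
from typing import Iterable, List


class CountingDB:
    """Thin wrapper around sqlite3 that counts every query run through it."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.query_count = 0

    def query(self, sql: str, params: Iterable = ()) -> List[tuple]:
        self.query_count += 1
        return self.conn.execute(sql, tuple(params)).fetchall()


def missing_ids_naive(db: CountingDB, ids: List[int]) -> List[int]:
    # N+1 style: one existence query per submitted id.
    return [i for i in ids
            if not db.query("SELECT 1 FROM products WHERE id = ?", (i,))]


def missing_ids_batched(db: CountingDB, ids: List[int]) -> List[int]:
    # One query for the whole array, however many ids were submitted.
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = db.query(f"SELECT id FROM products WHERE id IN ({placeholders})", ids)
    found = {row[0] for row in rows}
    return [i for i in ids if i not in found]


def test_validation_runs_one_query() -> None:
    db = CountingDB()
    db.conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY)")
    db.conn.executemany("INSERT INTO products (id) VALUES (?)", [(1,), (2,), (3,)])
    db.query_count = 0  # count only the queries made during validation

    assert missing_ids_batched(db, [1, 2, 3, 99]) == [99]
    assert db.query_count == 1


test_validation_runs_one_query()
```

Running `missing_ids_naive` against the same four ids would issue four queries, which is exactly the pattern this kind of test is meant to catch.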
Flash is going for quick results, and DeepSeek Pro is going for better, deeper architectural decisions most of the time. So this is basically the thing that I wanted to show in this video. You may enforce and require automated tests as part of your prompt, which makes the quality better because the model tests itself. But what happens if the tests fail on a more complex task? Then the cost and the time become a bit unpredictable, especially for cheaper models, which turn out to be not that cheap anymore. The average cost for DeepSeek V4 Pro skyrocketed from 10 cents per prompt to 35 cents per prompt.
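The average cost figures can be re-derived from the per-attempt prices quoted earlier in the video. The short Python sketch below uses only those quoted numbers; the variable names are illustrative.

```python
from statistics import mean

# Per-attempt prices with the test requirement, as quoted in the video (cents per prompt).
pro_costs_with_tests = [17, 18, 35, 32, 77]
flash_costs_with_tests = [3, 4, 2, 2, 5]

# Average prices without the test requirement, as quoted (cents per prompt).
pro_baseline_cents = 10
flash_baseline_cents = 1

pro_avg = mean(pro_costs_with_tests)
flash_avg = mean(flash_costs_with_tests)

print(f"Pro:   {pro_avg:.1f} cents avg, {pro_avg / pro_baseline_cents:.1f}x baseline")
print(f"Flash: {flash_avg:.1f} cents avg, {flash_avg / flash_baseline_cents:.1f}x baseline")
```

The Pro average works out to about 35.8 cents, in line with the roughly 35 cents quoted, and the single 77-cent attempt accounts for a large share of it. The Flash average comes to 3.2 cents, about three times its 1-cent baseline.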
And the time also grew similarly, about three times on average across five attempts. The difference is similar with DeepSeek V4 Flash: basically three times longer and more expensive. Still a few cents, but much more expensive. So it's still a good deal financially with DeepSeek V4 Flash if you don't care about the depth of the code and you want a quick but working solution. But of course it depends on your prompt, project, scenario, and the risk you want to take.
And finally, here I am the next morning in the same hotel, and I realized I can repeat the same test on more models and test the assumption even wider. So this is the table where I tested not only Pro and Flash but also other models that made at least one mistake in the previous attempts. From testing six more models, what I found out first is that all of them started passing the tests with a perfect score, except one case on Qwen 3.7 Plus, where it basically cheated. Here's the screenshot. It took very long. It spent 81 cents on the prompt, and it said that it delivered the job, but with one caveat: the tests encountered issues. So at some point it stopped trying to fix the tests and said that the task was delivered. And when I ran the evaluation tests, only the example test was present, and all the other tests were basically gone, so they didn't count.
So I called it cheating, but maybe that's not necessarily the case. Maybe it's actually the right thing to do. We can discuss in the comments. If the LLM is running in circles and cannot find the solution, maybe it should say exactly that. This happened only once in all the attempts. All the other models successfully delivered working tests, but with much more time and budget. So it's the same time increase: three times, three times, four times, all over the place. It differs from model to model. And the price is also about three times bigger on average, except for one case.
As a kind of counterexample, I took the latest Kimi K2.7, which I recently reviewed on my channel and which made no mistakes on this task, and I thought: would it take longer and cost more to deliver the same thing? And the answer is no. You can see it's roughly the same time and cost with or without enforcing the tests, which means that it had created and executed those tests anyway, without me prompting. And this is the main difference. This is probably the main takeaway, and most likely the reason why Opus and GPT score perfectly on my benchmark leaderboard: they evaluate themselves automatically. They run automated tests and other checks without me enforcing that artificially. Other models, in some cases or in all cases, did not verify their work. So for some models, as I mentioned, this line in the prompt changed everything: the time, the cost, and the result.
So this was a fun and interesting experiment. What other takeaways can we get from it? We can discuss, of course, in the comments below. And I will keep shooting videos on testing, benchmarking, and experimenting with LLMs. So subscribe to the channel to not miss any of those videos. That's it for this time, and see you guys in other