This video demonstrates that LLM code generation stability varies significantly across models, with Gemma 4 achieving 90% success rate in generating working Python scripts compared to 0% for Devstral Small 2. The presenter tests five models (Gemma 4, Kimi K2.6, Qwen3.6, GLM-4.7, and Devstral Small 2) by asking each to create a Python throughput benchmark tool 10 times. Key findings include: (1) Model performance varies dramatically for the same task, (2) Prompt quality significantly impacts results, with improved prompts increasing Devstral Small 2's success rate from 0% to 100%, (3) Larger models can generate better prompts than humans, and (4) Model selection should consider specific task requirements rather than general reputation. The presenter recommends using Gemma 4 for small Python tools and suggests asking larger LLMs to write prompts for smaller models to achieve better results.
深掘り
前提条件
- データがありません。
次のステップ
- データがありません。
深掘り
LLMs vs Python. I asked each model 10 times to create the same small python script. Gemma 4 wins.追加:
I have an idea for one Python command line tool and I decided to test how successfully this Python script can be implemented with different LLM models and I asked 10 times each model to create the same Python script and here on the screen are results. So in my test best results I got from Gemma 4 and completely zero successful Python scripts was created by devstral small but I left this in the list because every test it created valid Python syntaxes. It just have not implemented all uh option what I asked it to do. So now now I will quickly go to explain what was my test. I decided to write a Python uh common line tool that it will measure LLM throughput benchmark. They already exist. For example, VLM has built-in benchmark. SGLAN has built-in benchmark. There are other LLM benchmark. I only wanted to create small Python tools that it will measure uh throughput tokens per second. There is correct way how to test it. It's you go to the hugging phase. You download from the model tokenizer file. For example, for ZF4, this is tokenizer.json file 32 megabyte. So you can use this file to identify tokens in how many tokens was produced. We never know how many tokens text has uh because we don't know how model was trained. But for example, I just Googled one example uh unbelievable. It can be three tokens.
For example, on believe able. If we will use tokenizer, we can find out how many tokens was generated in the text output and divide it by time and we will find tokens per second. This is correct way to count tokens. I decided to create Python script that does not uh need to use this tokenizer.json file. So there are alternative methods.
Method number one is to use streaming.
We stream uh output from a text and often different frameworks for example via LLM outputs one token per line. So we can just count how many lines was in output divide by time. Now we have how many tokens per second from this model.
Second method according some study I found it's roughly for English text equal four characters is one token.
Divide by four and divide by time we get tokens per second. A third alternative method how to count tokens. Some uh frameworks uh they can output uh how many tokens was produced. I'm using the same prompt 2 plus two but without streaming and in this output I'm getting this part completion tokens 177 divide by time how long it took to produce this output and I will get tokens per second. So I'm asking LM model to create Python script to implement these three methods without tokenizer.json.
So this is a prompt what I created. I wrote it manually. What I think I can explain to another person. So create a Python script uh measure LM throughput use OpenAI compatible API use only Python 3 standard library. This is options what I was expecting and I asked it to implement three different method method A uh method B and method C. First I repeated it 10 times with each model and then I repeated it here is for loop.
I repeated it 100 times. So you can see on this side you can see all produced Python scripts and final line in this script is to finally run this benchmark against some model and I'm asking what is LLM write at least 50 words. So uh this is example I tested with VLM built-in benchmark to see what numbers I can expect. So I use this command uh to run benchmark. I'm using quen 3.6 model where which I'm benchmarking and I'm getting 152 tokens per second. So gemma 4 and nine out of 10 times I got great script which measures what I asked Kim K2 was not so great it's much larger model 7 out of 10 was success rate quen 3.6 six. This was surprising four out of 10. GLM 4.7 and Devstral Small 2 failed on all tests. So I decided what I can do about this and I thought maybe my prompt is not great enough. Let me ask a larger model like Kim K 2.6 and improve my prompt. And I iterated several times. I asked Kimi to create prompt. I'm tested it with devstral failure. Success failure success. And I found prompt that will create successful script with devstral. And this is improved prompt. It's much larger, more detailed from Kimmy. So this is this is my div output after I used improved prompt.
Success. Success. Success. Syntax is error. Success. Success. S. Yeah, this is success. It just very highly accurate result. Uh, success. Success. Success.
Nine out of 10 times. Devstral can create this script. I run it one more time.
All 10 are successful. And this is Devstral small two which was failing with my manually written prompt. Now it's performing better than other models. My prompt was not great. Prompt improved by LLM model be made much better result that I can do.
Maybe LLMs can talk better to other LLMs than humans. What conclusion I can do?
Don't write your prompts. Ask Larger LLM model to write your prompt and then use smaller model. Smaller models can do better if you can write good prompt. And surprising for me was quen 3.6 uh it's very praised on the internet compared to Gemma. But in this small test for some reason Gemma 4 writes better small Python tools with standard library. Quen 3.6 six wasn't so good for this results and it was mostly creating in successful 40% of the time and I can create better prompt and quen 3.6 six will perform much better. But for me, it was interesting to get results one shot. So this is my takeaway. If you want to create small Python tool, try GMA4. It's surprisingly better than other models.
Bye. old joke about how people build uh ships inside the bottles and joke was you putting all ship components in the bottle adding glue shaking and sometimes you get ships. So I think this my experiment is the same. I'm just shaking and I'm just checking how successful this bottle
関連おすすめ
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
So What's Odin Lang Even Good For
TechOverTea
131 views•2026-06-01











