The shift toward agentic benchmarks is a necessary evolution in AI evaluation, yet the inclusion of non-existent models like GPT-5.5 undermines the video's technical credibility. It prioritizes speculative hype over the rigorous, evidence-based analysis expected in professional tech discourse.
Deep dive
Gemini 3.5 Flash Is Better Than Kimi k2.6 & Antigravity 2.0 New AI Coding Agent like Codex
Hey guys, welcome back to another exciting video. Yesterday Google published its new model, Gemini 3.5 Flash. Here you can see that Cursor has its own benchmark, CursorBench, which is built on real-world agentic tasks. Gemini 3.5 Flash is here and Kimi K2.6 is here, and it is also beating the low variants of GPT 5.5 and Opus 4.7.
Now, why is Gemini 3.5 Flash so important for us? This is the Artificial Analysis page. They test all of the models published so far on the same set of tasks and report how each one performs: speed, cost, how many answer tokens and reasoning tokens it consumes, and intelligence versus output tokens. From those measurements we get an idea of whether a model is actually beneficial for us or not. So in this video we will do a detailed analysis.
You also know that Antigravity 2.0 was published yesterday. If you download it, you get this new look for the IDE. Previously the model selection was on the right-hand side, and the left-hand side looked like a VS Code fork. Now they have changed it to a Codex-like look, or you could say a Claude desktop kind of look. On the left-hand side there is the full chat history, and here is the model selection option.
Similarly, if you go to Codex, you will find the same model selection option there. I am personally a big fan of this kind of interface: the conversation history on the left-hand side and the model selection in the middle, where you can select 3.5 Flash High or 3.5 Flash Medium. Now, let's come back to why 3.5 Flash is so important.
We have this chart from CursorBench. CursorBench is Cursor's internal benchmark system, comparable to benchmarks like HumanEval, SWE-bench Pro, and SWE-bench Verified. Cursor says that SWE-bench Pro and SWE-bench Verified use predefined questions, and much of that benchmark content is a single coding issue or a single GitHub issue that you need to solve.
But CursorBench is based on real-world agentic tasks. If you scroll down this blog post, you will see this portion.
"We built CursorBench to measure multiple dimensions of agent performance, including solution correctness, code quality, efficiency, and interaction behavior. This blog focuses on solution correctness results, but in practice we evaluate agents across all of these axes." So this is meant to reflect how a real developer solves a task. They take a model, Gemini 3.5 Flash or Opus or Kimi K2.6, and use it like a real developer solving real-world agentic tasks. And this is the benchmark score: Gemini 3.5 Flash is here at 49.8%, and it costs around $1.94.
The interesting thing is that Cursor's own models, Composer 2 and Composer 2.5, score higher than Gemini 3.5 Flash, and their cost per task is lower. Now, you may say that these Composer models are Cursor's own, so Cursor can do anything to make its models look good. That is a fair point, but from posts all over social media, on Twitter and Reddit, I have found that Composer 2.5 is performing really well. I don't have the Cursor Pro subscription, so I have not been able to test this model myself, but the feedback from social media, from top creators, and from some YouTube videos says that Composer 2.5 performs really well. Composer 2 was also great. So Gemini 3.5 Flash sits below Composer 2 at a higher cost, but it is beating the open-source model Kimi K2.6. Okay.
Now, between Gemini 3.5 Flash and Kimi K2.6, which one should you use? For that you need to go to Artificial Analysis and look at the output tokens chart. Here you see the token usage for Kimi K2.6.
It is around 170 million, and the token usage for Gemini 3.5 Flash is here.
It is around 73 million. So for the same tasks, Kimi K2.6 and Gemini 3.5 Flash give almost the same performance and almost the same score, but Gemini 3.5 Flash uses far fewer tokens. So it is a token-efficient model. That is what we learn from this chart. Now let's come to cost efficiency.
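To make the token-efficiency gap concrete, here is a minimal sketch that computes the ratio from the two approximate totals quoted above. The dictionary and function names are hypothetical, and the figures are the rounded values read off the chart, not exact data.

```python
# Approximate output-token totals quoted in the video, in millions.
# These are rounded chart readings, not exact measurements.
TOKENS_MILLIONS = {"Kimi K2.6": 170, "Gemini 3.5 Flash": 73}


def token_ratio(heavy: str, light: str, usage: dict) -> float:
    """Return how many times more tokens `heavy` used than `light`."""
    return usage[heavy] / usage[light]


ratio = token_ratio("Kimi K2.6", "Gemini 3.5 Flash", TOKENS_MILLIONS)
# Share of tokens saved by the lighter model relative to the heavier one.
savings = 1 - TOKENS_MILLIONS["Gemini 3.5 Flash"] / TOKENS_MILLIONS["Kimi K2.6"]

print(f"Kimi K2.6 used {ratio:.2f}x the tokens of Gemini 3.5 Flash")
print(f"Gemini 3.5 Flash used {savings:.0%} fewer tokens")
```

On these rounded figures, Kimi K2.6 consumes a little over twice as many tokens for a similar score.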
As for cost, I already told you in my last video that the output token cost of Gemini 3.5 Flash is $9, right? Here you see the cost is around 1,552.
And the previous cost, for Gemini 3.1 Pro Preview, was 892.
So it has almost doubled in the case of 3.5 Flash. Okay. Now, in the case of Kimi K2.6, here you see the cost is $942 $48, which is much less than 3.5 Flash.
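A quick check of how far apart the two Gemini cost figures quoted above are. This is only a sketch: the variable names are hypothetical, and the values are the approximate numbers read out in the video.

```python
# Approximate benchmark-run costs quoted in the video (same unit for both).
COST_GEMINI_35_FLASH = 1552
COST_GEMINI_31_PRO_PREVIEW = 892

# Ratio of the newer model's cost to the older model's cost.
cost_ratio = COST_GEMINI_35_FLASH / COST_GEMINI_31_PRO_PREVIEW
# Relative increase expressed as a fraction of the older cost.
increase = cost_ratio - 1

print(f"Cost ratio: {cost_ratio:.2f}x")
print(f"Increase: {increase:.0%}")
```

On these figures the increase is about three quarters of the earlier cost, which is large but somewhat short of a full doubling.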
So one thing is clear from this difference: Kimi K2.6 is actually great for high-reasoning purposes. It thinks a lot, so it suits hard reasoning, and the cost is also lower because we know that the pricing for Kimi K2.6 is very low.
Okay, now look at speed and latency.
Gemini 3.5 Flash is here.
It is a very fast model in output speed, and also a low-latency model. For Kimi K2.6 the speed is 90, while for Gemini 3.5 Flash it is 211, which means it is roughly 2.3x faster than Kimi K2.6.
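The speed comparison can be checked with quick arithmetic. The sketch below assumes the two figures are output speeds in tokens per second (the video does not state the unit), and the 10,000-token response length is a hypothetical value chosen only for illustration.

```python
# Output speeds quoted in the video; tokens per second is an assumption.
SPEED_KIMI_K26 = 90
SPEED_GEMINI_35_FLASH = 211

# How many times faster Gemini 3.5 Flash is on these figures.
speedup = SPEED_GEMINI_35_FLASH / SPEED_KIMI_K26


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a constant rate."""
    return tokens / tokens_per_second


RESPONSE_TOKENS = 10_000  # hypothetical response length
kimi_time = generation_seconds(RESPONSE_TOKENS, SPEED_KIMI_K26)
gemini_time = generation_seconds(RESPONSE_TOKENS, SPEED_GEMINI_35_FLASH)

print(f"Speedup: {speedup:.2f}x")
print(f"Kimi K2.6: {kimi_time:.1f}s, Gemini 3.5 Flash: {gemini_time:.1f}s")
```

This ignores latency before the first token, so it only compares streaming throughput.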
Okay. So on speed, intelligence, and latency, you can obviously go for Gemini 3.5 Flash.
And CursorBench is saying the same thing: on real-world agentic use cases, Gemini 3.5 Flash really does beat Kimi K2.6. If you look at the overall intelligence score, Gemini 3.5 Flash is here and Kimi K2.6 is here: 55.3 and 53.9.
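One rough way to read "intelligence versus output tokens" is to divide each intelligence score by the tokens spent to earn it. The sketch below does that with the figures quoted in the video. Score per million tokens is an illustrative metric assumed here, not one the video or Artificial Analysis defines, and the names are hypothetical.

```python
# (intelligence score, approximate output tokens in millions), as quoted.
MODELS = {
    "Gemini 3.5 Flash": (55.3, 73),
    "Kimi K2.6": (53.9, 170),
}


def score_per_million_tokens(score: float, tokens_millions: float) -> float:
    """Illustrative efficiency metric: score points per million tokens."""
    return score / tokens_millions


efficiency = {
    name: score_per_million_tokens(score, tokens)
    for name, (score, tokens) in MODELS.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} points per million tokens")
```

The scores are close, so nearly all of the difference in this metric comes from the token counts.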
So it is also officially visible that 3.5 Flash is better than Kimi K2.6. Gemini 3.5 Flash is also beating the low versions of GPT 5.5 and Opus 4.7, and you know that those low versions are very capable too. So seeing this model beat the low versions of top-tier models from GPT and Claude is great. They have also announced that they are going to release the Gemini 3.5 Pro model in the coming months, maybe in the middle of June.
So let's wait for that. I think the Pro model will land around here, okay? Then I will make a separate video and do the full analysis again. Yesterday I did full, detailed testing of Gemini 3.5 Flash. The output was really much better than the previous Gemini series models. They have improved a lot, and its capability sits somewhere between Sonnet and Opus.
So please watch that video. If you are saying that Gemini 3.5 Flash is not good, it means you have not tested it properly yet. So please do the testing, and please consider this a really great model. I hope you found this detailed analysis of 3.5 Flash helpful. If you did, don't forget to subscribe to this channel and like this video. See you guys in the next video. Thanks for watching. Bye-bye. Take care.