AI model benchmarks like LMArena measure human preference through head-to-head matchups rather than raw capability, so a model can rank highly by being faster and more agreeable without necessarily being smarter. This explains why Gemini 3.5 Flash's strong Arena performance may not fully reflect its actual reasoning capabilities, and why real-world evaluations are essential for assessing model quality.
Deep Dive
Gemini 3.5 Flash Beats Opus 4.7... What Is Going On?
Two big Google leaks dropped today, and I/O is only four days out. There's a new Gemini 3.5 Flash checkpoint showing up in Arena with outputs that don't really make sense for a Flash model, and there's a new agent feature called Gemini Spark leaking across X. So, let's go through both.
There's a new Gemini checkpoint that just showed up on LMArena, and the outputs it's producing are kind of hard to believe for a Flash model. It's actually listed under the name Gemini 3 Flash, which is a confusing label, but based on what people are testing, this is almost certainly the 3.5 Flash checkpoint in stealth. The interesting part is that its outputs are reportedly outperforming Gemini 3.1 Pro and also coming out ahead of Claude Opus 4.7. So a model that is Flash tier on paper is going head-to-head with two of the stronger reasoning models out right now.
But that's not how the tiering is supposed to work. I covered the last round of Gemini checkpoint leaks a few days ago, and at that time the leaks were pointing at Gemini 3.2 Flash and Gemini 3.2 Pro, and the outputs from those were pretty underwhelming. A lot of people were wondering what Google was actually doing, because those models did not seem competitive. But this checkpoint feels different. The outputs people are pulling out of it look noticeably more capable: cleaner generations, better one-shot web applications, and it's fast in the way Flash models are supposed to be. Which makes me think of a couple of things.
One, Google might be feeling real competitive pressure right now. OpenAI has been shipping constantly. Anthropic just dropped Opus 4.7, and Gemini has been quiet for a long time. So, it's possible the team stopped holding back and they're trying to release something very competitive.
But two, and this is the part I keep coming back to, Arena is a vibes benchmark. It measures human preference in head-to-head matchups, not raw capability. A model can win on Arena by being faster and more agreeable without actually being smarter. So before we read too much into the ranking, we need to see this on real reasoning evaluations.
The other thing to keep in mind: Google I/O is on May 19, which is four days from now. So, whatever Google officially announces is probably going to be tied to this checkpoint or some version of it. And honestly, in my opinion, for Gemini to stay competitive in this cycle, what they ship needs to actually land. Not just be a bit better than 3.1; it needs to be good enough that builders consider switching over.
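To make the point about Arena being a preference benchmark concrete, here is a minimal sketch of how head-to-head votes can turn into a ranking. It uses a plain online Elo update, which is a simplification chosen for illustration and not necessarily the method the leaderboard itself uses. The model names, the vote counts, the starting rating of 1000, and the K-factor of 32 are all made-up assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(winner_rating: float, loser_rating: float, k: float = 32.0):
    """Return the new (winner, loser) ratings after one head-to-head vote."""
    # The winner gains more when the win was unexpected.
    delta = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + delta, loser_rating - delta


def run_votes(votes, start: float = 1000.0):
    """Replay (preferred_model, other_model) votes and return the final ratings."""
    ratings = {}
    for winner, loser in votes:
        w = ratings.setdefault(winner, start)
        l = ratings.setdefault(loser, start)
        ratings[winner], ratings[loser] = update(w, l)
    return ratings


# Hypothetical votes: voters prefer the faster, more agreeable model 7 times
# out of 10. Nothing in these votes records which answer was actually correct.
votes = [("fast_agreeable", "careful_reasoner")] * 7
votes += [("careful_reasoner", "fast_agreeable")] * 3

ratings = run_votes(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

The only input is which answer people preferred, so the rating can only reflect preference. That is why a high Arena rank still needs to be checked against reasoning evaluations.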
So, short version: there's a new Gemini Flash checkpoint floating around. The updated outputs look way better than the leaks from a couple of days ago, and the Arena ranking is interesting if it holds up, but this is still pre-release with no official confirmation from Google DeepMind yet. So, we'll know a lot more on Monday.
And speaking of what people are generating with this thing, someone built a working Minecraft clone with it, and this is probably the best one I've seen so far. So, let me show you guys right now.
Before we continue, we just launched the Universe of AI newsletter. If you want to stay on top of AI news without having to hunt for it, the link is in the description. Don't miss out.
All right, I'm going to show you guys the game that somebody built with the new checkpoint, which is crazy to me because they added multiplayer. So, we'll test whether the servers actually work, but if you look at the credits, they tell you that the game was created using React 19, Vite, Three.js, and Tailwind CSS. This is just a demo, and it's obviously trying to be a Minecraft clone. So, we'll go back to the main menu. We'll go to single player, and then I can set my name, the world seed, I guess, everything like that. Difficulty, I'll put it on easy for now because I'm not that good, and when I tried playing it before, I was kind of getting decked by the zombies. And we'll do survival mode, create world, and start.
So, you can see we have zombies actually being generated. And I can look around at everything. Can I hit this guy? So, I can. I've got to run away before it blows up. But they're all chasing me. You can see that. So, this is the first time I've seen, number one, a zombie mode being added to one of these Minecraft clones. And we have animals and everything. This actually looks pretty good. Like, this is probably the best. Oh, I don't know what happened over there. I think I got launched into something. What exactly happened? Pretty glitchy. Oh. Oh. Oh, I died. But anyways, that was the thing.
Let's just go back to creative mode so we can see. Or actually, let's try out the multiplayer mode. I don't know if this is going to work, I'll be honest. We'll do this one and then just connect. It says connecting to the lobby. Yeah, I have a feeling that this is not going to work. Like, I don't see anyone else.
But yeah, I'm not that good. I'm just going to try, what do you call it, single-player mode on creative, because I keep on getting decked. Okay, maybe the game is a little bit glitched, because I picked creative and I'm still getting zombies in it. I'm just going to fly to the top. But you guys can see, this is the best generation I've seen so far.
The world is definitely not being generated fully. Like, this is all I have. It's a simulation, so this is all I'm going to see. But it's not bad for what we're able to see. And we can actually place stuff as well. But for some reason, it glitches out and sends me back to the base that I respawn in. So that's not the fun part, but other than that, I can move around and look at the world. So this is pretty good.
But not everything is great about this new checkpoint. The first thing people are noticing is that the Gemini UI style, the cool vibe a lot of people really liked, was completely removed, and it's more GPT-5 now. So I guess they're kind of copying GPT-5.5 on UI design.
Also, it's really fast and smart for a Flash model, but it's not doing a good job when it comes to prompt adherence. For example, this person told it multiple times not to use web, and it still did. So, the model's clearly not listening, and it keeps filling websites with panels even when the prompt asked for restraint. Maybe this is just a glitch in the checkpoint, but this is something that Google definitely needs to fix before they ship. Because if the model can't understand the prompt properly and stick with it, then no matter how good or fast or capable it is, it's not producing the results that people want. So, obviously, it's not going to be as useful as a model that actually listens to people and sticks to the prompt. But, as I said, this is a 3.5 Flash checkpoint. Hopefully, we do get the new Flash model and it comes with better prompt adherence. That would make things much better.
There are also screenshots of a new Google feature called Gemini Spark.
And from the looks of it, this is going to be Google's actual agent experience, most likely shipping inside the existing Gemini app rather than as a standalone product. The landing screen says, "What can I get done for you today?" with a few starter prompts like declutter your inbox, get meeting briefs, and get a custom news digest. So, this is being positioned as a task-completion surface, not another chat interface.
The part that's really interesting is the skills creation flow. There's a create-skill screen with a title field, a description field, and an instructions field, which, if you've been paying attention, is basically the same pattern Anthropic shipped with Claude a few months back. You're writing little procedural recipes that the agent can pull in when it needs to do a specific kind of task. One small thing worth pointing out based on the leaks: there's still no option to import a skill.md file directly. You still have to copy and paste into the instructions box, which is a weird limitation if they're trying to compete with the open skills format people are already building with, but hopefully that changes before the launch.
The other detail from the leaks: Spark will apparently have an agent mode and a chat mode with what they're calling advanced tool use, but there's no sign of browser use or computer use in any of those screenshots. So whatever tools it reaches for are probably going to stay inside the Google ecosystem for now, or be limited to plugins that they add. So think Gmail, Calendar, Docs, that kind of thing, which actually makes sense as a strategy. Google's strongest moat isn't the model. It's the fact that most people already have their email, their docs, and their calendar inside Google.
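Going back to the skills flow for a second: since the leaks suggest you would have to paste a skill in by hand, here is a rough sketch of splitting a skill.md-style file into the three fields described above (title, description, instructions). Everything here is an assumption for illustration. The `Skill` class and `parse_skill_markdown` function are hypothetical names, the file layout (a `---` delimited header with `name` and `description` keys, followed by free-form instructions) is one common convention for skill files, and nothing is known about what Spark will actually accept. The header parsing is hand-rolled `key: value` splitting, not a full YAML parser.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical container mirroring the three fields in the leaked screen."""
    title: str
    description: str
    instructions: str


def parse_skill_markdown(text: str) -> Skill:
    """Split a skill file into header fields and an instructions body.

    Assumes a header delimited by '---' lines containing simple
    'key: value' pairs. This is a sketch, not a YAML parser.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("expected a header starting with '---'")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("header is not closed with '---'") from None

    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return Skill(
        title=meta.get("name", ""),
        description=meta.get("description", ""),
        instructions=body,
    )


# Made-up example skill, based on the "declutter your inbox" starter prompt.
example = """---
name: inbox-declutter
description: Triage unread email into archive, reply, and follow-up piles.
---
1. List unread messages from the last 7 days.
2. Archive newsletters and receipts.
3. Draft short replies for anything addressed directly to me.
"""

skill = parse_skill_markdown(example)
print(skill.title)
print(skill.description)
print(skill.instructions)
```

With something like this, the three printed values are what you would copy into the title, description, and instructions boxes by hand until a direct import option exists.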
So if Spark can reliably do real work across those surfaces, that's a more defensible product than another chatbot. But what's funny is that Anthropic as well as OpenAI have plugins that connect with Google's ecosystem. So it's, you know, not really an amazing strategy for them.
The timing on this is also the part that ties it all together. Google I/O, once again, is on Monday, May 19th. So we've got a new Gemini Flash checkpoint scoring weirdly high in Arena. We've got Spark leaking with agent mode and skills, probably landing inside the Gemini app. So Monday is shaping up to be a really big moment for Google. Whether they're able to pull it off is a different question, but the pieces they're putting on the table are starting to look competitive.
On Monday and throughout the I/O conference, I'll be covering any new announcements or updates on this channel. So if you want that breakdown the moment it drops, make sure you guys are subscribed. But that's it for today's video. Make sure you guys are subscribed to the channel. Follow our new newsletter as well at universeai.behive.com, subscribe to the main channel, World of AI, and support us on X by following the universe of AIZ as well.
Until then, I'll see you guys in the next