DeepSeek proves that architectural elegance beats brute-force scaling by turning spatial coordinates into a native language for reasoning. This shift commoditizes high-precision visual grounding while exposing the massive computational waste of current industry giants.
Deep Dive
DeepSeek Just Killed Visual Reasoning (And It's 10× Cheaper)
One of the biggest problems with DeepSeek V4 was that it's a text-only model, but it seems some users are now getting access to a vision version. DeepSeek also released a paper titled "Thinking with Visual Primitives," and then it kind of disappeared. The headline numbers are genuinely wild.
So for an 800x800 resolution image, this new model uses about 90 entries in its KV cache. In comparison, Sonnet 4.6 uses around 870, whereas Gemini 3 Flash uses [inaudible]. That's almost 10 times less. So it seems they have figured out how to create and serve extremely efficient vision models.
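As a rough check on that figure, here is a minimal sketch of the token arithmetic, using the encoder numbers quoted later in this video: 14x14 patches, a 3x3 merge of adjacent patches, and a further 4x KV-cache compression. The function name `kv_entries`, the flooring of partial patches, and the 800x800 input size are assumptions made for illustration, not details confirmed by the paper.

```python
def kv_entries(width, height, patch=14, spatial_merge=3, kv_compress=4):
    """Estimate image KV-cache entries for the pipeline described in the video.

    Rounding down at every step is an assumption; the real model may pad instead.
    """
    # One token per patch on a regular grid.
    grid_w, grid_h = width // patch, height // patch
    patch_tokens = grid_w * grid_h
    # 3x3 merge: nine neighboring patches become one token.
    merged = (grid_w // spatial_merge) * (grid_h // spatial_merge)
    # KV-cache compression by a further factor of four.
    entries = merged // kv_compress
    return patch_tokens, merged, entries


for side in (756, 800):
    patches, merged, entries = kv_entries(side, side)
    pixels_per_entry = (side * side) / entries
    print(
        f"{side}x{side}: {patches} patches -> {merged} tokens -> "
        f"{entries} KV entries (~{pixels_per_entry:,.0f} pixels per entry)"
    )
```

Under these assumptions a 756x756 image lands on 81 entries and an 800x800 image on 90, which lines up with the numbers quoted in the video.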
So, in this video, I want to do three things. First, I'll walk you through where this fits in DeepSeek's vision lineage, because this is not really a one-off paper. They have been telling the same story for two years now. Second, I'll explain the actual idea, which they call thinking with visual primitives, because it's a really clean concept once you see it. And third, we will look at the architecture, the benchmark numbers, and the limitations that they are not hiding.
Okay, so let's start with the lineage.
DeepSeek has shipped roughly seven vision-related models in the last 24 months. Back in March of 2024, they put out DeepSeek-VL. These were small, modest models at 1.3 and 7 billion parameters, using a hybrid SigLIP and SAM encoder. That was nothing flashy at the time, but it set the foundation. Then in October 2024 they released Janus.
This one mattered architecturally. They decoupled the visual encoder for understanding from the one for generation. Most unified multimodal models at the time had a single-encoder bottleneck where the model was forced to compromise. Janus basically said no: the idea was to run two encoders but share the transformer.
In December 2024 they released a second version of the vision language model.
This is where the efficiency story really starts. As with the other DeepSeek models, they ported mixture of experts and multi-head latent attention from V2 and V3 into vision. The tiny version had only 1 billion activated parameters, but it was scoring 809 on OCRBench and 88.9 on DocVQA. Small activations with really big numbers: that's kind of the pattern of DeepSeek's releases.

Then in January 2025 they released Janus-Pro-7B. This one went viral because it landed during the R1 moment and it had some really strong numbers on GenEval. It scored 80%, beating DALL-E 3, and the main thing was that you could run this model on a single consumer GPU. But the real breakthrough was October 2025, when they released DeepSeek-OCR.
This is the first version, and the framing was strange because they called it OCR but it really wasn't. The actual idea was: take a thousand text tokens, render them as an image, encode the image, and you get back 100 vision tokens that decode to the original text at 97% accuracy, which is 10 times compression on long context. Karpathy reacted to it like this: "The tokenizer must go. Pixels may be better inputs to language models than text." That's the moment a lot of people started taking DeepSeek's vision team seriously.

So if you ask what the through line is across all these models, it's basically one question: what's the cheapest representation that still works? VL2 said fewer activated parameters. In Janus, they said decouple the encoders.
In the OCR paper, they said compress text into pixels. And this paper says make spatial coordinates first-class tokens in the chain of thought.

Okay, so let's get into what the paper is actually arguing. There are two gaps in current multimodal models. The first one is the perception gap: the model cannot see finer details. Most of the work between 2024 and 2026 was on this: high-resolution cropping, dynamic patching, thinking with images.
That's all about seeing better. And we have seen some examples from frontier labs, like the Nano Banana models and the OpenAI image generation models. But the DeepSeek team is arguing that there is a second, more fundamental gap they call the reference gap. Even when the model sees perfectly, language is too imprecise to point.
Think about it like this. If I ask you which one is the third bear from the left on the rocky ledge, you can describe it in words. But you will lose track of which entity you are actually talking about as your reasoning gets longer. Humans solve this with finger gestures. Models until now didn't have an equivalent.

So here is what they actually do. The user asks, "Count the number of men in this image." It's a team photo, and it's very dense. Instead of trying to count in language and confusing itself, the model writes out, mid-thought, the bounding boxes of every person it identifies. Every single person has a coordinate, and the format is just inline tokens.
The model literally outputs a reference tag with a label, followed by a box tag with two corner coordinates. These are special tokens in the model's vocabulary. It's not function calling.
It's not a separate tool. It's part of the chain of thought. You can see why this is a really powerful idea.
Counting in dense scenes, multi-hop spatial reasoning, and even the famous Chihuahua versus muffin meme: it all becomes more reliable when the model can literally point to things.
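To make the inline format concrete, here is a minimal sketch of how a chain of thought with reference and box tags could be parsed and used for counting. The `<ref>…</ref><box>x1,y1,x2,y2</box>` syntax, the sample text, and the helper name `extract_boxes` are all hypothetical; the video only says that a labeled reference tag is followed by a box tag with two corner coordinates, so the real special tokens may look different.

```python
import re
from collections import Counter

# Hypothetical tag syntax, used only for illustration.
TAG = re.compile(r"<ref>(?P<label>[^<]+)</ref><box>(?P<coords>[^<]+)</box>")


def extract_boxes(chain_of_thought):
    """Return (label, (x1, y1, x2, y2)) for each inline reference in the text."""
    boxes = []
    for match in TAG.finditer(chain_of_thought):
        coords = tuple(int(v) for v in match.group("coords").split(","))
        if len(coords) != 4:
            continue  # skip boxes that do not have two corner points
        boxes.append((match.group("label").strip(), coords))
    return boxes


# Made-up reasoning trace for a small group photo.
thought = (
    "Back row first: <ref>man</ref><box>40,60,120,300</box>, "
    "<ref>man</ref><box>130,55,210,305</box>. "
    "Front row: <ref>woman</ref><box>80,200,160,420</box>, "
    "<ref>man</ref><box>170,210,250,430</box>."
)

boxes = extract_boxes(thought)
counts = Counter(label for label, _ in boxes)
print(counts["man"])  # 3
```

Because every counted entity carries its own coordinates, the final answer can be checked against the list of boxes instead of against a loose verbal description.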
All right, so let's look at how the model is built, because the efficiency angle is genuinely impressive, as with other DeepSeek models. The base architecture is pretty standard. The image goes into a vision transformer. Text goes into a tokenizer. Both feed into a large language model, and there is a detokenizer on the output side. The interesting part is the choices they made. The language backbone is DeepSeek V4 Flash, which is the smaller of the two V4 variants that were released. We're talking about a 284 billion parameter mixture of experts with 13 billion active parameters. So you're getting frontier-grade reasoning but only paying for 13 billion parameters at inference. And the visual encoder is where the efficiency story really shows up.
They built their own vision transformer from scratch, called the DeepSeek Vision Transformer, that supports arbitrary resolution. It uses 14x14 patches. So a 756x756 image, which is about 571,000 pixels, first becomes 2,916 patch tokens. Then they apply a 3x3 spatial compression along the channel dimension, folding nine adjacent patches into one. That brings it down to 324 tokens. Then their compressed sparse attention mechanism, which is from the V4 paper, compresses the KV cache by another factor of four. The end result is only 81 entries in the KV cache for the image. From pixels to KV cache entries, that's about a 7,000 times compression ratio.

I think this is the real story in this whole model release. They're not just shipping a vision model. They are shipping a vision model that costs roughly a tenth of what every other frontier model costs to run on the same image. And that's kind of crazy.

The training pipeline is also worth a quick look, because it's one of the more elegant designs I have seen recently. There are five stages. First, pre-training on trillions of multimodal tokens. Then specialized supervised fine-tuning, where they train two separate models: one for thinking with grounding using boxes, one for thinking with pointing using points.
Then specialized RL with GRPO on each, with three reward heads: format, quality, and accuracy. Then a unified RFT that merges them, and finally on-policy distillation into a single student model. So basically they are training two specialists and then consolidating them.

Okay, so let's talk about the benchmarks, because this is where I want to be careful. The most impressive numbers are on topological reasoning. On the maze navigation benchmark they constructed, this model scores about 67%, against 49% for Gemini 3 Flash, 50% for GPT-5.4, and 49% for Sonnet 4.6.
If you look at it, that's almost a 17-point gap over GPT. For path tracing, it's a very similar story: you're looking at almost double the score of the lowest competitor. On counting and spatial reasoning, they win some and tie some. Gemini 3 Flash is still ahead on raw count QA. But the topological tasks, maze and path tracing, are exactly the tasks where you would expect pointing to help, because language is uniquely bad at trajectory descriptions.
Now I want to flag something. The paper is honest about it, but a lot of coverage will skip it. There is a footnote that says the reported scores cover only a subset of evaluation dimensions that were directly relevant to the research focus of this paper and are therefore not indicative of the model's overall capabilities. So they are not claiming this beats GPT-5.4 across the board. They are claiming it beats GPT-5.4 on visually grounded reasoning tasks. That's a more honest claim, and I really appreciate that they put it in the paper. To be honest, I think DeepSeek does a much better job at honesty than some of the other model providers, because they always point out their limitations.

There are also three limitations they admit. First, the model is resolution bound, so fine-grained scenes can still trip it up. Second, the visual primitives mode currently has to be triggered explicitly. The model doesn't auto-decide when to use it yet.
And third, the point-based topological reasoning, the maze stuff, doesn't generalize well across scenarios. So it's not magic, but the direction is really interesting.

And here's the timing piece. On April 29, DeepSeek started rolling out a vision mode in the app and on the web, at least as a limited test alongside their fast and expert modes. So while the paper itself is hard to find right now, the model behind it appears to be in gradual rollout. I don't have access to it yet, but hopefully they are going to release it more widely pretty soon. The bigger story here, honestly, is what's underneath this model, and that is DeepSeek V4 Flash. The paper cites V4 as reference number three: highly efficient million-token context intelligence.
Okay, so that's where we're going to wrap it up today. I hope you found this useful. Do let me know in the comments if you would use a model like this when it becomes available. Anyways, thanks for watching and as always, see you in the next one.