A necessary reality check on the million-token marketing hype that exposes the architectural fragility of self-attention. It effectively demystifies why a massive context window is often just a high-capacity buffer for forgetting.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Context Window Explained: Why 1M Tokens Still ForgetsAdded:
When Google shipped Gemini 3.0 Pro, paying subscribers ran their own tests.
[music] The model card said 1 million tokens. The actual usable window measured in the web app was about 32,000, 30 times smaller than the headline.
>> [music] >> This is the open secret of long context.
Every frontier model has the same gap.
Anthropic admits it in their system cards. [music] Google admits it in footnotes. Open AI admits it with the price surcharge that doubles the input rate above 272,000 tokens. The marketing number is the size of the room. The number you actually get [music] is the patch of floor where the model can still walk in a straight line.
This video is about that gap. Why your million token model still forgets, how the failure modes pile on top of each other, and what working developers do about it. A context window is not memory. It's a buffer. Every token your prompt contains has to fit inside a fixed-size [music] box the model holds in GPU memory while it computes one answer. [music] Your system message, your chat history, the file you pasted, the tool [music] output your agent just generated, all of it competes for the same space.
>> [music] >> When the box is full, the model cannot see anything else. Tokens outside the window do not exist as far as the model is concerned. The [music] right mental model is RAM. Not a notebook, not a brain, RAM. Volatile, fixed, physical.
And the size was decided by the hardware before your program ever ran. Two things people get wrong. The advertised number is usually [music] input plus output combined. GPT 5.5's 1 million includes everything the model thinks and emits.
Write a long reasoning trace and you eat your own context. Second, [music] the chat interface is faking continuity.
There is no state between requests. The platform [music] pastes the whole conversation back in every single turn.
That feeling that the model remembers you, it's a transcript on rewind.
Self-attention works like this. Every token in the window asks every other token, "How relevant are you to me?"
>> [music] >> Those scores get normalized into weights that sum to one.
The model then mixes information from each token in proportion to its weight.
That last detail is where it breaks.
>> [music] >> The more tokens you put in, the smaller the slice each one gets. At a thousand tokens, a key sentence might pull 30% of the attention [music] mass. At a million, that same sentence is fighting 999,000 neighbors for the same finite pie. The model is doing what you do at a party with a million people, nodding politely at everyone, listening to nobody.
Position is another problem.
Transformers are order [music] blind by construction. Modern models patch that with rotary position embeddings, which rotate each token's vector by an angle proportional to where it sits [music] in the sequence. Works great inside the training length. Works worse and worse outside it. Most million token models were trained on much shorter sequences and stretched up to the headline number with tricks that buy capacity, not fidelity.
Five years ago, GPT-3 had a 2,048 token window, >> [music] >> about the length of this section read aloud. May 2023, Claude jumped to 100,000. February 2024, Gemini hit 1 million. April 2025, Llama 4 Scout claimed 10 million. Today, every frontier model ships with at least a million. How did this happen? Not one trick. ROPE rescaling stretch position encodings after training. Flash attention cut the memory bill. Sparse [music] attention skipped most pairwise comparisons. Hybrid architectures mixed transformer layers with state space layers that scale linearly instead of quadratically.
>> [music] >> Stack the wins, the box gets bigger. The product pressure was simpler. Customers wanted to paste in their code [music] base, their contracts, their entire knowledge base, and stop building retrieval pipelines. [music] Vendors heard that and started racing on the headline number.
The failure is not one thing. Long context fails in five different ways at the same time. Each one alone is a few percentage [music] points. Stacked together, they're devastating. The first is called lost in the middle.
Information at the start of the window and the end of the window gets attended to more than information in the middle.
The accuracy curve looks like a U.
Studies showed GPT-3.5 [music] performed worse with the right answer buried in the middle of the prompt than with no documents at all. It's the model's version of skipping straight [music] to the syllabus and the conclusion and pretending it read the book. And it's not a training bug. MIT proved in 2025 [music] that it is structural to how causal masking moves information through a deep stack. [music] You cannot retrain it away. The second is context rot. Chroma Research tested 18 Frontier models in 2025. [music] Claude, GPT-4.1, Gemini 2.5, Gwen, every single one degraded as input length grew. Even on tasks they could solve trivially at short length.
>> [music] >> The weirdest finding, shuffling the context randomly made performance better. A coherent logical document gives the model more plausible looking distractors to confuse itself with than a pile of unrelated paragraphs does.
[music] The third is the literal match shortcut.
When your question and the answer share keywords, attention has an easy hook to grab onto. When they don't, the model has to do real semantic search across the whole window and it falls apart. The No Lima bench spark got GPT-4 O dropping from 99% accuracy >> [music] >> at 1,000 tokens to 70 at 32,000 on the same task [music] just by removing the keyword shortcut. The fourth is multi-needle reasoning. Single needle needle in a haystack test are saturated.
[music] Frontier models hit 99% at 1 million tokens when there is one fact to find.
>> [music] >> The moment you need two facts and a relationship between them, scores collapse. The most famous embarrassment is Llama for Scout. Meta claimed perfect retrieval at 10 million tokens. The community ran a narrative comprehension test at 128,000.
Scout scored 15%. [music] Gemini 2.5 Pro on the exact same test scored 90. Same length, same task, different planet. The fifth is instruction drift. Long agent loops forget the system [music] prompt. The instructions are still in the window, but their weight against the volume of tool calls and file reads piling up around them keeps dropping. [music] Deep into a session, the original prompt is whispering and the recent turns are shouting. The cleanest analogy I've seen for all of this, lifted from a Reddit comment, is being awake for 48 hours straight. [music] You can still see the room, you cannot focus. You confuse what was said an hour ago with what was said yesterday. You hallucinate connections that are not there. That is your million token model at hour three of an agent session. The bills tell the same story. Open AI doubles the input rate above 272,000 tokens for the entire session. Google charges more above 200,000.
Anthropic flat rates up to a million, which makes them the exception, not the rule. One developer fed Gemini AI Studio a series of short prompts on top of a 700,000 token history. 10 turns later, the bill was 121 pounds. Why? [music] Because AI Studio resubmits the entire conversation every single turn with no caching by default. 10 turns [music] to 120 quid.
Latency is just as ugly. Prefill on a million tokens takes many seconds before you see a single output token. Prompt caching helps if the prefix is byte identical between requests. Reorder a tools [music] array, shuffle your retrieve documents, change one character in the system prompt, and you cache miss. Pay the right premium and start over. Stable prefixes save you money.
Anything else does not. The first rule, repeated on every coding agent subreddit, is don't go past 30% of the window. Quality drops visibly past that mark on agent tasks. Filter first, load second. [music] A small rack prompt usually beats stuffing everything in.
Irrelevant tokens don't just add cost.
They make the model dumber.
>> [music] >> Put decisions in files, not chat history. cloud.md, a decisions log, anything the agent can reread on demand. Use sub agents.
Instead of one giant context with everything in it, spawn a sub agent with its [music] own fresh window. Give it a narrow task. Have it report back a one paragraph summary. The orchestrator [music] stays clean and restart don't recover.
The community calls it the three [music] strike rule. If the agent cannot debug something in three tries, the context is already polluted with the model's own failed attempts.
>> [music] >> Kill it, start fresh.
Three things to stop believing. A bigger context window does not mean a better answer.
>> [music] >> Every frontier model degrades with length. The data is overwhelming. Needle in a haystack does not measure long context. [music] It measures lexical retrieval at one position. Real tasks need multi-hop reasoning and that breaks much, much earlier. [music] A million tokens does not replace retrieval. For most production work, a focused 30,000 token rack prompt outperforms a million token dump on quality and on cost. Long context wins specific cases. [music] Legal cross-referencing, full code base reasoning, anywhere the document genuinely is the corpus. It does not win most cases.
>> [music] >> The advertised number is the size of the box. The effective number is the depth at which the model still attends, retrieves, and reasons cleanly. In 2026, those are not the same number and the gap is wider than the marketing implies.
Next [music] time you see a vendor lead with 1 million tokens, treat the number the way you treat a laptop spec with 32 GB of RAM. Yes, it's in the machine. No, you're not going to fill it with garbage and expect it to [music] run faster.
That gap between capacity and competence is the real story of long context.
[music] Until vendors lead with the second number, you're going to have to measure it yourself.
Related Videos
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 views•2026-05-29
Long-Running Agents — Build an Agent That Never Forgets with Google ADK
suryakunju
142 views•2026-05-30
5 Mind Blowing Omni Uses Cases
PaulJLipsky
1K views•2026-06-02
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K views•2026-05-28
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 views•2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 views•2026-05-29











