
Hidden inside Gemma 4 — the inference trick from 2022 #AI #GoogleAI
Added: 2026-05-17

11,120 views · 464 · 2:52 · DIYSmartCode · Originally published: 2026-05-11

A sharp reminder that modern AI breakthroughs are often just clever engineering tricks recycled from years ago. It effectively demystifies how Google trades predictability for speed to keep their massive models practical.

[00:00:00]Two language models running side-by-side.

[00:00:02]Faster than one running alone. That is not a trick.

[00:00:06]That is how the fastest LLMs ship today.

[00:00:09]Here is what is actually slow about a normal language model.

[00:00:14]You feed it a prompt. It runs one full forward pass.

[00:00:17]It produces one token. You append that token.

[00:00:21]You run the whole model again. For the next token.

[00:00:24]And the next. A 70 billion parameter model.

[00:00:28]Spinning up for one word.
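The loop described above can be sketched in a few lines. This is a minimal illustration, not a real inference stack: `toy_forward` is a hypothetical stand-in for a full model forward pass, and it simply looks up the next word of a fixed sentence.

```python
from typing import Callable, List, Tuple

def greedy_decode(
    forward: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int,
) -> Tuple[List[str], int]:
    """Generate one token per forward pass; return (tokens, forward passes used)."""
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        next_token = forward(tokens)  # one full forward pass over the whole model
        passes += 1
        tokens.append(next_token)     # append the token, then run the model again
    return tokens, passes

# Hypothetical stand-in for a real model: returns the next word of a fixed sentence.
SENTENCE = "why did the chicken cross the road".split()

def toy_forward(tokens: List[str]) -> str:
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else "<eos>"

out, passes = greedy_decode(toy_forward, SENTENCE[:4], max_new_tokens=3)
print(out, passes)
```

The point of the sketch is the cost model: three new tokens take three sequential forward passes, no matter how easy each token is to predict.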

[00:00:31]But here is the thing.

[00:00:33]Most tokens do not need a giant model to predict.

[00:00:36]The next word after "why did the chicken" is going to be "cross."

[00:00:40]A tiny model gets that right.

[00:00:42]The next word after "the quadratic formula is"? Every model gets that right.

[00:00:47]So why are we paying for the giant model on the easy parts?

[00:00:51]This is speculative decoding. Published by Google in November 2022.

[00:00:56]Two models. A small fast draft model. A big slow target model.

[00:01:00]The draft model speculates four tokens ahead.

[00:01:03]Cheap and fast. The target model verifies all four.

[00:01:07]In a single parallel pass. Match the draft.

[00:01:10]Accept the tokens. Disagree somewhere.

[00:01:12]Reject from that point.

[00:01:14]The target corrects it. By the way, multiple tokens.

[00:01:18]For the price of one forward pass. Best case.

[00:01:21]K plus one tokens per round. Worst case.

[00:01:24]Still one token. Same as without it.
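One round of the draft-then-verify scheme can be sketched as follows. This is a simplified greedy variant under stated assumptions: both models are hypothetical callables that map a token prefix to their single most likely next token, and the target's parallel verification pass is simulated with a loop. The published method works on sampled tokens with a probabilistic accept rule, which this sketch does not implement.

```python
from typing import Callable, List

Model = Callable[[List[str]], str]  # maps a token prefix to its greedy next token

def speculative_round(draft: Model, target: Model, tokens: List[str], k: int) -> List[str]:
    """Run one round and return the new tokens it produces (between 1 and k + 1)."""
    # 1. The draft model speculates k tokens ahead (k cheap sequential calls).
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The target scores every position. A real implementation does this in a
    #    single batched forward pass; here the k + 1 positions are simulated in a loop.
    target_next = [target(tokens + proposal[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which the draft agrees with the target.
    accepted: List[str] = []
    for i in range(k):
        if proposal[i] != target_next[i]:
            break
        accepted.append(proposal[i])

    # 4. The target always contributes one token: the correction at the first
    #    mismatch, or a bonus token when all k draft tokens were accepted.
    accepted.append(target_next[len(accepted)])
    return accepted

# Hypothetical toy models: the target follows a fixed text.
TEXT = "why did the chicken cross the road ?".split()

def toy_target(prefix: List[str]) -> str:
    return TEXT[len(prefix)] if len(prefix) < len(TEXT) else "<eos>"

def bad_draft(prefix: List[str]) -> str:
    return "???"  # always wrong

best = speculative_round(toy_target, toy_target, TEXT[:3], k=4)  # draft always agrees
worst = speculative_round(bad_draft, toy_target, TEXT[:3], k=4)  # draft never agrees
print(len(best), len(worst))
```

With a draft that always agrees, the round yields k + 1 tokens; with a draft that never agrees, it still yields the one token the target would have produced on its own.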

[00:01:26]Average.

[00:01:28]Two to three times faster. Same output.

[00:01:30]Identical distribution.

[00:01:32]No quality loss. The original paper proved it on T5-XXL.
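The "identical distribution" claim rests on the accept/resample rule used when tokens are sampled: a draft token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution, and on rejection a token is resampled from the normalized residual max(0, p - q). The sketch below checks that rule exactly for a single token position using rational arithmetic. The two distributions are made-up illustrative numbers, not measurements from any model.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]

def speculative_output_dist(p: Dist, q: Dist) -> Dist:
    """Exact distribution of one token produced by the accept/resample rule.

    p is the target distribution, q the draft distribution, over the same tokens.
    """
    # Chance that x is drafted and accepted: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accept_mass = {x: min(p[x], q[x]) for x in p}
    reject_prob = 1 - sum(accept_mass.values())

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = {x: max(Fraction(0), p[x] - q[x]) for x in p}
    residual_total = sum(residual.values())

    out: Dist = {}
    for x in p:
        resampled = reject_prob * residual[x] / residual_total if residual_total else Fraction(0)
        out[x] = accept_mass[x] + resampled
    return out

# Illustrative (made-up) distributions: the draft is overconfident in "cross".
p = {"cross": Fraction(7, 10), "walk": Fraction(2, 10), "fly": Fraction(1, 10)}
q = {"cross": Fraction(9, 10), "walk": Fraction(1, 10), "fly": Fraction(0)}

out = speculative_output_dist(p, q)
print(out == p)
```

Even though the draft never proposes "fly", the resampling step restores its probability mass, so the output distribution matches the target's.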

[00:01:37]But the speedup is not constant. Math.

[00:01:39]Code.

[00:01:40]Boilerplate. The draft is right almost every time.

[00:01:44]Massive speedup. Creative writing. Open prose.

[00:01:48]Every word branches a hundred ways. The draft guesses wrong.

[00:01:51]The speedup collapses. Same model. Same prompt.

[00:01:55]Different speed depending on what you ask.
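A rough way to see why the speedup depends on the content: if each draft token is accepted independently with probability alpha (a simplifying assumption, since real acceptance varies by position), a round with k draft tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The sketch below evaluates that expression; the alpha values are illustrative, not measured acceptance rates.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha. Counts tokens only; the draft model's own cost is ignored.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative acceptance rates: predictable text versus open-ended prose.
for alpha in (0.9, 0.5, 0.1):
    print(alpha, round(expected_tokens_per_round(alpha, k=4), 3))
```

High acceptance rates push the result toward the best case of k + 1 tokens per round, while low rates leave it close to one token per round. Wall-clock speedup is lower than this token count suggests because running the draft model is not free.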

[00:01:58]And it is not just a trick anymore. It is becoming architecture.

[00:02:02]Google's Gemma 4 ships its own drafter built in.

[00:02:05]Multi-token prediction head, 76 million parameters.

[00:02:10]Sharing the target model's KV cache.

[00:02:12]That is speculative decoding evolving.

[00:02:15]From an inference hack into the model itself.

[00:02:18]Want to try it today? LM Studio has it built in.

[00:02:22]Load your main model, pick a smaller draft with the same vocabulary.

[00:02:26]Start chatting. For production, vLLM takes one flag.

[00:02:30]Pass the speculative model and the number of speculative tokens.

[00:02:34]Same trick, identical output. So, next time someone tells you that you cannot make an LLM faster without losing quality, they are wrong.

[00:02:44]The math has been here since 2022. If you want to learn more about AI, check out the Dynamis AI community.

#ai #ai coding #artificial intelligence
