
Take LLMs out of the Cloud and run on Device
Added: 2026-05-12

2,462 views | 309 | 5:38 | kiraa_ai | Original Release: 2026-05-10

Dr. Brandt delivers a masterclass in translating complex model compression into a clear vision for the future of private, on-device AI. It is a pragmatic guide to liberating intelligence from the cloud without sacrificing the core logic of larger systems.

[00:00:00]In a previous video, I explained that one of the ways AI models are being made small enough to run on device is through quantization. It works by trading some of the model's accuracy for size so that it can run on smaller devices.

[00:00:14]Last weekend, I even got a 4-bit quantized version of a model working on an iPad with surprisingly good response time and reasonable accuracy. So, if quantization works by rounding the numbers to make a model smaller, what other techniques can we use?

[00:00:30]Quantization is one, but there's another one that's probably even more interesting, which is called distillation. And like quantization, it's something that humans have been doing for centuries in other fields, but we just never called it that. So, in this video, I want to explain in plain English what distillation is.

[00:00:47]And to do that, let's roll the clock back 100 years and think about how an apprentice blacksmith might learn from a master. The master might have spent 40 years hammering steel, making mistakes, learning which metals work for which job, knowing the exact color of the steel, when to strike and when to wait.

[00:01:06]But the apprentice doesn't need to repeat all 40 years. They can watch the master and they can copy the technique.

[00:01:12]They can absorb all that experience in a fraction of the time. Now, the apprentice will never be as good as the master, but for some jobs they can produce work that's almost as good, and they can do it faster, cheaper, and in more workshops. And that is distillation.

[00:01:28]So in AI terms, you can take a big model, maybe with hundreds of billions of parameters, and then you throw a few thousand or a few million questions at it. The big model will answer those questions, and then you train a smaller model to mimic those answers. The small model is not learning from the raw data in the same way that the big one did. It's learning from the output of the big model. So effectively, it's copying the master. And quite often, the smaller model punches well above its weight because it's not learning from scratch. It's learning from something that the big model has already figured out, and leaving a lot of the unnecessary information out. So let's distinguish this from quantization.
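The teacher-and-student loop described above can be sketched in a few lines. This is a toy illustration, not how any production model is distilled: the `teacher` function is a hypothetical stand-in for the big model, the `student` is a three-parameter logistic model, and names such as `make_questions`, `distill`, and `agreement` are assumptions made for this example. The student never sees ground-truth labels, only the teacher's answers.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher(x):
    """Toy stand-in for the big model: returns P(answer = 1) for question x."""
    return sigmoid(2.0 * x[0] - 1.0 * x[1] + 0.3 * math.tanh(x[0] * x[1]))


def student(params, x):
    """Much smaller model: a single logistic unit with three parameters."""
    w0, w1, b = params
    return sigmoid(w0 * x[0] + w1 * x[1] + b)


def make_questions(n, seed):
    """Hypothetical 'questions': random 2-D points in [-2, 2] x [-2, 2]."""
    rng = random.Random(seed)
    return [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n)]


def distill(questions, epochs=20, lr=0.1):
    """Train the student to mimic the teacher's soft answers."""
    params = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x in questions:
            target = teacher(x)                  # the teacher answers the question
            error = student(params, x) - target  # cross-entropy gradient w.r.t. the logit
            params[0] -= lr * error * x[0]
            params[1] -= lr * error * x[1]
            params[2] -= lr * error
    return params


def agreement(params, questions):
    """Fraction of questions where student and teacher give the same yes/no answer."""
    same = sum((student(params, x) >= 0.5) == (teacher(x) >= 0.5) for x in questions)
    return same / len(questions)


if __name__ == "__main__":
    learned = distill(make_questions(2000, seed=0))
    print("agreement on unseen questions:", agreement(learned, make_questions(500, seed=1)))
```

The student cannot reproduce the teacher's nonlinear term, so it is not an exact copy, but it can still agree with the teacher on most questions, which is the "almost as good for some jobs" trade-off.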

[00:02:10]Quantization keeps the same model but stores the numbers with less precision.

[00:02:15]It's the same brain, but with less detail in those connections. But distillation is different. Distillation usually produces a new, smaller model that's trained on the outputs of a bigger one. It might use a different architecture, or it might be a smaller model from the same family. But the most important part is that it's not the original model with fewer bits. It's a student model trained to copy the behavior of the teacher model. So if we use the photo example from last time, quantization is like taking a high-resolution photo and saving it as a smaller JPEG. But distillation is more like hiring an artist to paint a copy of the photo on a smaller canvas. So it's not the same image. It's not even a compressed version of the original image. But if the artist is good, they can capture most of the detail that mattered in the original. And that's why distillation is really good for on-device AI. You can take a huge model that runs in a data center and use it to teach a smaller model that can run on a laptop, a workstation, or perhaps even a phone.
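To make the quantization side of that contrast concrete, here is a minimal sketch of rounding weights to 4-bit integers. It assumes a simple symmetric scheme with one shared scale factor. Real 4-bit formats are more elaborate, and the function names and sample weights below are invented for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-7, 7] plus one shared scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    # Round to the nearest code, then clamp as a guard against float drift.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the stored codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.82, -0.31, 0.05, -1.40, 0.67]  # made-up example weights
    codes, scale = quantize_4bit(weights)
    restored = dequantize(codes, scale)
    for w, c, r in zip(weights, codes, restored):
        print(f"original {w:+.2f} -> code {c:+d} -> restored {r:+.2f}")
```

Every weight is still there, and each one is off by at most half a step of the scale: the same brain, with less detail in the connections. Distillation, by contrast, trains a separate set of weights.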

[00:03:21]Now, the small model will not be as capable across every single task, it's not going to know everything, and it won't work as well in every single situation. But for many use cases, it could be good enough. And if it's good enough, the economics change completely, because now the model can be run in your office, on your hardware, under your control.

[00:03:46]Distillation became a much bigger story when DeepSeek released its R1 family of models in 2025.

[00:03:52]The headline that everyone grabbed was that DeepSeek had trained a highly capable model for only $6 million US.

[00:03:59]And that number shook the entire industry, so much so that Nvidia lost a couple of hundred billion in market cap in a single day. But to be precise, the $6 million figure was only the GPU cost of the final training run. It didn't include all the research, the failed experiments, or the infrastructure behind the model. So DeepSeek didn't build the whole thing for $6 million, but that didn't really matter, because what they did was prove that they could build a model that was dramatically more efficient. And distillation was a key part of that story.

[00:04:35]And that's why when I look at where AI is going, I keep coming back to the same conclusion. The most interesting stuff is not going to be in the data center.

[00:04:42]It's going to be in the Mac Mini or the Mac Studio sitting on someone's desk.

[00:04:48]Distillation, along with quantization and other techniques, is a way of taking workloads out of the cloud and putting them much closer to where the work is being done. This is the master teaching the apprentice, and then the apprentice goes off and does the work, and then the work can get done in more and more workshops, by more people, at lower cost. This is AI finally growing up. I'm Dr. Errol Brandt, the founder of Kiraa. If you enjoyed this video, please consider liking and subscribing. And I'd love to hear your thoughts in the comments below, even if you disagree with me. Thanks again for watching, and I'll see you in the next one.

#business analytics #kiraa #innovation
