A masterclass from the actual architects of the digital age on the invisible math that powers our screens. It is a rare, high-signal deep dive into the engineering brilliance that makes the modern internet possible.
Deep Dive
How video compression works - VLC lead developer explains | Lex Fridman Podcast
So the thing that we're talking about is everything around video codecs, video encoding, video decoding, video streaming, the video player client that I'm wearing on my head, the entire ecosystem enabling free media. We'll talk about FFmpeg. We'll talk about VideoLAN, VLC, and all the other incredible video technology that is used probably by billions of people. So JB, you're the lead developer behind the legendary VLC player. Kieran, amongst many other things, you're the lead developer behind the legendary FFmpeg handle on Twitter. And both of you have spicy opinions, I would say. So today I want to talk about FFmpeg and VLC.
For context, for people who are not aware, and I'm sure basically everybody listening to this has used these two technologies, probably regularly, without knowing it: FFmpeg underlies most video on the internet, including YouTube, Netflix, Chrome, Firefox, of course VLC, and countless other video platforms. It is estimated that over 90% of video processing workflows, online and offline, involve FFmpeg. VLC has been downloaded at least 6.5 billion times, but the real number is likely much higher, because it's impossible to really count.
It runs on virtually any operating system and supports virtually any media format.
The limitation being it can't open pancakes. So, can we just lay out some of the basics to help people understand what's involved in all of this? When we press play on a video player like VLC, what happens? How does it go from the file or the stream to the pixels on the screen and the sound on the speaker? What are the big stages to be aware of?
>> So, there are several stages, right? The first stage is to get from an address, which is a type of URL, to a stream of bytes. So this would be, for example, HTTP, file, DVD: it gives the path to the media and gives you a stream of data. The stream needs to be cut up by what's known as the container demultiplexer, or demuxer. We'll try and keep the jargon light throughout this, but it needs to go and start demarcating video and audio frames. So it just gets data from the operating system, blocks at a time, and needs to start cutting it up into frames of compressed data.
>> It then needs to start doing simple parsing of the video frames mainly to figure out whether that codec is GPU decodable or needs to fall back to software.
>> We're very used to assuming the GPU will play all of these things, that there'll be hardware acceleration. I think up to 45% of files are not GPU decodable. So these need to be probed. They need to be detected. There can be variants of a given codec, only some of which are decodable on the GPU. Different GPU vendors might have different capabilities, so those need to be detected too. So if it's GPU capable, you pass it through to the GPU black box.
So now if there's a software fallback, the first step is the entropy decoding: removing the mathematical coding of the bitstream. This uses techniques such as Huffman coding or arithmetic coding to decompress the mathematical layer of the bitstream. We then need to start reading the syntax elements for intra prediction. Intra-predicted frames are like still images of the video. So your I-frames operate in the spatial domain: you do your intra prediction in the spatial domain. You have a residual because your prediction doesn't quite match reality. You've made a prediction, but there's a little bit left, and that's what's known as the residual. This is stored in the frequency domain, and these coefficients are quantized.
We then need to do the inverse transform to bring them back to the spatial domain and apply these residuals. So a lot of the process of decoding is: this thing is compressed, yes?
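The prediction, residual, quantization, and inverse transform steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how any real decoder is written: it uses a 1-D block of 8 samples instead of 2-D blocks, a flat prediction of 60 that is assumed to come from already-decoded neighbours, and hypothetical helper names (`encode_block`, `decode_block`, `qstep`).

```python
import math

N = 8  # samples per block (1-D here for brevity; real codecs use 2-D blocks)

def _scale(k):
    # Scaling that makes the transform orthonormal, so idct() inverts dct().
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct(block):
    """Orthonormal DCT-II: spatial samples -> frequency coefficients."""
    return [_scale(k) * sum(block[n] * math.cos(math.pi * (n + 0.5) * k / N)
                            for n in range(N))
            for k in range(N)]

def idct(coeffs):
    """Inverse of dct(): frequency coefficients -> spatial samples."""
    return [sum(_scale(k) * coeffs[k] * math.cos(math.pi * (n + 0.5) * k / N)
                for k in range(N))
            for n in range(N)]

def encode_block(pixels, prediction, qstep):
    """Encoder side: transform the residual and quantize it to integers."""
    residual = [p - q for p, q in zip(pixels, prediction)]
    return [round(c / qstep) for c in dct(residual)]

def decode_block(levels, prediction, qstep):
    """Decoder side: dequantize, inverse transform, add to the prediction."""
    residual = idct([level * qstep for level in levels])
    return [round(p + r) for p, r in zip(prediction, residual)]

pixels = [52, 55, 61, 66, 70, 61, 64, 73]
prediction = [60] * N   # assumption: derived from already-decoded neighbours
coarse = encode_block(pixels, prediction, qstep=8)
approx = decode_block(coarse, prediction, qstep=8)  # an approximation of pixels
```

A larger `qstep` zeroes out more coefficients, which is where the size reduction comes from, at the cost of a less exact reconstruction.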
>> Yes. And you have to predict the highest quality thing that's supposed to go there. The I-frame is the best representation you have spatially,
>> and then there's a lot of temporal compression that can happen depending on the codec,
>> and then you're predicting the reality that was captured in this raw form.
>> Yeah. Because what people don't realize is that the compression on video and audio is 100 times, right? People don't realize how much we compress. For audio, when you go from normal audio to MP3, you compress by 10 times. When you move to video, you need 100 times, 200 times. So you need to remove all the details that you don't care about, because all the compression that we do, and that's very important and people forget about it, is meant to be viewed by humans. So all the codecs, for audio for example, basically mimic how your ear works, a lot of things about the response of the ear, and the same for your eyes. For example, on video, we don't work in RGB. Everyone expects to work in RGB. We don't. We move to YUV, where one component is luminance, brightness, and the others are colors. And this matches your eyes, where inside your eyes you have the cones and the rods: some of them respond more to brightness and the others more to colors. So we need to compress a lot, and so we need to degrade, but in order to degrade we need to match human perception. And this is why it's so difficult. And then we need to use the maximum mathematical power, very complex technologies: we move to the frequency domain, as Kieran said, and we do a ton of quantizing, in order to get the best compression while it still looks good.
>> You're trying to compress in order to maximize quality for human perception.
>> That is correct. And this is very important, right? Compression is not like a zip. With a zip, you have data in and you get data out, and with all the zip compression you try to arrive back at the same image. Here we are degrading the signal. And so we need to degrade both the audio and the video signal in the best way possible. And we can do that, but it involves first a lot of theoretical knowledge about how it works, how the eye works, but also a lot of mathematical tricks. For example, when you move from RGB to YUV, what we do very often is scale down the resolution of the color compared to the brightness. Just this, without any compression, divides the size by two, but most people don't see it. And so on and so on. And then you go to very complex mathematical transforms. So of course Fourier transforms, which in fact are not Fourier transforms, they are discrete cosine transforms, but that's the same idea. So frequency domain. We split the video into blocks, and that's why when it's wrongly decoded you see those blocks, and when it's badly encoded you see those blocks, and so on, to arrive at compression rates that are insanely high. And each generation of codec is like 30% less
>> Mhm.
>> for the same quality, right? And this requires amounts of computational power that are huge.
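The "divides the size by two" figure for color downscaling can be checked with a little arithmetic. The sketch below assumes the BT.601 full-range conversion used by JPEG/JFIF and 4:2:0 subsampling (color planes halved in both directions); the speakers name neither, and the function names and the 1920x1080 frame size are illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 full-range RGB -> Y'CbCr (the JPEG/JFIF variant), 8-bit scale."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def subsample_420(plane):
    """Average each 2x2 block of one color plane (even dimensions assumed)."""
    return [
        [(plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
         for x in range(0, len(plane[0]), 2)]
        for y in range(0, len(plane), 2)
    ]

def samples_per_frame(width, height, chroma_h=1, chroma_v=1):
    """Total samples: one brightness plane plus two (possibly smaller) color planes."""
    luma = width * height
    chroma = 2 * (width // chroma_h) * (height // chroma_v)
    return luma + chroma

full = samples_per_frame(1920, 1080)        # 4:4:4, color at full resolution
sub = samples_per_frame(1920, 1080, 2, 2)   # 4:2:0, color at quarter resolution
print(full, sub, full / sub)
```

With full-resolution color a frame carries three samples per pixel; with 4:2:0 it carries 1.5, which is the factor of two mentioned above, before any actual compression has happened.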
>> No, you should elaborate. It's 30% better, but an order of magnitude, perhaps even two orders of magnitude, more compression power. That's the big difference.
>> What do you mean by compression power?
>> Sorry, CPU power to achieve that level of compression.
>> Oh, yeah. So you have to be able to leverage the CPU and sometimes the GPU, like you mentioned. And we should mention that a lot of this programming is done at the lowest possible level of the stack, whether it's C or, as a legendary Twitter handle re-emphasizes over and over, a lot of assembly.
>> So what happens globally is that you have an address, which gives you, with the operating system, a stream of bytes, a stream of data. This is the first step. The second step is demuxing, where you separate audio, video, and subtitles into different tracks. Then on each of those tracks you decompress them, decode them: audio with an audio codec, video with a video codec, and subtitles with a subtitle codec. Once you've decompressed those, you have raw images, and then you talk to your graphics card and your screen to display them. And same for the audio: you talk to your audio card, which then goes in analog to your speakers.
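As a rough picture of the demuxing step in that pipeline, here is a toy sketch. The packet layout `(track_id, timestamp, payload)` is hypothetical and far simpler than any real container; the point is only that one interleaved stream is split into per-track queues, each of which would then go to its own decoder.

```python
from collections import defaultdict

def demux(packets):
    """Split an interleaved packet stream into one ordered list per track."""
    tracks = defaultdict(list)
    for track_id, timestamp, payload in packets:
        tracks[track_id].append((timestamp, payload))
    return dict(tracks)

# A made-up interleaved stream, as a container might store it.
stream = [
    ("video", 0, b"V0"), ("audio", 0, b"A0"), ("video", 40, b"V1"),
    ("subtitle", 0, b"S0"), ("audio", 21, b"A1"), ("video", 80, b"V2"),
]

tracks = demux(stream)
for name, packets in tracks.items():
    print(name, len(packets), "packets")
```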
>> And everything we've just said in the past couple of minutes, every sentence is someone's lifetime's work. There are books about every sentence. So the level of complexity in many cases is inordinate. Every sentence has thousands of people working on it in the industry as a whole, books written about it. So there's a lot of detail, a lot of subtleties, and both academic and practical realities, both of which matter.
>> We mentioned codecs, but I don't think you mentioned containers. So what are the actual containers for some of the stuff we're talking about? People are familiar with MP4, MKV. So what are containers versus the thing that goes inside?
>> So the container is what we also call the muxer. When I say demuxing, it means decontainerizing. If you look, mux and demux are multiplexer and demultiplexer, and in the same way codec is actually coder-decoder. So containers are this collection of multiple tracks. It's what normal people call the file format, but it's a bit more subtle than that. The best-known one of course is MP4, but when I started it was AVI. AVI was the video format from Microsoft, and MOV, which became MP4, was a format from Apple. In the open source community, one of the people still active on video is Steve Lhomme, who started the Matroska format, which is a bit more complex and more future-proof.
And there are so many others.
>> So it's a pretty common thing, and maybe it'll even happen in this conversation, that people confuse the container and the codec, right? So confuse MP4 and H.264, for example. Is that a horrible violation?
>> No, it's not, because technically the name of H.264 is MPEG-4 Part 10, because MPEG-4 is actually a meta-specification which has several things in it. There is Part 2; there are audio codecs, AAC is in fact MPEG-4 audio; and there are actually several video codecs inside the MPEG-4 specification. One of them is MPEG-4 Part 10, also called AVC, also called H.264. So it's completely the fault of the industry for making things difficult to understand. People then don't understand why sometimes you talk about MPEG-4 Part 10 when you mean H.264, and why it's not MP4.
>> So you can technically shove all kinds of different codecs inside containers, and horribly so. But broadly speaking, MP4 is understood to generally be H.264 plus AAC audio. 99% of the time it's that, and the rest is de minimis, edge cases really, compared to that. So it's not the end of the world, though there are people who do get annoyed by it. But also, in reality, for something like VLC, just to point out, the file may say .mp4 but it may be something completely different. And that's one of the challenges both FFmpeg and VLC have: the real world is a completely different place to a three-letter file extension.
>> And this is very important to say. For example, in VLC and FFmpeg we discard the file extension. We look into the file to understand what's in it, because so many people say, "Oh, it's a video, it must be MP4," but technically it's an MOV or maybe it's an MKV. So we analyze in real time everything that we have and we don't trust the extension.
>> So, what information does the fact that it's MP4 give you?
>> It helps, right? It gives you a hint. It's like: oh, it ends in .mp4, so I will start by probing it with the MP4 container demuxer, to see; well, it should be that, but I don't trust it. And if I'm lost, I say, okay, maybe I'm going to try something else. So it bumps the priority of the module.
>> So, just to take a bit of a tangent there: the dumb thing is, if you try MP4 but it turns out it's a different codec than you would have expected, most players just break there. Yes?
>> Yes.
>> And so, how do you not break? Philosophically, I'm sure there's a bunch of stumbling blocks along the way where it's easy to just break and stop, freak out, that's it. How does VLC not?
>> This is why VLC is popular. But the reason is that VLC actually was just the client of a streaming solution called VideoLAN, from a very long time ago, from the late '90s. And when you're playing video over UDP on a network, it might be damaged. So you don't trust your inputs, and this is very important in today's security: you don't trust your inputs. So everything in VLC is prepared to work with broken files.
>> Mhm.
>> And it's a philosophical idea from the beginning, and everything is engineered around that, and it's a culture, right? So for example, VLC became very popular for that, because a long time ago, when people were pirating content, which they do a lot less today,
>> And none of us ever have.
>> No, of course not. The metadata needed to play some files, like AVI, is at the end of the file. And when you're downloading, you don't have that yet. So VLC was just like, hey, this file is broken, but I'm still going to try to interpret it. And this was very useful.
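The idea of looking inside the file and treating the extension only as a hint that bumps a probe's priority can be illustrated as follows. This is a simplified sketch, not the actual probing code of VLC or FFmpeg: the function and table names are hypothetical, and only three well-known signatures are checked (the EBML header that starts Matroska/WebM files, the `ftyp` box type at byte offset 4 in typical MP4/MOV files, and the `RIFF`...`AVI ` header of AVI).

```python
import os

def looks_like_matroska(head):
    return head[:4] == b"\x1a\x45\xdf\xa3"   # EBML header ID

def looks_like_mp4(head):
    return head[4:8] == b"ftyp"              # first box is usually 'ftyp'

def looks_like_avi(head):
    return head[:4] == b"RIFF" and head[8:12] == b"AVI "

PROBES = {
    "matroska": looks_like_matroska,
    "mp4": looks_like_mp4,
    "avi": looks_like_avi,
}

# The extension is only a hint about which probe to try first.
EXTENSION_HINTS = {".mkv": "matroska", ".webm": "matroska",
                   ".mp4": "mp4", ".mov": "mp4", ".avi": "avi"}

def probe(filename, head):
    """Return the detected container name, or None if nothing matches."""
    hint = EXTENSION_HINTS.get(os.path.splitext(filename)[1].lower())
    # Hinted probe sorts first (False < True); the rest keep their order.
    order = sorted(PROBES, key=lambda name: name != hint)
    for name in order:
        if PROBES[name](head):
            return name
    return None

# A file named .mp4 whose bytes are actually Matroska:
print(probe("movie.mp4", b"\x1a\x45\xdf\xa3" + b"\x00" * 8))
```

A real player goes much further than this, for example by continuing to decode when headers are missing or damaged, as described above for partially downloaded AVI files.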
We hinted at the awesomeness of the various different stages. We hinted at the awesomeness of codecs, the depth and the richness and the complexity of everything involved there. Let's try to define: what is a video codec? What's involved there? What does it mean to compress something? You already started to hint at it, but can we elaborate a little bit more?
>> So there's a huge amount of redundancy in any video, both spatial and temporal, and the point of any video codec is to remove this redundant data, using mathematical properties as part of this reduction process. More often than not, compression uses several orders of magnitude more compute than decompression, so it's more costly, both financially and in CPU resources. So it's asymmetric in that respect. That's often the case because compression is done once, but there can be lots of viewers of a file.
>> So the aim is to take that information and compress it by 100x, 200x, removing redundant information and using mathematical properties to make it small, but also to have properties such as error resilience. As JB suggested, VLC in the beginning was used to play UDP network feeds, and UDP network feeds lose packets, so one of the design goals of a codec is also to be recoverable.
>> You need to actually be able to join a stream. It's not necessarily a file. You need to join, get into the decoding process, and start decoding.
>> And to give more of an image to people who are not familiar: when you watch any type of movie, the camera is going to pan and travel, and you realize that, for example, all the background is the same for like a minute, or 30 seconds. So you can reuse the cloud that you see in the background; you can reuse that from one frame to another. And so the more memory you have, the more power, the more comparisons you can make, and so the more compressed you can be. And most of the modern codecs are basically doing that.
>> So just to make it even more explicit: what is video? Video is a bunch of pixels. In RGB, you have three values per pixel, and you have a grid of pixels, and you have, let's say, 24 or 30 or 60 frames a second, and you just have all these pixels repeating and showing different stuff 30 times a second. And so the philosophical, the technical question is: how can I compress all of that, store all of that at 100x
>> 1,000x, right?
>> 1,000x.
>> The target is 1,000x, right?
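To put numbers on the 100x and 1,000x targets, here is the raw data rate of uncompressed RGB video. The 1920x1080 resolution and 8 bits per color value are assumptions chosen for illustration; only the 30 frames per second comes from the conversation.

```python
def raw_bitrate(width, height, fps, bytes_per_pixel=3):
    """Bits per second of uncompressed video (3 bytes = 8-bit R, G, B)."""
    return width * height * bytes_per_pixel * fps * 8

raw = raw_bitrate(1920, 1080, 30)
print(f"raw: {raw / 1e6:.0f} Mbit/s")
for ratio in (100, 1000):
    print(f"{ratio}x smaller: {raw / ratio / 1e6:.2f} Mbit/s")
```

Under those assumptions the raw stream is roughly 1.5 Gbit/s, so a 1,000x reduction brings it down to roughly 1.5 Mbit/s.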
>> And when you say redundancy, what is redundant? Meaning stuff that, at best, humans wouldn't notice if it was missing?
>> So for example, you have a picture of a cloud, and in the next frame there's still going to be the same cloud, so it's redundant: you could just put it once and not repeat it. Or you have a black background behind me, for example; the black is the same on the whole picture. So you can say: in this picture, take the pixels that you have on the top left and the ones on the top right; I'm not going to give the values for the top right, I'm just going to tell you they're the same as the top left. And then you can say, for frame one, reuse something from the previous frame, or the frame before that, and so on. So basically it's unlimited, but then it's limited in terms of memory or in terms of compute power, because, for example, if you need to compare pixels against 200 frames in the past at 4K resolution, it's a huge amount of compute.
>> And then when you're showing it, you have to do the decompression of all of that. So does the codec have both the encoding and the decoding? Is it a coupled process that you're developing?
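The "reuse something from the previous frame" idea can be shown with a deliberately tiny sketch. It is lossless and only handles blocks that are exactly identical at the same position, whereas real codecs search for moved blocks and store a residual; the frame layout (a flat list of pixel values) and the function names are hypothetical.

```python
def split_blocks(frame, size):
    """Cut a flat list of pixel values into consecutive blocks."""
    return [frame[i:i + size] for i in range(0, len(frame), size)]

def encode_frame(prev, cur, size=4):
    """Per block: say 'copy' if unchanged since prev, else store it raw."""
    ops = []
    for old, new in zip(split_blocks(prev, size), split_blocks(cur, size)):
        ops.append(("copy",) if old == new else ("raw", new))
    return ops

def decode_frame(prev, ops, size=4):
    """Rebuild the current frame from the previous frame plus the ops."""
    out = []
    for old, op in zip(split_blocks(prev, size), ops):
        out.extend(old if op[0] == "copy" else op[1])
    return out

prev = [0] * 12 + [5, 5, 5, 5]   # mostly black background
cur = [0] * 12 + [9, 9, 9, 9]    # only the last block changed
ops = encode_frame(prev, cur)
print(ops)
```

Only the changed block carries pixel data; the other three are a single "copy" instruction each, which is the redundancy being removed.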
>> Exactly. And those are different trade-offs. Are you going to compress more, but then it might be more difficult to decode? Are you going to make a codec that is more complex to encode and easier to decode? Are you going to make a codec that is easier to encode because you need to be fast, but then the client side, the player, is going to spend more time? That's why you have so many different types of codecs: it's not always easy. And to make it even more complex, modern codecs like AV1, AV2, or VVC are actually not single codecs. They are a collection of tools: multiple tools, multiple codecs in the same codec, to get more compression depending on the image.
>> So just to elaborate, codecs like AV1 and VVC have a much wider audience. It could be screen-share content, it could be video, it could be animation. All of these require different coding tools. So what happens these days is that a collection of tools is put in and called AV1, called AV2, called VVC, to allow for different use cases. So you may be on Zoom sharing your PowerPoint, and then you need to show the audience a video. That codec needs to start changing its tool set depending on the content, to compress in a different way.
>> And like you said, there's a bunch of incredible engineers behind each part of that, each part of the tools that make up AV1, for example.
>> Sure.