These experts expose the limits of modern compilers by proving that peak performance still demands the "black magic" of handwritten assembly. It is a sobering reminder that our most critical infrastructure relies on a level of low-level mastery that most developers have long forgotten.
Shocking performance boost of assembly code: ~100x faster than C code | Lex Fridman Podcast
JB and I have started companies, broadly speaking, around the FFmpeg and VLC ethos. So that's really low-level work.
So in most companies this wouldn't be written in assembly. It would be accepted that C is fast.
As you can see from that, C is not fast.
So here it says 62 times faster than C.
>> Yeah, so it's taking the ethos of doing low-level programming, real-time programming, and using that for commercial applications. And JB and I have started companies around that, in many cases hiring developers from the open source community to use that ethos. So that's a great example of some of the things we're doing. In most companies they would say, "I'll write this in C and it's fast and we're done."
But actually, you can get a lot better.
For me, some of the headaches we have are around some OSes that are difficult to support, right? Because if you look at VLC, and thanks to FATE and FFmpeg, the latest version of VLC still runs on Windows XP and runs on Windows 11.
We work on Mac OS X 10.7 up to the latest macOS, whatever it is, right? 26. We work on iOS since iOS 9, and we are actually at iOS 26 now, right? We support many types of Linux, BSD, Solaris. The latest version still runs on OS/2, right? There are maybe 10 users of OS/2 in the world, and one of them is maintaining VLC. Then you realize that this very small team around VLC, using FFmpeg codecs and all the other ones, supports more OSes than Microsoft or Google or Apple. And they have an infinite amount of power and resources.
But for example, the worst is iOS. In order to build for iOS 9, we need to do some very clever mixing of several versions of the Xcode IDE and SDK from Apple, a type of Frankenstein version, so that we can still support iOS 9, which is not supported at all by Apple's compiler, in order to still run on 32-bit, on iOS 9. And you've seen on FATE that it was still supporting iOS 9, right? So my headaches are mostly related to the support of so many OSes.
And it's important, because we receive so many people saying, "Hey, thank you. I still have my iPad 2 to watch movies and it still works on iOS 9." And it also has an impact: not forcing people to buy new hardware when it works fine, if you optimize it correctly. Which brings us to what we were saying about assembly: it's also fighting the fact that you need to buy something new non-stop while you could optimize more, which is a lost art.
>> You've got to tell me about this lost art, or the carriers of the flame of assembly. What is assembly? Why is it beautiful? Why is it challenging? How does it work?
>> So, when you write assembly code, you write it using the instructions the actual processor uses directly. Most of the time you would write in a language, let's take C as a good example, and the compiler would use that to create assembly language and machine code instructions for you based off your C code.
And there's a specific flavor of assembly that we use in FFmpeg that's called SIMD: single instruction, multiple data.
So this means, for example, say I want to add five to a number. In scalar assembly, you work on an individual element. I have the number 10 and I want to add five. I use the add instruction, I add five to 10, and I get 15.
With SIMD, I can have a whole vector of 16 different numbers. They could all be different.
If I want to add five to that, I can run one instruction, and that one instruction adds five to all 16 elements.
And that, as you can imagine, lends itself very well to video. Video is, you know, a pixel grid. So I can perform operations on multiple pixels at the same time.
The key thing that we do differently in FFmpeg is we don't use any abstractions, or any major abstractions, on top of that. There's a part of the world that uses what's known as intrinsics. These are C functions that behave very similarly to, but not quite the same as, writing assembly by hand.
With intrinsics, the registers that data is stored in on the CPU are allocated for you by the compiler.
And so the key thing to understand when we write SIMD is that we get a 10x, and not a percentage, a 10x to 50x speed improvement. That function is 62x.
>> That's nuts. The FFmpeg account, as you know, posts and tweets a lot about that to try and say, "Hey, we're doing this stuff." You are a person who sees the beauty in assembly, but it's also extremely useful for these kinds of applications to actually significantly outperform even C, which is crazy.
>> It is necessary, right? Because one of the projects that we need to talk about is called dav1d, right? So dav1d is a decoder for the format that was done by the Alliance for Open Media, which is a video codec called AV1.
>> For people who don't know, we've been talking about H.264. AV1 is another hugely popular standard and codec that is increasingly taking over the internet.
>> And when this format was launched, many people said, even from the Alliance for Open Media, right, which is Google, Netflix, Amazon, Mozilla, they said, "Well, this format is so complex, decoding must be done in hardware, right?" And, well, I arrived with a few other people, mostly Ronald, Henrik, and Martin, who said, "We need to have an extremely good software decoder, because it's going to take time to have hardware." And so we wrote this project, which is beyond insane. We are talking about 30,000 lines of C, but 240,000 lines of handwritten assembly. Right?
Handwritten assembly. 240,000 lines.
That's incredible. I mean, some of the stuff we're talking about is probably among the biggest assembly code bases.
>> To give you an idea, and Kieran can correct me, but I think FFmpeg has 100,000 lines of assembly for all the codecs. And just this one has 240,000.
It's a VideoLAN project, of course.
And it is optimized to the maximum, because the motto when we started the project was "every cycle matters," right? Every cycle matters because dav1d is used in VLC and in some software AV1 playback stacks. We are talking about probably 3 billion devices, which are going to decode video non-stop, because, for example, 30% of the video from Netflix is now in AV1, 50% of YouTube, right?
And you often don't have a hardware decoder, because not many devices have a hardware decoder. And with dav1d, we realized that with one or two cores, you were able to decode 720p correctly. So it is literally incredible, right?
>> It's dav1d. Look at that. Yeah, so this is another spicy tweet from you: "This is what peak video codec should look like."
79.9% assembly, 19.6% C, and 0.5% other.
>> And what's incredible is that with those tweets, which are factual, people go crazy. They're unhappy, right?
>> For the last two years they go crazy. "No, intrinsics is fine. The compiler is..." Oh, they go crazy. "You cannot optimize your compiler. Auto vectorization. It's your fault. You don't understand." And we've tried that forever, right?
>> For two years. And two years later, after showing hundreds of examples of handwritten assembly: "No, no, no, you're doing it wrong. The compiler can do this."
>> So, we should actually just articulate that a little clearer. The intuition there from the software engineering folks, when you have code like, okay, let's just take an example, C++: there's a compiler that's doing a lot of the optimization.
>> Yes.
>> And the presumption is, if you have a good enough compiler, if you continue to improve the compiler, you're going to generate code that gets optimal performance. You cannot possibly beat it. And you're consistently challenging that thought.
>> By orders of magnitude.
>> By orders of magnitude, hand-crafted assembly can outperform C.
>> One of the things that they tell us is, "Yeah, but modern compilers have auto vectorization, right?" Because the SIMD that we're doing is vectorization. And it's not even close, right? It's not like 5% or 10% slower. It's multiple times slower.
>> So, I don't know if you can say something philosophically, because there are a lot of great software engineers, great engineers, great machine learning people. Karpathy will listen to this and say, "What's the intuition I'm supposed to get from this?"
>> Karpathy learned assembly because of the tweets, by the way. He's like, "Oh, I think this is a movement."
>> No, no, he didn't. And you know the way he documents his work and so on.
Philosophically, what's important to realize is that we are past the time when hardware was getting so much faster, right? We are at the end of Moore's law. We have limitations for AI, for memory. You need to go down in the stack and optimize more to get more power from what you have, because the demand for power, CPU power, GPU power, is exploding while the hardware is not exploding in speed, right? So what people do is add more cores, right? But at some point you can't have 250 cores, right? So what we do is take every inch of the machine.
>> Not just that. We abuse the machine. We use the machine in ways that the creator didn't expect.
Sometimes we use an instruction that's completely unrelated to what we do. We use a cryptography instruction in video processing to do something that has nothing to do with cryptography.
>> And one of the other things that we do, for example in dav1d, which is a bit crazy, is that we don't use the function calling convention from the operating system.
>> We should explain that.
>> That is extremely complex, but basically, usually when you move from one function in code to another, there is a way to save the registers, the state of the CPU, to enter another function.
And this is, like, standard. It's a bit complex.
>> I would simplify this a bit. So dav1d does things to abuse the calling convention. You could define the calling convention as: I've written a function and I want to call another function. How is the data shared between the functions? There's a convention for that, what's known as a calling convention.
And what dav1d does, for optimization reasons, is create its own calling convention sometimes. So if I want to call Lex Fridman's library, we've got to agree on a convention so that I can share data with you in the assembly language space.
And one of the challenges in assembly is that every operating system, well, not every operating system, but there are at least four that I can think of on x86: Linux 32-bit, Windows 32-bit, Windows 64-bit, Linux 64-bit.
They all have their own calling conventions, and so one of the amazing things Loren Merritt did, who we talked about before, was create a very lightweight abstraction layer so you could write your assembly code once and it handled all the calling convention stuff for you.
Which was always a problem, because you had to manage four different variants.
But dav1d takes this even further. For speed reasons, it uses its own calling convention within itself to bypass the usual rules of functions and say, "Okay, actually I'm going to call a function this way, because I know it's within my library."
>> Does that have to be special to every single operating system?
>> Well, if it's custom, no. But the challenge is, in general, yes. And in terms of each instruction set, the thing to always emphasize is we do this on every instruction set. So every instruction set has its own handwritten assembly.
Which is even more crazy, and that matrix has got bigger in recent years because of RISC-V, because of ARM64, because of the new SVE, there's SME. x86 has AVX-512, AVX. So we do runtime processor detection. We see what the machine FFmpeg or dav1d is running on is capable of, because you could be on a laptop from 2008 where this isn't there. With runtime detection, we set function pointers accordingly.
And then from then on, off you go.
>> Or you could be on a machine with RISC-V.
>> Yes. And in all that, we don't even respect the calling convention of the operating system, in order to be faster, because we know that we are going to be called from within our binary. So we can share data without saving all the registers in the common way, because that can lead to loading and saving registers through the L1 and L2 caches of the CPU, and avoiding that makes us faster. So that's why I said that understanding CPU architecture, computer architecture, is key. And this is also why it's handwritten. I've never heard of any project other than dav1d doing that. This is what Kieran calls an art, right?
>> It is an art. I think in the mass-market world there isn't something like this on billions of devices. I know there are some specialist industries. I know in high-frequency trading they take this really seriously, where they're receiving feeds from a market and they need to react within x number of microseconds, and so the instructions matter. But that's not a mass-produced thing that's on a billion devices. That's hyper-specialized, running on hyper-specialized hardware.
We're running on all hardware from...
>> Sorry to linger on it, but that's a really counterintuitive, almost revolutionary idea here, that there's a huge amount of value to assembly. What are we supposed to take away from that? You know, there's a bunch of people listening to this, myself included. I programmed for many, many years in C and C++, going up through the standards of C++, found a love of C++ and metaprogramming and so on. And then transitioned more and more, because of machine learning, about 15 years ago to Python. And so for me, in this Python world, JavaScript world, now vibe coding, where I'm just using natural language, sitting in my jacuzzi, drinking a drink, and just talking to the computer... record stops. What is the value of going back all the way down to the low level?
Because you can get more power per dollar invested, right? And sometimes it's going to be a problem that is limited by your hardware.
A good analogy is what you see with quantization in LLMs, right?
And people are doing, "Oh, I'm going to do that in FP8 or FP4," or some crazy things, like Microsoft, who did it in 1.5.
It's because you're constrained by memory, because you're constrained by the machine you can run on, because at some point we are doing real time. And I believe this is going to happen in AI inference also: at some point you need to get faster, and you cannot always get more powerful hardware, right? So you need to analyze the code and see where the mission-critical part is. Where are the things that are called non-stop? And dav1d is a good example. It's going to be run billions of hours per day. Yeah, that makes sense. It doesn't make sense for the glue of the FFmpeg CLI. It makes sense over there.
>> Yeah, and this has to do also, we'll talk about it more, with your new effort in your company Kyber, which is doing that kind of thing for ultra low latency. So the slogan being "every millisecond counts."
So, when you're actually extremely constrained in some dimension.
>> We are also arriving at a point where we've done so many great things, but the hardware is getting back to us, right? Because cost is increasing, because we need more power. And so you're limited by either your CPU, your RAM, or your networking.
And you need to optimize, and this is where the value is going to be. Especially because AI is going to help do the programming of, like, business logic, right? And so the core thing that you will not be able to vibe code is optimization for the hardware, to be as fast as possible.