This deep dive masterfully exposes the gap between WebRTC's simple API and the complex infrastructure required to scale real-time media. It is a lucid reality check for developers who mistake a basic peer-to-peer connection for a production-ready system.
WebRTC Deep Dive: The Protocol That Powers Every Video Call
When you talk on a Zoom call, your voice reaches the other person in under 200 milliseconds. That is faster than a TCP handshake to most servers on the internet. Faster than your own browser can pull a JSON response from your own back end. Now, think about that for a second. What protocol is the browser using to pull this off? It is not HTTP.
It is not WebSockets. It is WebRTC. And most engineers have never written a single line of code against it. By the end of this video, you'll know exactly how WebRTC works, why it is the only way Zoom, Discord, and Google Meet could exist at the scale they do, and the one architecture decision every team building real-time video gets wrong on their first try. Let's get into it.
Real-time communication is everywhere now. Remote work, telehealth, multiplayer games, live collaboration tools, customer support, video calls. Every modern product seems to have a video tab somewhere. But very few engineers actually understand how any of this works under the hood. WebRTC is one of those rare topics that shows up in senior system design interviews and almost never in your day job, which means if you know it well, you stand out a lot. So this is worth your time.
Now to understand why WebRTC matters, you have to remember what video calls used to look like. Before 2011, if you wanted to do voice or video in a browser, you needed Flash or a Java applet or some custom plug-in the user had to download and install. Google Hangouts had a browser plug-in. Half the video products on the web shipped their own sketchy little binary that ran with way too many permissions. It was painful for users, painful for developers.
Security holes everywhere. A different plug-in for every browser. Half your users could not even get the call started. Then in 2011, Google open sourced a project called WebRTC. By around 2017, every major browser had it built in. Chrome, Firefox, Safari, Edge. No installs, no plug-ins. The browser became a real-time communication endpoint the same way it had always been an HTTP endpoint. You do not install HTTP, you just use it. And now the same was true for peer-to-peer video. Once the browser could do real time on its own, every modern video product you use today became possible. Google Meet, Discord, Slack huddles, Microsoft Teams, all of it sits on top of WebRTC.
If you have spent your career writing HTTP services, you're used to one model.
Client asks, server answers, done. WebRTC does not work like that at all. So before we look at any code, let us first build the right mental picture. HTTP is a transaction. You send a GET request, you get a response, the connection effectively ends. You send a POST, you get a response. Done. Stateless, predictable, and built on TCP. WebRTC is a session. Two peers negotiate a connection once and then they keep talking. There's no GET or POST.
Instead, before any audio or video flows, the two browsers have a small back-and-forth conversation. One side sends a description of what it can do: what video formats it supports, what audio formats, what its network looks like. The other side reads that, picks what it can match, and sends back its own description. Once both sides agree, the call begins. That little description document has a name. It is called SDP, the Session Description Protocol. Now, do not worry about the name for now.
We'll look at it more carefully in a bit. Just remember the idea. Two browsers exchange a description of their capabilities, agree on what they can both do, and only then do they open a direct pipe between each other. That pipe stays open for as long as the call lasts.
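To make the "description of capabilities" idea concrete, here is a minimal sketch that pulls the codec list out of an SDP document by reading its `a=rtpmap` lines. The function name `listCodecs` and the sample SDP are illustrative assumptions; the sample is hand-written and heavily trimmed, and a real browser offer carries many more lines.

```javascript
// Extract the codecs advertised in an SDP document.
// Each codec appears on a line shaped like:
//   a=rtpmap:<payload type> <codec name>/<clock rate>[/<channels>]
function listCodecs(sdp) {
  const codecs = [];
  let kind = null; // "audio" or "video", taken from the most recent m= line
  for (const line of sdp.split(/\r?\n/)) {
    if (line.startsWith("m=")) {
      kind = line.slice(2).split(" ")[0];
    } else if (line.startsWith("a=rtpmap:")) {
      const [payloadType, encoding] = line.slice("a=rtpmap:".length).split(" ");
      const [name, clockRate] = encoding.split("/");
      codecs.push({
        kind,
        payloadType: Number(payloadType),
        name,
        clockRate: Number(clockRate),
      });
    }
  }
  return codecs;
}

// Hand-written fragment for illustration only (an assumption, not a real offer).
const sampleSdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

console.log(listCodecs(sampleSdp));
```

Each side can run this kind of scan over the other side's description and keep only the codecs it also supports, which is roughly what the negotiation boils down to.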
And here's the most important part. WebRTC runs on UDP, not TCP. Most engineers spend their entire career building on top of TCP without thinking about it.
Every HTTP call you have ever made, every database query, every gRPC request, all of it rides on TCP. TCP is the protocol that makes the internet feel reliable. So when you send a packet over TCP, it is guaranteed to arrive in order, exactly once. If a packet gets lost on the way, TCP detects that and resends it. If packets arrive out of order, TCP buffers them and hands them to your application in the right sequence. You write fetch and you get back exactly the bytes the server sent.
No surprises. UDP is the opposite. UDP is just: here's a packet, good luck. No retries, no ordering, no guarantees. If a packet gets lost, it is gone. If two packets arrive out of order, they show up out of order. If a packet shows up corrupted, your app deals with it.
Sounds terrible, right? Why would anyone use this? Because TCP's reliability has a cost, and that cost is time. When TCP detects a lost packet, it stops everything and waits for the retransmission. That pause might be 100 milliseconds, 200, sometimes more. Your whole stream blocks until the missing piece arrives. For loading a web page, that is fine, but for a live voice call, it is a disaster. UDP is unreliable on purpose. Packets can arrive out of order. Packets can get lost and never come back. For HTTP, that would be unacceptable, because your bank balance cannot show the wrong number just because a packet was late. But for video and voice, it is the opposite. A packet that arrives 300 milliseconds late is worse than a packet that never arrives. You would rather drop one frame of video than freeze the entire call for half a second, waiting for that frame to retransmit. Your face can show one weird frame and nobody dies. Your voice can skip a tiny syllable and the conversation keeps going. TCP optimizes for correctness. UDP optimizes for time.
WebRTC chooses time, because in real-time communication, late data is dead data.
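The "late data is dead data" rule can be sketched as a playout deadline: a packet that misses its slot is dropped instead of stalling everything behind it. This is a simplified illustration, not how any particular browser's jitter buffer works. The function name `selectPlayable`, the 20 ms frame size, and the 100 ms delay budget are all assumptions chosen for the example.

```javascript
// Decide which packets a receiver would play, given a fixed playout schedule.
// Packet `seq` is due at seq * frameMs, and we tolerate budgetMs of delay.
// Anything arriving after its deadline is discarded rather than waited for.
function selectPlayable(packets, { frameMs, budgetMs }) {
  const played = [];
  const dropped = [];
  for (const packet of packets) {
    const deadlineMs = packet.seq * frameMs + budgetMs;
    if (packet.arrivalMs <= deadlineMs) {
      played.push(packet.seq);
    } else {
      dropped.push(packet.seq);
    }
  }
  return { played, dropped };
}

// Packet 2 shows up 400 ms in, long after its 140 ms deadline.
const arrivals = [
  { seq: 0, arrivalMs: 40 },
  { seq: 1, arrivalMs: 65 },
  { seq: 2, arrivalMs: 400 },
  { seq: 3, arrivalMs: 110 },
];

console.log(selectPlayable(arrivals, { frameMs: 20, budgetMs: 100 }));
// → { played: [ 0, 1, 3 ], dropped: [ 2 ] }
```

With TCP semantics, packet 3 would have been held back until packet 2 arrived. Here the call simply loses 20 ms of audio and moves on.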
And that decision shapes everything else about how this works.
Now, here is something that confuses a lot of people. WebRTC is not actually a single protocol. It's a bundle of protocols that all work together, hidden behind one browser API.
We'll see the actual JavaScript in a few minutes, but before we look at the code, you need to know what is happening underneath. Because when something breaks in production, it will not be the API that fails. It will be one of these layers underneath. So let me walk you through the seven players.
There is SDP, which we already touched on. It is the Session Description Protocol. Just a text document that lists what each side can do. My side says I can send video using this codec, audio using that codec. By the way, codec is just a fancy word for a way of compressing audio or video so it fits through a network without using too much bandwidth. Both sides have to agree on a codec they can both handle before any media starts flowing.
Then there is ICE, which stands for Interactive Connectivity Establishment. ICE is the algorithm that figures out the best network path between two peers. We'll spend a lot more time on this in a minute, because finding a path between two random laptops on the internet is actually really hard.
Then there is STUN, a small server that helps each peer figure out its own public address. And we'll come back to why that is hard.
There is TURN, a relay server that kicks in when a direct connection between two peers is actually impossible. Roughly 10 to 15% of all WebRTC connections cannot establish a direct path and have to fall back to TURN, and we'll come back to that one too.
Then there is DTLS, which is just TLS for UDP. The same encryption handshake your browser does for HTTPS, but adapted to work on top of UDP instead of TCP. And this is what keeps your call private.
There is also SRTP, the Secure Real-time Transport Protocol. This is what your actual audio and video flow over once the connection is up. The S is for secure. SRTP packets are encrypted courtesy of DTLS.
And finally, there is RTCP, the control channel that runs alongside SRTP. RTCP is how each side tells the other how the connection is doing. Are packets getting lost? Is bandwidth dropping? Should we lower the bit rate? RTCP is the feedback loop that keeps the call adaptive.
All seven of these are standardized by the IETF and the W3C, the two big internet standards bodies. That is why your code works the same on Chrome, Firefox, Safari, and Edge. Browsers do not get to invent their own version of WebRTC. Seven protocols, one API on top. That is WebRTC.
Now let us talk about the hardest part of making this thing work in the real world. So you are sitting at home on your laptop. Your friend is sitting at her home on her laptop. You both want to start a video call. Now it might surprise some of you, but your laptop does not have a public IP address. It has a private one, a local address that only makes sense inside your own home network. Your laptop thinks it is something like 192.168.1.42.
So does your friend's laptop. So do about half a billion other laptops on the internet right now. They're all using the same private address ranges, hidden behind routers that translate between the inside world and the outside world. That translation is called NAT, network address translation. Every home router does this. The reason it works is that when you make an outgoing request, like loading a web page, the router remembers it and routes the response back to you.
The outside world only ever sees the router's public IP, never your laptop's address. But this creates a real problem for video calls. To call your friend directly, you need to send packets to her laptop. Not to her router, to her actual laptop sitting on her couch. And from the outside, her laptop is invisible. There is no address you can put in a packet that will reach her. So how do you connect to a machine you cannot even address? This is the single hardest thing WebRTC has to solve. And the way it solves it is actually quite smart. Let me walk you through the steps.
Step one, both of you connect to a common server somewhere on the public internet. This is called a signaling server. You build this yourself, usually over WebSockets. Its only job is to be a matchmaker. It introduces the two peers and then steps out of the way. WebRTC does not even specify how signaling works, because it does not care. You decide.
Step two, each peer talks to a STUN server and basically asks, "Hey, what does my public IP look like from your side?" STUN replies, and now both peers know what address the outside world sees them on.
Step three, each peer collects a list of every possible address it might be reachable at. Local network address, public address from STUN, maybe a relay address from TURN. Each one of these is called an ICE candidate. A candidate address that the other side might be able to reach you on.
Step four, both peers send all their ICE candidates to each other through the signaling server. Then they start firing connection attempts at every possible pair of addresses. ICE keeps trying combinations until one of them works. The fastest path wins.
Now the connection is up. DTLS does the encryption handshake. SRTP starts carrying media. And from this moment on, your video and audio flow directly between the two laptops. They do not pass through the signaling server. They do not pass through any server you own. The signaling server was just the matchmaker. Once the introduction is done, it steps out. And that is what people mean when they say WebRTC is peer-to-peer.
Now, all of this sounds like a lot, but the API the browser gives you is shockingly small. So, let me show you the entire connection flow in about 20 lines of JavaScript. Now, quick note before we start. You'll see a thing called signaling channel in the code. That is just my placeholder for whatever connection you have to your own signaling server. Usually, it is a WebSocket. The browser does not care what you use, as long as you can pass messages back and forth between the two peers.
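Since WebRTC leaves signaling entirely up to you, it may help to see how little the matchmaker has to do. The sketch below is a hypothetical in-memory stand-in for a signaling server: the class name `SignalingRoom`, its methods, and the message shapes are all assumptions made for illustration. A real deployment would put the same relay logic behind a WebSocket service.

```javascript
// Hypothetical in-memory stand-in for a signaling server.
// It relays messages between the peers in one room and never touches media.
class SignalingRoom {
  constructor() {
    this.peers = new Map(); // peerId -> callback that receives messages
  }

  join(peerId, onMessage) {
    this.peers.set(peerId, onMessage);
  }

  leave(peerId) {
    this.peers.delete(peerId);
  }

  // Forward a message to every peer in the room except the sender.
  send(fromId, message) {
    for (const [peerId, onMessage] of this.peers) {
      if (peerId !== fromId) {
        onMessage({ from: fromId, ...message });
      }
    }
  }
}

// Two peers exchange an offer and an answer through the room.
const room = new SignalingRoom();
const inboxAlice = [];
const inboxBob = [];
room.join("alice", (message) => inboxAlice.push(message));
room.join("bob", (message) => inboxBob.push(message));

room.send("alice", { type: "offer", sdp: "placeholder offer" });
room.send("bob", { type: "answer", sdp: "placeholder answer" });

console.log(inboxBob[0].type, inboxAlice[0].type);
```

The relay never inspects the payload. Offers, answers, and ICE candidates are opaque messages to it, which is why the browser does not care how you build this part.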
With that out of the way, here is the flow. You start by creating a peer connection. The only thing you tell it up front is which STUN server to use.
Google runs a free public one, and that is what most tutorials point at. Then you grab the user's camera and microphone and you add those tracks to the peer connection. This is what you want to send to the other side. Then you create an SDP offer. This is the description we talked about earlier. It says here is what I can do, what video and audio formats I support and what resolutions. You set it as your local description, meaning your side of the agreement, and you send it to the other peer through your signaling channel.
When the other peer's answer comes back, you set it as the remote description.
Both sides now agree on what the call will look like. Then, whenever your browser discovers a new ICE candidate, you forward it to the other peer, and they do the same to you. ICE keeps testing pairs until something connects.
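The on-screen code is not part of this transcript, so what follows is a reconstruction of the caller side under stated assumptions, not the original listing. `startCall`, the `signalingChannel` object (with `send` and `onmessage`), and the message shapes are hypothetical. The peer connection and media stream are passed in rather than created inside, so the sketch does not depend on browser globals; the comments at the bottom show how a browser would create them. The STUN URL is Google's public server, which is an assumption about which one the video used.

```javascript
// Caller side of the offer/answer flow, using standard RTCPeerConnection calls.
async function startCall(pc, localStream, signalingChannel, onRemoteStream) {
  // Send our camera and microphone tracks to the other side.
  for (const track of localStream.getTracks()) {
    pc.addTrack(track, localStream);
  }

  // Forward every ICE candidate the browser discovers to the other peer.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      signalingChannel.send({ type: "candidate", candidate: event.candidate });
    }
  };

  // Fires when the remote peer's audio and video start arriving.
  pc.ontrack = (event) => onRemoteStream(event.streams[0]);

  // Apply the answer and the remote ICE candidates as they come back.
  signalingChannel.onmessage = async (message) => {
    if (message.type === "answer") {
      await pc.setRemoteDescription(message.answer);
    } else if (message.type === "candidate") {
      await pc.addIceCandidate(message.candidate);
    }
  };

  // Create the SDP offer, store it as our local description, and send it.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signalingChannel.send({ type: "offer", offer });
}

// In a browser, the inputs would be created roughly like this:
//   const pc = new RTCPeerConnection({
//     iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
//   });
//   const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
//   await startCall(pc, stream, signalingChannel, (remote) => {
//     document.querySelector("video").srcObject = remote;
//   });
```

This sketch leaves out error handling, the answering side, and queueing of candidates that arrive before the answer, all of which a real application would need.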
And this last handler fires on the receiving side. When the other peer's video and audio actually start arriving, you point a video tag at the incoming stream and you have a working video call. And that is it. About 20 lines of peer-to-peer video. The hard part is not the API. The hard part is the signaling server you have to build and the architecture choices you make next.
One more thing before we move on. WebRTC is not just for video. There is something called RTCDataChannel that lets you send any binary data between peers on the same connection with the same low latency. This is how Figma syncs cursor positions. How some multiplayer games handle real-time game state between players. How peer-to-peer file transfer apps work in the browser. WebRTC is a real-time pipe. Video just happens to be the most popular thing to send through it.
Let me show you this actually working. The code we just looked at uses the camera. For this demo, we are using the same API but capturing the screen instead. Same WebRTC flow underneath. So here we have built a simple browser-based screen sharing system where two devices can connect directly with each other in real time. Inside the application we mainly have four core functions working together. initReceiver prepares the receiving device and generates a unique connection ID. getShareStream captures the desktop screen securely from the browser. startSharing establishes the P2P connection between both devices and starts a live stream. And finally, stopSharing cleanly disconnects the session and stops the stream whenever needed.
So, first we start a lightweight local server using python -m http.server, which hosts the application on port 8000. Then, using VS Code's built-in port forwarding feature, we expose that port publicly.
So, the application becomes accessible from anywhere through a shared link. Now on the mobile phone, we open the shared link and click on receiver. The system instantly generates a unique ID for that device. On the desktop side, we open the same link and select sender. Here we paste the ID generated from the mobile device and click on start sharing. So at this point, the application requests screen sharing permission from the desktop browser. Once allowed, a direct peer-to-peer connection gets established between both devices using WebRTC. And as you can see, the desktop screen is now streamed live directly to the mobile phone in real time without using any dedicated streaming server in between.
So overall, this demonstrates a lightweight real-time screen sharing solution running completely inside the browser with direct device-to-device communication.
So that is peer-to-peer WebRTC working end to end. Just two devices, no streaming server in between. All right.
Now, let us talk about why this peer-to-peer design falls apart the moment you try to scale it. When just two people are on a call, WebRTC works perfectly. One stream up, one stream down for each person. If you add a third person, now things start to get interesting. Each person has to send their video to the other two. So, everyone is uploading two streams and downloading two streams. Add a fourth person, three up, three down per person.
You spot the pattern. With n people in a call, every person has to upload n - 1 streams and download n - 1 streams. The total number of streams flying around the call grows as n × (n - 1). Quadratic. Let us put numbers on that. 10 people in a call means each person is uploading nine simultaneous video streams. Your home upload bandwidth probably cannot handle that.
Even if it can, your laptop is encoding nine separate copies of your face in real time. Pure peer-to-peer WebRTC works for two people, maybe three, maybe four if you're lucky. Beyond that, you need a different architecture. So, how does Zoom put 100 people in one call?
How does Discord put a thousand people in a stage channel? They do not use pure peer-to-peer. Nobody does in production.
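The arithmetic behind that is easy to check. Here is a small sketch of the full-mesh numbers for a few call sizes. The function name `meshStreams` and the 1.5 Mbps per-stream bitrate are assumptions picked for illustration; real bitrates vary with resolution and codec.

```javascript
// Stream counts for a full-mesh (pure peer-to-peer) call with n participants.
// mbpsPerStream is an assumed video bitrate used only to size the upload.
function meshStreams(n, mbpsPerStream = 1.5) {
  const uploadsPerPeer = n - 1;
  const downloadsPerPeer = n - 1;
  return {
    participants: n,
    uploadsPerPeer,
    downloadsPerPeer,
    totalStreams: n * (n - 1),
    uploadMbpsPerPeer: uploadsPerPeer * mbpsPerStream,
  };
}

for (const n of [2, 3, 4, 10]) {
  console.log(meshStreams(n));
}
```

At 10 participants this gives 90 streams in total and, under the assumed bitrate, 13.5 Mbps of upload per person, before counting the CPU cost of encoding nine copies.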
They use something called an SFU. SFU stands for selective forwarding unit.
The name sounds fancy, but the idea is actually simple. Instead of every peer talking to every other peer, every peer talks to one server in the middle, and that server is the SFU. You upload your video to the SFU exactly once. The SFU forwards a copy of your video to everyone else in the call. You upload one stream, you download n - 1 streams. Linear, not quadratic. And suddenly 100 people in a call is not insane anymore.
And here's the cool part. The SFU does not decode your video. It does not re-encode it. It does not even really look at the contents. It just rewrites a few packet headers and ships your packets to the right destinations. Cheap and fast. Basically, a smart router for media packets.
And because it is sitting in the middle, it can do things no peer-to-peer setup could ever do. For example, imagine you are in a Zoom call with 20 people and everyone is shown in a small box on the grid. There is no point sending your full HD video to someone who is only seeing a postage-stamp-sized version of you. So the SFU does not. It sends them a low-resolution copy instead, saving bandwidth on both sides. This trick is called simulcast. Your client actually sends two or three quality versions of your video at the same time, and the SFU picks the right one for each viewer based on how big your video is showing up on their screen. If someone is muted, the SFU just stops forwarding their audio. That saves bandwidth across the entire call. If one viewer's connection is dropping, the SFU can downgrade just that one viewer to lower quality without affecting anyone else. And this is what people mean by selective forwarding. The SFU is making smart per-viewer decisions about what to send.
Now, there used to be another architecture called the MCU, the multipoint control unit. The MCU would actually decode every incoming stream, mix them all into one composite video, and send that one stream out to everyone. It worked, but decoding and re-encoding video at scale is incredibly expensive. The CPU cost is roughly 10 times that of an SFU. In 2026, almost nobody uses an MCU anymore, except in some old corporate phone systems where you have to bridge a video call with regular phone numbers.
So in practice, you have two real choices. Peer-to-peer for tiny calls, SFU for everything else. And every serious video product you have ever used has chosen the SFU route. Now, let me show you what that actually looks like at scale. Take Discord for example.
Hundreds of millions of users, voice channels with sometimes thousands of people listening at once. How does the architecture actually fit together?
So when you join a voice channel, your Discord client does not connect directly to other users. It does not even connect to one central Discord server. It connects to a regional voice server, a server that lives in the data center geographically closest to you. Discord runs these all over the world. Northern Virginia, Frankfurt, Singapore, Mumbai, Sydney.
The goal is to keep the network round trip short, because round trip is everything in real-time audio. Now, that regional voice server is essentially an SFU. Discord uses Elixir for the signaling layer, the part that sets up and tears down calls, manages who is in the channel, and handles the matchmaking. C++ handles the actual media streaming, the hot path where every microsecond counts. The core job is the same as any SFU.
Take incoming audio packets from the clients, forward them to the right other clients. Do not decode. Do not re-encode. Just route, at the speed of light if possible. Now, here's the part most engineers get wrong about Discord.
People assume that because WebRTC is peer-to-peer, their voice in Discord goes directly to their friend's machine.
It does not. Your voice goes to Discord's voice server and Discord forwards it. So, your audio absolutely passes through Discord's infrastructure.
WebRTC in production almost never means peer-to-peer. The voice server only handles the audio. Everything else, the chat messages, the typing indicators, the presence updates, the push notifications, all of that flows over Discord's normal API and WebSocket infrastructure. Discord uses WebRTC only for the one job it is good at: real-time, low-latency audio. Everything else uses different tools. Now, there is one more piece worth knowing about. When you speak, your client encodes your voice with the Opus codec, which is the audio codec the WebRTC ecosystem standardized on. Opus is really good at this. It can drop down to 6 kilobits per second when bandwidth is tight and scale up to studio quality on a good connection. The voice server forwards your Opus packets without decoding them. Each listener decodes locally. So, Discord's servers, even though they are in the middle of every voice call, never actually listen to a single word. Notice the pattern.
Nobody runs production video or voice at scale with pure browser-to-browser WebRTC. The browser API just gives you the basic building block. The architecture, the regional SFUs, the codec choices, the signaling layer: that is what makes it actually work for hundreds of millions of users. And that gap between the API and the actual architecture is where the real engineering work sits.
Now, all of this sounds clean on paper. In real production, though, there are a bunch of things that will catch you off guard. So let me walk you through them so you do not learn the hard way.
TURN servers cost real money. Remember when I said about 10 to 15% of connections cannot establish a direct path? That is because some networks are extra hostile to peer-to-peer: strict corporate firewalls, office networks that block UDP entirely, certain mobile carrier setups. For those connections, you need a TURN server to relay every single packet between the two peers. And TURN relays full media bandwidth in both directions. If you have a thousand simultaneous calls and 100 of them need TURN, that is real money and real ops work. Now, the standard self-hosted option is a project called coturn, but operating it well is its own job.
The signaling server is your problem. WebRTC does not ship one. You build it yourself, usually as a small WebSocket service. The message format is up to you. JSON works fine for most teams. The browser does not care, as long as offers, answers, and ICE candidates make it from one peer to the other.
And mobile networks hate UDP. Some carriers throttle it. Some block it on certain ports. So your TURN server also has to support relaying over TCP and over TLS as fallback transports. The idea is, if UDP does not work, try TCP, which most networks allow because it is what HTTP uses. Otherwise, your app silently breaks for some percentage of users on mobile data, and you will not know why until the support tickets pile up.
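One way to express that fallback is in the `iceServers` list handed to the peer connection: the same TURN server is listed once per transport, so ICE can fall through from UDP to TCP to TLS. This is a sketch under assumptions. The helper name `buildIceServers`, the host `turn.example.com`, and the credentials are placeholders, and the ports are the conventional TURN defaults, not something the video specifies.

```javascript
// Build an iceServers list that offers TURN over UDP, TCP, and TLS.
// All values passed in are placeholders for your own deployment.
function buildIceServers({ turnHost, username, credential }) {
  return [
    // STUN on the same host, for discovering the public address.
    { urls: `stun:${turnHost}:3478` },
    {
      urls: [
        `turn:${turnHost}:3478?transport=udp`, // preferred: lowest latency
        `turn:${turnHost}:3478?transport=tcp`, // fallback when UDP is blocked
        `turns:${turnHost}:5349?transport=tcp`, // fallback over TLS
      ],
      username,
      credential,
    },
  ];
}

const iceServers = buildIceServers({
  turnHost: "turn.example.com", // hypothetical host
  username: "demo-user", // hypothetical credentials
  credential: "demo-secret",
});

console.log(JSON.stringify(iceServers, null, 2));
// In a browser: new RTCPeerConnection({ iceServers });
```

Listing the TLS variant matters most on locked-down networks, since that traffic tends to look like ordinary HTTPS to a firewall.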
Moreover, Safari and Chrome do not always negotiate codecs the same way.
iOS has its own special rules around when getUserMedia works. So you will have to fix these bugs the hard way.
Recording is a separate problem entirely, because SFUs forward packets. They do not decode them. So if you want to record a call, you need a separate component that joins the call as a participant, decodes everything, and writes a file. This component is called a recorder, and it is its own little distributed system. These are the things that turn a working demo into a real product. So do budget for them.
If there is one thing I want you to remember from this video, it is this. WebRTC is how two peers talk. The SFU is how many peers talk efficiently. Everything else, STUN, TURN, ICE, SDP, all of it is just plumbing to make those two ideas actually work on the messy real internet. In the next video, we can go one level deeper and actually take apart Discord's full system design. How they handle a thousand people in a single voice channel. I usually save this kind of breakdown for my cohort students, but if enough of you want it on this channel, drop a comment below and I'll make it happen. Do like and subscribe so you do not miss it. I'll see you in the next one.