The author masterfully simplifies complex architectural trade-offs into a high-signal primer for the modern software engineer. It is a rare example of technical communication where the clarity of the explanation matches the massive scale of the systems discussed.
Deep Dive
System Design Interview Course | Learn Important Fundamentals w/ google Engineer
Hey everyone, welcome to Anap. This is your fan guy. Today we'll be learning system design fundamentals. In our previous video, we learned why we need system design in the first place. In this video we'll cover a few fundamentals of system design: scalability, availability, consistency, partitions and more. So let's move forward and start with scalability. Before asking why a system demands scalability, let's first understand what scalability is. Scalability is a system's ability to handle increasing load without dying. That increasing load could come from a growing number of users on your website, from more requests to your backend services, or it could be a DoS attack as well.
Just kidding.
Scalability has two important concepts in it: vertical scaling and horizontal scaling. Vertical scaling is very simple. We scale up. It means you add more resources, like CPU and memory, to your existing machines or servers.
Horizontal scaling is also very simple. We scale out. We add more machines in parallel to process the increased number of requests.
Let's try to understand this with a very simple analogy. Imagine we have a restaurant with a seating capacity of 10 people, say five tables where a total of 10 people can sit, and it can manage up to 50 customers per hour. For this restaurant you have one main chef, Coco, and two waiters.
This configuration works fine if we have around 50 customers per hour. But let's say your restaurant goes viral and is now receiving 100 to 200 customers per hour, which is very difficult to manage with one chef and two waiters, right? So the idea is simple: pause the video and think about what we can do here.
Okay. So we have two options. One option is to get a bigger kitchen, right? You increase the size of the kitchen, which essentially means you scale vertically. Within the same restaurant you buy a bigger kitchen or maybe hire a super chef: a chef with more experience who can make more dishes in the same period of time. Essentially you are upgrading your existing resources.
The other option is to have more chefs. You increase the number of chefs and the number of waiters as well to manage the orders. You can also open more branches of your restaurant, say two or three more branches in that area, so that the traffic, the footfall, is distributed. This is how modern systems scale, right? You either scale vertically or you scale horizontally.
Now, before we start on availability, let's discuss another important aspect of scalability, which is called elasticity.
So elasticity is the ability of the system to scale automatically. For example, on any e-commerce or quick-commerce website during the IPL, during Black Friday sales, or during the Big Billion Days on Flipkart, you would see a tremendous number of requests, with users trying to access the website and order at the same time. So we want the resources to increase automatically, and this is how modern systems actually work. You configure them so that they can grow and shrink automatically. You don't need to step in every time and scale manually, either vertically or horizontally.
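As a minimal sketch of the grow-and-shrink idea, the function below picks a server count from the current load. The name `desired_servers`, the per-server capacity of 500 requests per second, and the minimum and maximum bounds are all illustrative assumptions, not values from the video or from any real autoscaler.

```python
import math


def desired_servers(requests_per_second: float,
                    capacity_per_server: float,
                    min_servers: int = 2,
                    max_servers: int = 50) -> int:
    """Return how many servers to run for the current load.

    Hypothetical rule: provision just enough servers to cover the load,
    but never go below min_servers or above max_servers.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed = math.ceil(requests_per_second / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


if __name__ == "__main__":
    # Quiet period, sale-day spike, extreme spike (hits the cap), back to normal.
    for load in (300, 5_000, 40_000, 800):
        servers = desired_servers(load, capacity_per_server=500)
        print(f"{load} rps -> {servers} servers")
```

The lower bound keeps some spare capacity during quiet hours and the upper bound caps cost during a spike; real autoscalers usually add cooldown periods as well so the fleet does not flap.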
So this is how we introduce elasticity in a system. Now let's move to the next topic, which is availability. Here we ask ourselves this question: can users access the system when they need it? A very simple analogy would be an ATM. Imagine an ATM that works 24/7. If it works 24/7, the availability is very high, right? It is always available to you; whenever you step in you can withdraw cash. If it goes down every weekend, or for some time during the day, the availability is poor, which is our typical Indian ATM, right? So anything which is always available is called highly available, and something which is available but not all the time has low availability.
The availability formula is very simple: total uptime divided by total time, where total time is uptime plus downtime. Now let's look at some availability numbers. These come up a lot in system design interviews, and you call them two nines, three nines, four nines and five nines availability.
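A rough sketch of the availability arithmetic, assuming a 365-day year: `availability` is uptime divided by total time, and `downtime_budget_minutes` turns a percentage such as 99.9 into the downtime it allows per year. Both function names are made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def availability(uptime: float, downtime: float) -> float:
    """Availability = uptime / (uptime + downtime), in any consistent unit."""
    return uptime / (uptime + downtime)


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                       ("four nines", 99.99), ("five nines", 99.999)]:
        minutes = downtime_budget_minutes(pct)
        print(f"{label:12s} {pct:7.3f}%  {minutes:9.2f} min/year "
              f"({minutes / 60:.2f} h, {minutes / 1440:.2f} days)")
```

Two nines works out to about 3.65 days per year, three nines to about 8.76 hours, four nines to about 52.6 minutes, and five nines to about 5.26 minutes.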
When we say two nines availability, the digit nine appears twice, so it means 99%. That means the system can experience downtime of roughly three and a half days per year: over 365 days, my system is unavailable for 3.65 days. It could be due to maintenance, an attack on the system, or any internal failure. When we say three nines availability, which is 99.9%, it's about 8.76 hours per year, and similarly we have four nines and five nines. Typically, modern platforms like AWS and Google Cloud and most of the top tech companies aim for four nines or five nines availability. Maintaining this takes a lot. Your system has to be very reliable to be available 99.999% of the time, which gives you a budget of roughly five minutes per year to be unavailable. In practice that is very hard to achieve. You must have already heard that some AWS services were down, we had very recent news, right? Also last year GCP was down for two to three hours. So we try to keep our availability budget, our SLO, to the point.
There are some other concepts within this. The first is high availability. When we say our system should be highly available, it simply means the system should survive failures. Even if one server dies, users shouldn't notice it. Let's say you have 10 servers and one of them dies due to an out-of-memory issue. The user experience shouldn't degrade noticeably; users should still be able to interact with your website or servers.
The second concept is failover. Failover means automatically switching to a standby. Let's say your primary DB, which handles all the writes, crashes.
Your replica should then be automatically promoted to primary; that is the failover mechanism. Similarly we have RTO and RPO. RTO is the recovery time objective: how quickly your system should recover. Let's say your system dies and we have an RTO of 5 or 10 minutes set up, so that we recover the system quickly.
I can spin up more machines or redo the work so that our systems are back in place. Then we have RPO, the recovery point objective, which is how much data loss is acceptable. For example 30 seconds, 20 minutes or 1 hour: if the system goes down, how much data loss are we fine with? These terms are quite popular in system design interview discussions.
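To make RTO and RPO concrete, here is a small sketch that checks one outage against both objectives. The `Incident` record and the `evaluate` helper are hypothetical names for this illustration; the timestamps are plain seconds on an arbitrary clock.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    last_durable_write_at: float  # last moment whose data survived the failure
    failed_at: float              # when the system went down
    recovered_at: float           # when it was serving traffic again


def evaluate(incident: Incident, rto_seconds: float, rpo_seconds: float) -> dict:
    """Compare one outage against the recovery time and recovery point objectives."""
    downtime = incident.recovered_at - incident.failed_at
    data_loss_window = incident.failed_at - incident.last_durable_write_at
    return {
        "downtime_s": downtime,
        "data_loss_window_s": data_loss_window,
        "rto_met": downtime <= rto_seconds,          # recovered fast enough?
        "rpo_met": data_loss_window <= rpo_seconds,  # lost little enough data?
    }


if __name__ == "__main__":
    outage = Incident(last_durable_write_at=0.0, failed_at=20.0, recovered_at=260.0)
    # Objectives: recover within 5 minutes, lose at most 30 seconds of data.
    print(evaluate(outage, rto_seconds=300, rpo_seconds=30))
```

RTO looks forward from the failure (how long until service is back) while RPO looks backward (how far behind the surviving data is), which is why the two are measured from the same `failed_at` moment in opposite directions.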
So, interviewers love RTO and RPO discussions if you are designing systems which demand high availability. Now let's discuss the third topic, consistency, another very important concept, which simply means that all users see the same data.
Now imagine you have a Google Doc. You type hello, and your friend, the other person viewing this doc, immediately sees hello.
This is called strong consistency. So now we know that consistency simply means all users see the same data, but this definition has a hidden gap in it, right? All users see the same data, but at the same time, or is it fine to see the same data after some interval? When we say strong consistency, we want the data to be the same everywhere as soon as it changes. As soon as there's a change in the data, it should be visible to all systems and all users. Let's say I changed my profile picture: as soon as I make the change, everyone should be able to see my new profile picture.
Or imagine Instagram. You uploaded a photo and your friend sees it, let's say, 2 to 3 seconds later. This is called eventual consistency, and we are fine with it.
A common example of strong consistency is your bank database. Let's say you added 10 rupees to your bank account, and right after that you want to pay someone those 10 rupees. If that 10-rupee credit is not yet visible in the database when you try to pay, the system is inconsistent, right? With eventual consistency, on the other hand, we say we are fine if the data is not visible everywhere as soon as it was written to the database.
Okay. So now let's discuss why consistency becomes hard to solve, or hard to discuss during an interview.
So let's say you have a primary database and multiple replicas of it. Let's say you have two replicas of your primary database, right?
Imagine this primary database holds the number of followers you have on LinkedIn. Let's say I had around 69k followers.
Then, just after watching this video, let's say a thousand folks went ahead and followed me. Now my primary database has around 70k followers, but my read replicas, both of them, still have 69k.
So this creates a consistency gap, right? Your primary database has the most recent number of followers, but your replica databases, which are like snapshots of your primary database, have fewer. So which answer should the user see, 69k or 70k? That depends on the consistency we want our system to have. If you're fine with eventual consistency, there is no harm in seeing 69k, because nothing breaks. This is not a P0 thing that we need right away. It's not like a banking system where you want the data to be available as soon as the transaction is made. So it's fine. Replication definitely comes with consistency challenges, and you have to work out whether the system demands strong consistency or whether eventual consistency is fine.
Okay, so let's also discuss consistency levels other than strong consistency and eventual consistency. Those two are very important terms and you will hear them a lot from your senior engineers or during system design discussions in your team. Another consistency level is causal consistency. It is simply cause and effect: you preserve the order. For example, a comment appears after its post. The system should not show comments before the post itself is available. I know it's a very basic example, but yeah. And the other one is read-your-writes.
And believe me, this is very common. You might even have noticed it: users see their own updates immediately, but other folks see them at a later point in time. This gives the person who is posting the feeling that their post is already available, but for everyone else it is eventual consistency: it arrives after a certain period of time, say 2, 3 or 5 seconds. This is very common in large systems like Google, Facebook, Meta. Whenever you tweet something, you immediately see it in your feed, right? Let's say you post something on LinkedIn: it comes straight to the top of your feed, but it takes some time to propagate, to fan out to others' feeds. That's why it's called read-your-writes.
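The follower-count example and the read-your-writes behaviour can be sketched with a toy primary and one lagging replica. Everything here (`Primary`, `Replica`, `Session`, the version counter, the explicit `sync_from` call) is a simplified assumption for illustration; real databases replicate continuously and track positions differently.

```python
class Primary:
    """Accepts all writes and bumps a version number on each one."""

    def __init__(self) -> None:
        self.data: dict = {}
        self.version = 0

    def write(self, key, value) -> int:
        self.version += 1
        self.data[key] = value
        return self.version


class Replica:
    """A copy of the primary that only catches up when sync_from() runs."""

    def __init__(self) -> None:
        self.data: dict = {}
        self.version = 0

    def sync_from(self, primary: Primary) -> None:
        self.data = dict(primary.data)
        self.version = primary.version


class Session:
    """One user's view, with a read-your-writes guarantee."""

    def __init__(self, primary: Primary, replica: Replica) -> None:
        self.primary = primary
        self.replica = replica
        self.last_written_version = 0  # newest version this user wrote

    def write(self, key, value) -> None:
        self.last_written_version = self.primary.write(key, value)

    def read(self, key):
        # Use the replica only if it already contains this user's own writes;
        # otherwise fall back to the primary.
        if self.replica.version >= self.last_written_version:
            return self.replica.data.get(key)
        return self.primary.data.get(key)


if __name__ == "__main__":
    primary, replica = Primary(), Replica()
    author, viewer = Session(primary, replica), Session(primary, replica)

    author.write("followers", 69_000)
    replica.sync_from(primary)
    author.write("followers", 70_000)        # replica has not caught up yet

    print("author sees", author.read("followers"))
    print("viewer sees", viewer.read("followers"))
    replica.sync_from(primary)               # replication catches up
    print("viewer later sees", viewer.read("followers"))
```

The author always reads at least their own latest write, while the viewer may briefly see the older count until the replica syncs, which is the eventual-consistency window described above.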
Okay. So now let's discuss another very important concept: throughput and latency. Bandwidth we'll discuss later. As you can see in this picture, you might have already understood what latency and throughput are. If not, let's look at latency first. Latency simply means how much time a request takes end to end. Let's say you make a search request and it takes 100 milliseconds. From the moment you hit enter on Google search for a query to the moment you receive the output, that is the latency of that request.
If that is 100 milliseconds, we say the latency of this request was 100 milliseconds. And what is throughput? Throughput simply means how many requests were processed per second. At the same time you were querying, how many concurrent users were doing the same thing? How many requests were getting processed per second? Let's say it was 10,000 queries per second.
So we say throughput is 10k QPS which is queries per second.
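A small sketch of how the two numbers differ, assuming we have a log of request start and end times in milliseconds. The `summarize` helper and the log format are hypothetical; the point is only that latency is measured per request while throughput is measured over a time window.

```python
def summarize(requests: list[tuple[int, int]]) -> dict:
    """Compute average latency and throughput from (start_ms, end_ms) pairs."""
    if not requests:
        raise ValueError("need at least one request")
    latencies_ms = [end - start for start, end in requests]
    # Observation window: first request start to last request end.
    window_ms = max(end for _, end in requests) - min(start for start, _ in requests)
    if window_ms <= 0:
        raise ValueError("observation window must be positive")
    return {
        "avg_latency_ms": sum(latencies_ms) / len(latencies_ms),
        "throughput_rps": len(requests) / (window_ms / 1000),
    }


if __name__ == "__main__":
    # Four requests observed over one second, each taking 100 ms.
    log = [(0, 100), (0, 100), (500, 600), (900, 1000)]
    print(summarize(log))
```

Here every request has the same 100 ms latency, yet the throughput is 4 requests per second; adding more concurrent requests would raise throughput without necessarily changing latency.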
Okay. I also want to throw in an analogy, so let's consider the restaurant again, right? I love it. The restaurant analogy fits almost every system design concept. In a restaurant, latency is simply how long the customer waits to get their order, and throughput is how many orders the restaurant is serving in that time. Let's say the order wait time is five minutes: that's the latency. I receive the order after five minutes. You might have seen it on the display board, right? You see your order getting prepared and the expected time keeps reducing, 4 minutes, 3 minutes; overall let's say it's five minutes. And the throughput is how many orders are listed there, how many orders the restaurant is serving at the same time. Let's say it is serving 100 customers per hour.
Okay, now comes a very important concept called latency metrics. You must be hearing these terms a lot: what is the P50, what's the P90, what's the P99 for this particular API? In observability work at a company you will deal with these numbers a lot of the time, so let's discuss them. These terms are typically used to describe latency: how long it takes for a system or service to respond to a request. The numbers are P50, P75, P90, P95 and P99, and they refer to percentiles: P50 is the 50th percentile, P75 is the 75th percentile. Believe me, this gives you a better sense of how response times are distributed in a typical distributed environment.
So let's discuss P50, the 50th percentile. This is the median. Half the people get their food faster than this, and half the people wait longer. Again the same analogy, I want to keep it here. The example is very simple.
Let's say out of 100 customers, 50 people got their food in less than this time, let's say 1 minute. So if I say that for a particular API the P50 is 1 second, it means 50% of the users had a response time under 1 second and the other 50% had to wait longer than 1 second. We still don't know the upper bound, right? We know that 50% were below 1 second; for the remaining 50% we don't know how high it went, only that they were above 1 second. So let's say P50 is 15 minutes: it means half of the orders were served within 15 minutes or less.
Now let's consider P75. This is the point where 75% of the people got their food faster than this time. Say it is 20 minutes: 25% of the people had to wait more than 20 minutes. Again, this gives a better idea of the distribution, right?
And then P99. P99 is a very important metric, and we also call it tail latency, because we have much more confidence when we quote it. Let's say for a get-food API in our restaurant the P99 is 5 seconds, or, to give a realistic number for food, let's say 40 minutes.
So if P99 is 40 minutes, it means 99% of the customers receive their order in less than 40 minutes and only 1% had to wait longer than this, right? This gives a clear picture of how well a system is performing, and it is critical in large distributed systems. P99 tells you that most users are happy but some users are miserable: 1% of people had to wait more than 40 minutes. Most of the time P99 reveals hidden critical issues within a system. This is very popular and is also called the tail-at-scale problem, a very famous concept, because by studying this 1% of requests you get to know what is going wrong in your system that you need to fix. Perfect. Let's move forward.
And then this question comes: why are latency metrics useful? I know you must have got the idea already, but I want to continue on this. If you look at P99, you see that there's a small group of people who experience a much longer wait time. In a system design context that means a few users might experience worse latency than most of the others, and you might need to work on improving the system for that small group, as we discussed just now. P50 and P75, on the other hand, are often used to describe what the majority of users experience. Perfect.
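The percentile idea can be sketched with the nearest-rank method, which is one common way to compute percentiles (monitoring tools often interpolate or use histograms instead). The `percentile` helper and the wait times below are invented for illustration: 100 orders, most served quickly, one very slow.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical wait times in minutes for 100 orders.
    waits = [10] * 50 + [18] * 25 + [30] * 24 + [90]
    print("mean:", sum(waits) / len(waits))
    for p in (50, 75, 99, 100):
        print(f"P{p}:", percentile(waits, p))
```

With this made-up data the mean and P50 look healthy, and only the highest percentiles expose the one customer who waited far longer, which is why tail metrics are tracked separately from averages.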
And then there's a very important concept called Little's Law, right? Let's try to understand why Little's Law exists.
Okay. Now imagine you're running a payment service where a thousand requests arrive every second, so your QPS is 1k. And let's say each request takes 2 seconds to process, so the latency is 2 seconds.
So a natural question that can come to you is: how many requests are sitting inside your system right now? If you are a senior engineer, or let's say an SDE 2, you know what we are talking about: what is the number of in-flight requests? One might ask why this is important. Why do I care how many requests my server is currently dealing with? This matters because every request comes with a cost, right?
It consumes some part of your CPU. It consumes some part of your memory. It consumes your active threads. It takes a hit on your DB connections.
So if too many requests are waiting to be executed or executing in parallel, it tells you your system is probably under heavy load, or it might even crash.
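Little's Law answers the in-flight question directly: average requests in the system equals arrival rate times average time in the system. The sketch below applies it to the payment-service numbers; the function names are made up, and treating one in-flight request as one worker thread is a simplifying assumption that only holds for a blocking, thread-per-request server.

```python
import math


def in_flight_requests(throughput_rps: float, latency_ms: float) -> float:
    """Little's Law: L = lambda * W (arrival rate times time spent in the system)."""
    return throughput_rps * latency_ms / 1000


def threads_needed(throughput_rps: float, latency_ms: float) -> int:
    """Assumes one blocking worker thread per in-flight request."""
    return math.ceil(in_flight_requests(throughput_rps, latency_ms))


if __name__ == "__main__":
    # 1k requests per second, each taking 2 seconds.
    print(in_flight_requests(1_000, 2_000))
    # 5k requests per second at 200 ms each.
    print(in_flight_requests(5_000, 200), threads_needed(5_000, 200))
```

The first case gives 2,000 requests in flight and the second gives 1,000, so a faster service can hold fewer requests in flight even at five times the traffic.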
To calculate this number, the math is very simple: you just multiply the throughput by the latency, and you get the number of requests that are in flight, sitting in your system right now. Suppose you have 5,000 requests arriving per second and the average latency is 200 milliseconds, which is 0.2 seconds. Your concurrency is the product: a thousand requests are actually being processed at that moment, with 5,000 requests per second and a latency of 0.2 seconds. Even if we take this as the P99 latency, I can roughly estimate that about a thousand requests are being processed. This immediately tells us how many threads might be needed to process these requests. We ask this question before actually designing the system or the API: we ask the interviewer how much QPS is expected, and then we can reverse engineer how many active threads are needed. That tells you how many CPU cores are required, and it also gives you an idea of how many DB connections we might need to open, and of the memory requirement, but memory is cheap.
Okay, the last topic of this video is partitions. Partitions are very important. I believe it's one of the most important concepts in distributed systems, and you will know it even better when we learn the CAP theorem and PACELC in the next video. But first let's learn what a partition is. A network partition occurs when two systems cannot communicate. Both systems might be live, but they are not connected. Let's simply assume you have two data centers, one in Mumbai and one in Delhi. They usually communicate with each other and work fine. But on a random day the network link got broken, or let's say cables were cut. Now both systems are alive, there's no problem inside either data center, but the connectivity is lost. This is a partition: the connection is broken, neither side can talk to the other, but both are still alive. Only the network is broken.
And you might ask, why is this so important? Yes, they are not connected, but they are still working, right? They can both serve traffic. One might say the servers are alive, so things work. But no: in reality network failures lead to major problems. You can say network failures happen rarely, but your system should be resilient. It should be reliable under partition. And this gives birth to a very important concept, the CAP theorem. The system should be tolerant to partitions.
Okay.
Let's also talk about latency versus partition, which will give you more of an idea. A slow network often looks like a partition, right? Let's say you're trying to listen to a song on YouTube Music. You request a song and it's taking some time to start. You might say the internet is slow, and from a system design perspective you might think there's a partition, that the servers are disconnected. When this happens, a distributed system must choose either to wait longer or to fail fast. This comes up in situations where some systems are unresponsive because there's a partition, some breakage between the two servers.
So there's a design choice. Either we wait longer, let's say 10 minutes, which is not realistic, so let's say 10 seconds, or we fail fast and report a failure: say the song was not found or the servers are not reachable. This decision leads to the CAP theorem, which is very important, and we'll be learning about it because it lays the foundation of your design choices. A lot of the non-functional requirements that you gather during a system design interview are based on the CAP theorem. We'll also learn the advanced version of CAP, which is PACELC. The CAP theorem talks about consistency, availability and partition tolerance, and it says that in real life you never compromise on partition tolerance, so your system should always work despite a network partition.
Right? It should be partition tolerant.
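The trade-off between consistency and availability during a partition can be sketched with a toy node that behaves differently depending on a mode flag. The `Node` class, the `"CP"` and `"AP"` mode strings, and `PartitionError` are hypothetical names for this illustration; real systems make this choice through replication and quorum settings rather than a single flag.

```python
class PartitionError(Exception):
    """Raised when a consistency-first node cannot reach its peer."""


class Node:
    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data: dict = {}
        self.peer_reachable = True  # flips to False during a network partition

    def read(self, key):
        if not self.peer_reachable and self.mode == "CP":
            # Consistency first: refuse to answer rather than risk stale data.
            raise PartitionError("peer unreachable, failing fast")
        # Availability first (or no partition): answer from the local copy,
        # which may be stale if the peer accepted newer writes.
        return self.data.get(key)


if __name__ == "__main__":
    mumbai_cp, mumbai_ap = Node("CP"), Node("AP")
    for node in (mumbai_cp, mumbai_ap):
        node.data["song"] = "v1"
        node.peer_reachable = False  # the link to Delhi is cut

    print("AP node answers:", mumbai_ap.read("song"))
    try:
        mumbai_cp.read("song")
    except PartitionError as err:
        print("CP node refuses:", err)
```

The availability-first node keeps serving a possibly outdated value, while the consistency-first node gives up availability and fails fast, matching the wait-or-fail choice described above.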
So we always keep the P and negotiate between consistency and availability. It says that you can either have consistency with partition tolerance or availability with partition tolerance. You can't pick consistency and availability together and compromise on partition tolerance. No, you can't, because we just discussed this, right? If your two data centers are not connected and both keep serving traffic, you can't guarantee consistency, because the two data centers might have different data. And if you insist on consistency, you can't guarantee availability at the same time, because a data center that cannot reach the other has to stop responding. We'll talk a lot more about this in the next video.
Also, if you're liking the discussion, or if you see anything that I'm missing, anything you wish had been in the video, explained a different way, with more examples or more visuals, I would love to see that feedback. Please put it in the comments. Believe me, a lot more quality content is planned and will be posted soon. Please like, share, subscribe and hit that hype button so that I stay motivated and bring more advanced concepts in system design, and make your system design interview feel easy peasy. Chalo, see you.