A masterful distillation of cloud resilience that elegantly balances the mathematical rigor of "nines" with the practical constraints of the CAP theorem. It successfully transforms complex infrastructure trade-offs into a clear, high-level strategic framework.
Availability in cloud services
[music] Hello and welcome back to the podcast where we take the biggest ideas in technology and break them down into something that actually makes sense.
I'm your host, and today we are diving into a topic that silently affects every single app we use every day, usually without us realizing it: availability in the cloud. Here is a scenario that every college student has felt at some point. It is the last hour before your assignment deadline. You open the submission portal. The page just does not load. You call your friend, but nothing helps. That feeling of helplessness, that right there is an availability failure. Today we are going to understand exactly what causes it, how the world's biggest cloud platforms try to prevent it, and what you as a future engineer need to know. I have a brilliant guest with me today, someone who is getting hands-on with AWS and Azure and has some [inaudible] to share. Welcome to the show.
Thanks for having me. And honestly, I wasn't just a student reading about availability. I lived through it firsthand [inaudible] redundancy, so I have personal reasons to care about this.
We are going to hear that story and more. Let's get right into it, starting from the very beginning. If someone is hearing about availability in a cloud context for the first time, how would you define it?
I will give the simplest possible definition. Availability is the answer to the question: is my service up and running right now, and can users actually access it? That's it. It is about whether your system is operational and reachable at any given moment, and it is always expressed as a percentage of time.
So if my app is running 99 out of 100 hours, that's 99% availability?
Exactly. And this is where things get really interesting, because the percentage sounds great on paper until you calculate what it actually allows in terms of real downtime. A service with 99% availability can be down for about 87 hours over an entire year. That's more than three and a half full days every year.
And you were promised 99% [inaudible]. That sounds like a lot of downtime for something that's supposed to be reliable, right?
And that's why the cloud industry uses a concept called the nines. It represents how many nines there are in the percentage. 99.9% is three nines, 99.99% is four nines, and 99.999% is five nines. The gap between each level is massive. Three nines allows about 8 hours and 45 minutes of downtime per year. Four nines brings that down to about 52 minutes per year. And five nines, 99.999%, is only 5 minutes and 15 seconds of downtime in the entire year.
Five minutes for the whole year. Achieving 5 minutes of downtime per year sounds almost impossible. What does it take?
It takes extraordinary engineering. At five nines you cannot afford manual intervention. Humans are too slow. Every failure detection, every rerouting, every recovery step has to happen automatically within seconds, without anyone pressing a button. It requires automated failover and global monitoring running 24 hours a day.
And not every application needs five nines, right? There must be a trade-off.
Absolutely. Each additional nine costs exponentially more. Going from three nines to four nines can double the infrastructure cost. Going to five nines might multiply it several times. So you always start by asking: what downtime is actually acceptable for this specific application? A [inaudible] system can tolerate some downtime, but a hospital cannot.
So it's a business decision as much as a technical one?
Completely. And that business decision gets written into a Service Level Agreement, or SLA. This is a formal contract where the cloud provider commits to a specific availability level. AWS, for example, promises [inaudible] for services like [inaudible] when they are run with the proper configuration. If they miss that commitment, they owe you service credits. It's legally binding.
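The downtime figures quoted for each level of nines can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a 365-day year; the function names are illustrative and do not come from the episode.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap, 365-day year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Maximum downtime per year permitted by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a number of seconds as hours, minutes and seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Two, three, four and five nines
for target in (99.0, 99.9, 99.99, 99.999):
    downtime = allowed_downtime_seconds(target)
    print(f"{target}% -> {format_duration(downtime)} of downtime per year")
```

Under that assumption, 99% works out to 87 hours 36 minutes, 99.9% to about 8 hours 45 minutes, 99.99% to about 52 minutes, and 99.999% to about 5 minutes 15 seconds.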
So the nines are not just marketing numbers. They are financial accountability.
Exactly. And that accountability is what lets enterprises commit to building critical systems on cloud infrastructure. They know what they are signing up for.
Let's get into the engineering. What does the architecture actually look like for highly available cloud systems?
The foundational principle is: presume everything will fail. Your servers will fail. Your hard drives will fail. Your network switches will fail. Even entire data centers can go offline in a disaster. High-availability architecture is about designing so that no single failure can bring your entire system down. We call this eliminating single points of failure.
So how do we eliminate a single point of failure?
You start from the ground up, with redundancy. Redundancy is simply having more than one of everything critical: two servers instead of one, two network paths, two power supplies. If one fails, the other takes over. That is the basic idea.
But redundancy at the server level isn't enough if both servers are in the same building and both lose power.
Yes, which is why cloud providers divide their regions into availability zones, or AZs: multiple physically separate data centers within the same geographic area. Each AZ has its own independent power and network. The AWS Mumbai region, for example, has three. They are close enough to have low latency between them, but isolated enough that a fire or outage in one won't affect the others.
So you can deploy your application across multiple AZs simultaneously?
Yes. This is called a multi-AZ deployment, and it's standard for any production application. You have servers running in several AZs at the same time. A load balancer sits in front of them and distributes user traffic across them. The load balancer also constantly checks whether each server is healthy. Once a server goes completely down, the load balancer instantly stops sending traffic to it and routes everything to the healthy servers. Users might feel a split-second hiccup, but the app stays up.
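The health-check behaviour described here can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Server` and `LoadBalancer` classes, the round-robin policy, and the server names are hypothetical and are not how any particular cloud load balancer is implemented.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    name: str
    healthy: bool = True


class LoadBalancer:
    """Toy round-robin balancer that skips servers failing health checks."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._counter = count()

    def run_health_checks(self, probe):
        # probe is any callable returning True when a server responds correctly
        for server in self.servers:
            server.healthy = probe(server)

    def route(self) -> Server:
        healthy = [s for s in self.servers if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy servers available")
        return healthy[next(self._counter) % len(healthy)]


# Hypothetical deployment: one server in each of two availability zones
lb = LoadBalancer([Server("app-az1"), Server("app-az2")])
down = {"app-az1"}  # simulate the first zone going offline
lb.run_health_checks(lambda server: server.name not in down)
print([lb.route().name for _ in range(3)])  # every request goes to app-az2
```

A real load balancer probes over the network on a schedule and usually requires several consecutive failures before marking a target unhealthy; the sketch collapses that into a single callable.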
So what about data? Because if my app is in two AZs but my database is in only one, I still have a problem.
Which is why I consider this one of the most common beginner mistakes. Your database needs to be highly available too. Managed database services offer multi-AZ deployments, where your primary database in one AZ is continuously replicated to a standby in a second AZ. If the primary fails, AWS automatically promotes the standby to primary. This failover happens in about 60 seconds, and your application reconnects. Compare 60 seconds with [inaudible] of downtime, data potentially lost, and manual recovery.
That's a huge difference.
It is enormous. And alongside that you have read replicas, additional copies of your database that serve read-only queries. This reduces load on your primary and adds another layer of resilience.
So now what about failures caused not by hardware but by sudden traffic, like when thousands of people hit your application at the same moment?
That's where autoscaling solves the problem. You define a rule: when CPU goes above 70%, automatically add two more servers. The cloud platform spins up new instances within minutes and distributes the load across them, and when the spike passes it scales back down. Without autoscaling you either over-provision, or under-provision and go down when traffic surges.
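The scaling rule mentioned here (above 70% CPU, add two servers) can be written as a small decision function. This is a hedged sketch: only the 70% threshold and the step of two come from the conversation, while the scale-in threshold and the minimum and maximum fleet sizes are assumptions added to make the example complete.

```python
def desired_instances(
    current: int,
    cpu_percent: float,
    *,
    scale_out_above: float = 70.0,  # from the episode
    step: int = 2,                  # from the episode: add two servers
    scale_in_below: float = 30.0,   # assumption, not stated in the episode
    minimum: int = 2,               # assumption: keep two for redundancy
    maximum: int = 10,              # assumption: cap on fleet size
) -> int:
    """Return the instance count the fleet should move to."""
    if cpu_percent > scale_out_above:
        return min(current + step, maximum)
    if cpu_percent < scale_in_below:
        return max(current - step, minimum)
    return current


print(desired_instances(2, 85.0))  # spike: scale out
print(desired_instances(4, 50.0))  # steady: no change
print(desired_instances(4, 10.0))  # spike has passed: scale back in
```

Managed autoscaling services evaluate rules like this against metrics averaged over a time window and apply cooldown periods, which the sketch leaves out.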
Every college portal at the end of semester is a perfect example of this going wrong every single time: 10,000 students try to submit assignments on deadline day and the portal collapses. With proper autoscaling, the system would scale up for the surge and scale back down when it is over. And finally, CDNs, and how they contribute to availability.
A content delivery network like AWS CloudFront has servers called edge locations distributed all over the world. When you access a website, instead of the request traveling to the origin server, which might be in Singapore or the US, it goes to the nearest edge location, maybe in Mumbai or [inaudible], which has a cached copy of the content. This drastically reduces latency, reduces load on the origin server, and also adds a layer of availability. If your origin server is down, the CDN can often still serve cached content.
So there is redundancy at every level: multi-AZ deployment, load balancers with health checks, database failover, autoscaling for traffic, and CDNs for global delivery. All these layers work together.
Each layer handles a different failure scenario. Together they cover hardware failure, software failure, traffic overload, disasters, and network issues. It is multiple safety nets.
But let's talk about real failures. What is a famous cloud outage that every student should know about?
The AWS outage in December 2011. It is essential knowledge. During a routine network operation, [inaudible] triggered a cascade of failures that took down or degraded a massive portion of AWS. The impact was enormous. Major services reported issues because they were all dependent on only one region.
Even a company that literally sells reliability has major outages.
No system is immune. The question isn't if your system will fail, it's when. What matters is how fast you recover and how much you lose. That is exactly what RTO and RPO measure. RTO, the recovery time objective, is the maximum acceptable time a system can be down after a failure. If your RTO is 1 hour, you must be back online within the hour. RPO, the recovery point objective, is the maximum acceptable data loss, measured in time.
If your RPO is 15 minutes, you need a backup every 15 minutes, so that in the worst case you only lose 15 minutes of data.
And different applications will have completely different targets?
Yes, dramatically different. A social media platform might allow 24 hours; losing a few [inaudible] is painful but not catastrophic. A payment processor might have an RTO of 15 minutes and need an RPO of seconds, because losing a transaction is unacceptable. Your architecture must be designed to actually meet these targets, not just [inaudible].
How do you know when something is going wrong before users start calling?
Proactive monitoring means everything. Tools like AWS CloudWatch continuously collect metrics across all your services: CPU, memory, error rates, and network traffic. You set alarms. If the error rate crosses 1%, page the on-call engineer. If it crosses 5%, trigger an automated response. The goal is to detect issues before users even notice something is wrong.
And some systems can fix themselves?
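The two alarm thresholds mentioned in this part of the conversation (1% and 5% error rate) can be expressed as a simple tiered check. This is an illustrative sketch only: the function and the action names are hypothetical and are not CloudWatch API calls.

```python
def alarm_action(errors: int, requests: int) -> str:
    """Map an observed error rate to the response tier described above."""
    if requests == 0:
        return "ok"  # assumption: no traffic means nothing to alarm on
    error_rate = errors / requests
    if error_rate > 0.05:
        return "trigger_automated_response"
    if error_rate > 0.01:
        return "page_on_call_engineer"
    return "ok"


print(alarm_action(errors=2, requests=1000))   # 0.2% error rate
print(alarm_action(errors=20, requests=1000))  # 2% error rate
print(alarm_action(errors=80, requests=1000))  # 8% error rate
```

Checking the higher threshold first keeps the tiers mutually exclusive, so a severe incident goes straight to the automated response instead of only paging a human.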
Self-healing systems are part of modern cloud architecture. It means automation that can automatically restart failed services, reroute traffic away from unhealthy instances, and scale up capacity when demand spikes, all without any human intervention. The engineer wakes up in the morning, checks the logs, and sees there was an issue at 3 a.m. and the system had already handled it.
So let's end with something super practical. For every student listening, what are the most important things to do to actually understand cloud availability, not just read about it?
Three things. Number one: get hands-on with a free tier. Create an AWS free tier account, a Google Cloud free account, or an Azure account to deploy for free. Create a virtual machine and put a load balancer in front of it. Set up a database with multi-AZ. Watch autoscaling work. Compared with reading about this, actually doing it is a completely different experience.
Getting your hands dirty on real infrastructure. What's number two?
Study for a foundational certification: AWS Cloud Practitioner or Azure Fundamentals. Both take one or two months of preparation and systematically cover SLAs, redundancy, and reliability. They look great on LinkedIn and give you a vocabulary that makes technical conversations much easier. And third, most important: when major cloud providers such as AWS or Google Cloud have outages, they publish detailed incident reports explaining what went wrong, what the impact was, and what they are changing. These are some of the most educational documents in software engineering. Reading 10 of them will give you more technical intuition than [inaudible].
Those three habits, built consistently, will take you further than most people in this field.
And one mindset shift on top of that: stop designing only for the happy path.
Every time you build something, ask yourself: what happens when this part fails? What happens when 10 times the expected traffic arrives at the same time? Build the answers to those questions into the design from day one.
So design for failure from the start, not as an afterthought. That's what separates systems that survive the real world from systems that only work in the demo. There is a famous theoretical concept that comes up in every distributed systems course: the CAP theorem. A lot of students hear about it, memorize it for the exam, and never really understand what it means in practice. Can you break it down?
Yes, and I promise it is simpler than it sounds. CAP stands for three properties that a distributed system might want. C is consistency, meaning every read gets the most recent and accurate data. A means availability, meaning every request gets a response, even during problems. And P means partition tolerance, meaning the system keeps working even when the network between some nodes goes down. The theorem says you can only fully guarantee two of these three at the same time.
Right.
And here's the key insight: network partitions are a given in any real-world system running over a network. Packets drop, links go down. So partition tolerance is basically non-negotiable; you have to design for it. Which means in practice, when a partition happens, you are really choosing between C and A. Do you stay consistent, or do you stay available?
What is a real example of a system that chooses consistency over availability?
Any financial or banking database. When you transfer money, the system absolutely cannot tell you the transfer succeeded if it failed. It cannot show you a stale balance. Accuracy is everything. So if there is a network issue and the system isn't sure the data is consistent, it will return an error rather than give you potentially wrong information. You might see "Transaction failed. Please try again." That's the system prioritizing consistency over availability.
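The banking example can be made concrete with a toy node that refuses to act when it cannot confirm its data is current. This is a sketch under stated assumptions: the majority-quorum check, the class, and the field names are illustrative choices that the episode does not describe, and real databases use far more involved replication protocols.

```python
class PartitionError(Exception):
    """Raised when the node cannot confirm its data is current."""


class ConsistentAccountNode:
    """Toy node that prefers returning an error over risking stale data."""

    def __init__(self, balance: int, total_replicas: int = 3):
        self.balance = balance
        self.total_replicas = total_replicas
        # Replicas this node can currently reach, including itself
        self.reachable_replicas = total_replicas

    def _has_quorum(self) -> bool:
        # Assumption: a strict majority is needed to confirm consistency
        return self.reachable_replicas > self.total_replicas // 2

    def transfer_out(self, amount: int) -> int:
        if not self._has_quorum():
            # Consistency over availability: fail instead of guessing
            raise PartitionError("Transaction failed. Please try again.")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        return self.balance


node = ConsistentAccountNode(balance=100)
print(node.transfer_out(30))  # all replicas reachable, transfer succeeds

node.reachable_replicas = 1   # simulate a network partition
try:
    node.transfer_out(10)
except PartitionError as error:
    print(error)
```

A system that chose availability instead would accept the transfer on the isolated node and reconcile later, which is exactly the risk a bank does not want to take.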
So that's a wrap on today's episode of Cloud Talk. If this helped you understand cloud availability, share it with your study group, your classmates, and your college pages. These concepts will show up in your exams, your internship interviews, and your first job. Until next time, keep learning, keep deploying, and keep those servers running. Thank you.
Thank you for the platform. To everyone listening: start small, stay consistent, break things in a safe environment, and never stop asking what happens when it fails. That question will make you a better engineer than anything else.