Raft is a distributed consensus algorithm that elects a leader to manage a replicated log across servers. Random delays before each election keep nodes from all campaigning at once. Writes take two rounds: the leader replicates the entry to follower logs, then commits it once a majority has acknowledged it. Periodic heartbeats let followers detect a failed leader and trigger a new election, so the nodes agree on state even when some of them fail.
Deep Dive
The hardest problem in databases: consensus
All right, it's time to get your rafts.
And no, I don't mean your boats. I mean the Raft consensus algorithm, which is what we're talking about. It's a newer, very good, and somewhat simplified consensus algorithm for distributed data systems.
It's known for being a little simpler than Paxos, which has been very popular for many years. I want to dive in and talk a little bit about how it works. We'll start by looking at a system where we have five nodes.
Usually, when you're doing consistency or consensus, just like with Paxos, it's good to have an odd number of nodes in your group: three, five, seven. We'll run with five here. Let's think about how this works by pretending that all of these servers just booted up and we're only now starting this data system. Currently, all of these nodes are running software that has the Raft algorithm in it, so they know they're all part of the same group, but they don't know anything else. No data has been inserted or anything like that.
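To make the odd-number advice concrete, here is a minimal sketch of the majority arithmetic. The function names `majority` and `tolerated_failures` are made up for illustration and are not part of any Raft library.

```python
def majority(group_size: int) -> int:
    """Smallest number of nodes that is more than half of the group."""
    return group_size // 2 + 1


def tolerated_failures(group_size: int) -> int:
    """Nodes that can be down while a majority can still be formed."""
    return group_size - majority(group_size)


if __name__ == "__main__":
    # Columns: group size, votes needed, failures tolerated.
    for size in (3, 4, 5, 6, 7):
        print(size, majority(size), tolerated_failures(size))
```

Going from three nodes to four raises the majority from two to three but still tolerates only one failure, which is why even group sizes are usually not worth the extra server.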
The first step in getting a Raft group to the point where it can start accepting requests, really write requests, is getting consensus on who the leader is.
The way this works is that any one of these nodes can propose, "I want to be the leader." But then it has to get elected by a majority of the group. One thing you might think is, "Okay, if all of these servers start up at the exact same time, why wouldn't they all simultaneously say, 'I want to be the leader'?"
They could all say this at the same time and then all start messaging each other saying, "Hey, vote for me as the new leader of this group."
The way Raft gets around this, both at the beginning and later on when leaders change or nodes fail, is with a random delay and random backoff, so the nodes don't all initiate this request at the same time. Each node has a little built-in random delay, like, "Hey, anytime there's an election and you think there's currently no leader, delay your request to become the leader by a random amount between 0 and 100 milliseconds." Then you're generally going to avoid most of the conflicts. So we're going to assume that node number one pops up first and says, "I want to be the leader." We're going to make it blue, and what it's going to do is ask all of the other nodes.
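A minimal sketch of that random delay, assuming the 0 to 100 millisecond window mentioned above. The helper names `pick_election_delays` and `first_candidate` are hypothetical, and a real implementation would run a timer on each node rather than compare delays in one place.

```python
import random


def pick_election_delays(node_ids, low_ms=0.0, high_ms=100.0, rng=None):
    """Give every node its own random wait before it may ask for votes."""
    rng = rng or random.Random()
    return {node_id: rng.uniform(low_ms, high_ms) for node_id in node_ids}


def first_candidate(delays):
    """The node whose delay expires first becomes the first candidate."""
    return min(delays, key=delays.get)


if __name__ == "__main__":
    delays = pick_election_delays([1, 2, 3, 4, 5])
    print("delays (ms):", delays)
    print("first to ask for votes:", first_candidate(delays))
```

Two nodes can still draw nearly identical delays, so this reduces simultaneous candidacies rather than ruling them out.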
It's going to say, "Please elect me. Please elect me. Please elect me. Please elect me." And all of those nodes, assuming they are healthy and have not already voted for somebody else or asked to become the leader themselves, are going to reply "Yes." It didn't actually need all of their votes; it just needed itself and at least two others to make a majority of the group. But now it's the leader.

This leader gets to stay the leader either indefinitely, until the node fails, or you can have time epochs, where once every minute or every 10 minutes or every hour you kick off a new epoch by force and cause a new leader election to happen. So you can do that on purpose, but you can also set it up so that for days, weeks, months, for however long this node lasts, it is the designated leader of the group.

But the group also needs periodic messages confirming that the leader is alive. You could set this up as polling, where each non-leader has a timer and every 100 milliseconds asks the leader, "Are you alive?" and the leader replies, "Yes." Or it could be the other way around, a push mechanism, where every 100 milliseconds or 1,000 milliseconds the leader pushes out a notice to all of the other nodes saying, "I'm still the leader. I'm still the leader. I'm still the leader." Another round of leader election won't happen until something goes wrong with the leader and it's no longer sending or responding to those messages. So the followers will have some kind of timeout, like, "Hey, if you haven't heard from the leader in over a second, or 2 seconds, or 5 seconds, you can go ahead and start a new voting round."

All of the requests go to the leader if we want to add something new to this database. So we'll use a similar example from before.
So we'll say insert 10 and Joe, a new key-value pair that we want to put in the database.
We still have these heartbeats going along in the background. But before the leader can respond to the client and say, "Yeah, I've got what you want," it needs a majority. Again, just like with Paxos, getting a majority vote is very important in all of these different steps. So it's going to take this 10, Joe, and it happens in two rounds. In the first round, it stores this in its own little log. You can think of each of these nodes as having a log of what has been inserted or changed.
Then it's going to send this around and say, "Hey, node number three, I have this and I want to insert it. You can go ahead and write it to disk, but we're not going to consider it committed yet. So don't tell the world that you have this yet."
So it sends this to be appended to the log of all of these other nodes. We'll draw arrows just to make the sequencing clear. The first round goes there and back and says, "Please prepare this." So this is like the prepare phase. They're all going to come back and say, "I have it prepared and I'm ready." Once the leader has heard back from a majority, it knows, "Okay, a majority of these nodes have written it to disk and they're ready to commit it." Then, as a separate phase, it sends a message out saying commit. You can't do those in one phase. You can't just send this out to everyone and say "write it and commit it," because you don't know how many nodes are going to respond, and you can't say commit until you know that enough nodes have it written on disk. So we do that whole round, and then there is a second round: we want to commit.
Commit, commit, commit, and they'll respond saying, "I've committed." This is essentially what happens every single time there's a write.
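The two rounds described above can be sketched roughly as follows. This is an in-memory illustration under simplifying assumptions, not a real Raft implementation: the `Node` class, its `reachable` flag, and the `replicate` function are hypothetical names, and there is no networking, disk, or term tracking.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    reachable: bool = True                    # False models a node that never answers
    log: list = field(default_factory=list)   # entries stored, possibly uncommitted
    committed: int = 0                        # how many log entries count as committed

    def append(self, entry) -> bool:
        """Round 1: store the entry and acknowledge it."""
        if not self.reachable:
            return False
        self.log.append(entry)
        return True

    def commit(self, upto: int) -> None:
        """Round 2: mark the first `upto` entries as committed."""
        if self.reachable:
            self.committed = min(upto, len(self.log))


def replicate(leader: Node, followers: list, entry) -> bool:
    """Return True once the entry is committed, False if no majority acknowledged it."""
    needed = (len(followers) + 1) // 2 + 1
    leader.log.append(entry)
    acks = 1 + sum(1 for f in followers if f.append(entry))  # leader counts itself
    if acks < needed:
        return False          # not enough copies yet, so do not say "commit"
    leader.committed = len(leader.log)
    for f in followers:
        f.commit(leader.committed)
    return True


if __name__ == "__main__":
    leader = Node(1)
    followers = [Node(i) for i in (2, 3, 4, 5)]
    print(replicate(leader, followers, (10, "Joe")))
```

The commit message is only sent after the acknowledgements are counted, which is the reason the transcript gives for not merging the two rounds.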
Ideally, every node gets each write in a timely manner. But as time goes on, that might not hold.
Say we wrote 10 and Joe, and then another message comes in, 15 and Jill.
There's another round where all of these get written over here. Now I'm going to make a network partition. A network partition just means something happened where these nodes can still communicate with each other, and these nodes can still communicate with each other, but there's no communication across the two groups. (All of these nodes can send messages directly to each other as well, even though I hadn't drawn that out.)
So now maybe a new insert comes in: insert 20 and Ben. It goes into this one's log, this one's log, and the leader's log. It gets a majority, so it can commit and send back a response. But this subset of nodes is not getting any of these updates, and that could go on for a minute or for an hour. And then at some point, hopefully, this partition goes away.
The network is restored.
But now we're in this interesting situation where not every node has the same values. The way Raft gets around this is that the next time a new insert comes in, say 21 and Bill, it goes to the leader and then out to all the followers. We're also tracking log position identifiers, so when the entry reaches a follower that was cut off, that follower will notice, "Hey, this skips ahead of the last log position I have on disk, so I must have a gap." Then there can be an extra round of communication where the follower goes back and says, essentially, "Backfill me on all the data I've missed." So the leader will also send over these two entries. That's some overhead, because it makes the whole thing take longer, but it can backfill all of the missing data.

It can also happen that a single server is part of multiple Raft groups. On a really large database, maybe you have 100 terabytes, and you chunk the data up into 10-gigabyte chunks. Then there might be 10,000 different Raft groups spread across your 100 servers, all sharing some subset of the resources. So a single server can be part of multiple Raft groups.

So there are all these complexities: nodes being able to fail, lots of things happening at the same time. When it actually comes time to do an insert, it's a relatively simple process, but that's partly because the complexity sits in the leader election phase that has to happen ahead of time and in the heartbeats that are constantly going in the background while all of this is happening.

Let's talk about one more case. We're at this point. Now what happens if a week, a month, a year later, my leader server fails?
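The gap detection and backfill described earlier for a follower coming back from a partition can be sketched like this. It follows the talk's description, where the follower notices the skipped position and asks for the missing entries; in the Raft paper the same repair happens through a consistency check on each append request, with the leader retrying from an earlier position. The `Leader` and `Follower` classes and the `read_range` method are hypothetical names for illustration.

```python
class Leader:
    def __init__(self):
        self.log = []

    def append(self, entry) -> int:
        """Store an entry and return its 1-based log position."""
        self.log.append(entry)
        return len(self.log)

    def read_range(self, first: int, last: int) -> list:
        """Entries at 1-based positions first..last, inclusive."""
        return self.log[first - 1:last]


class Follower:
    def __init__(self):
        self.log = []

    def receive(self, position: int, entry, leader: Leader) -> int:
        """Apply an entry sent by the leader; return how many entries were backfilled."""
        expected = len(self.log) + 1
        backfilled = 0
        if position > expected:
            # The leader skipped ahead of what we have on disk: fetch the gap first.
            missing = leader.read_range(expected, position - 1)
            self.log.extend(missing)
            backfilled = len(missing)
        if position == len(self.log) + 1:
            self.log.append(entry)
        return backfilled


if __name__ == "__main__":
    leader, follower = Leader(), Follower()
    for pair in [(10, "Joe"), (15, "Jill")]:
        follower.receive(leader.append(pair), pair, leader)
    leader.append((20, "Ben"))              # written while the follower was cut off
    pos = leader.append((21, "Bill"))       # first write after the network is restored
    print(follower.receive(pos, (21, "Bill"), leader), follower.log)
```

The extra `read_range` call is the additional round trip the talk mentions as overhead.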
At some point, all of these nodes are expecting, we'll call it every 500 milliseconds, an update from the leader: "I'm still the leader, I'm still the leader, I'm still the leader."
Now they're not getting those. Each one has a timeout threshold, maybe a full second, or 2 seconds, or 3 seconds, after which it will say, "Okay, I haven't heard from the leader in several time periods.
So now I'm going to assume the leader is dead, and we're going to start a new election." Because we have that random offset, they're not all going to decide at the exact same moment. One of these should win the race, and let's say this one wins. Now node four is going to say, "Okay, I'm the first one whose timer went off, and I'm going to go around for a new election." We go through the process again. It'll say, "Hey three, will you vote for me? Hey five, will you vote for me? Hey two, will you vote for me?" It will still send a message to one, but one will never reply. That's okay, because it'll still probably get back three yes votes, and then it will become the new leader.
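The failure detection and re-election above might look roughly like this in code. It is a simplified sketch: the `Voter` class, `leader_timed_out`, and `run_election` are hypothetical names, time is passed in as plain numbers instead of read from a clock, and the term numbers that real Raft uses to tell election rounds apart are left out.

```python
def leader_timed_out(last_heartbeat_s: float, now_s: float, timeout_s: float = 3.0) -> bool:
    """A follower gives up on the leader after `timeout_s` seconds of silence."""
    return now_s - last_heartbeat_s > timeout_s


class Voter:
    def __init__(self, node_id: int, alive: bool = True):
        self.node_id = node_id
        self.alive = alive
        self.voted_for = None   # at most one vote per election round

    def request_vote(self, candidate_id: int):
        """Return True or False, or None if this node is down and never replies."""
        if not self.alive:
            return None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate_id: int, nodes: dict) -> bool:
    """The candidate asks every node (itself included) and wins with a majority of yes votes."""
    needed = len(nodes) // 2 + 1
    votes = sum(1 for node in nodes.values() if node.request_vote(candidate_id) is True)
    return votes >= needed


if __name__ == "__main__":
    # Node 1 was the leader and has failed; node 4's timer goes off first.
    nodes = {i: Voter(i, alive=(i != 1)) for i in range(1, 6)}
    if leader_timed_out(last_heartbeat_s=0.0, now_s=3.5):
        print("node 4 elected:", run_election(4, nodes))
```

Because each voter remembers who it voted for, a second candidate asking in the same round is refused, which is what keeps two leaders from being elected at once.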
When it's doing commits, it will still have to make sure they're committed to a majority, so it'll have to make sure that three nodes have everything that gets inserted. But now we have a new leader. And then eventually, hopefully, the failed node gets repaired or replaced, and we get back to our full five.











