In distributed consensus systems like VSR (Viewstamped Replication), view numbers must be durably persisted to disk to ensure safety across restarts. This is achieved by atomically updating the view state, which is written multiple times (four copies in the implementation discussed here) so that a torn write or a corrupted sector cannot destroy it. The system maintains both an in-memory view and a durable view, with the durable version representing the state that will be recovered after a crash. This atomic persistence mechanism ensures that replicas never forget their view commitments, preventing safety violations where a replica might incorrectly accept messages from previous views after a restart.
Deep Dive
Hello everyone.
Toby, it's you again. Sorry.
>> Oh, hello everyone. Welcome to Iron Beetle. Today we are continuing our saga through pragmatics of consensus.
Um yeah, I will hand over to Alex to do a proper introduction.
>> Yeah, thank you. Thank you, Toby. And so sorry again for putting you on the spot, but like apparently like you know there is like the visual order of the scenes in my OBS and then there is like the actual order it turns uh things on. Uh so yeah it's kind of like you know when you use automation to automate something like you know you must ensure alignment otherwise uh bad things happen. Uh so yeah uh I wrote it all today as usual because it is Thursday 5 p.m.
And we are looking into pragmatics of consensus: how to go from a theoretical understanding of an algorithm to an actual production-ready implementation. And I can't believe it, Toby. Did we actually reach consensus on our t-shirts without any kind of coordination? That is perfect. Okay.
>> Perfect. Uh so uh we are doing this whole thing uh top to bottom. Uh so expect more episodes than just this one. Uh but last time we broke the ice on the most interesting part on the view change. Uh as a reminder uh and I guess uh we could maybe start uh drawing uh something here.
As a reminder, 90% of consensus is just replication, the heavy path where you make sure that all nodes in the cluster have the data, and it's really straightforward. It goes through our hash chain and that's it. The replication path works as long as you have a stable primary, and most of the time you do have a stable primary. Then there is the 10% of the time where something bad happens, where things break, and there you need to do a view change. This part here is consensus proper. It is crucially important to understand why it works to see why the whole thing is safe, but from a performance perspective that's actually not the common case.
Okay. So let's dive straight into the code here. Well, not into the code. Oh, okay. I'm trying to press a shortcut to increase the zoom level, and I'm very hesitant to do this, because every time I press a shortcut I reboot my computer or something else happens, never what I intend. Very bad shortcuts. Okay. So last time we did a hand-waving description of view change. The goal today is to start getting to the bottom of it, understanding every little practical detail. But before we do that, let's quickly review the hand-waving parts. So it starts with the state we have on a replica, which is... can I click here and edit something? Okay.
Uh oh no something bad happened. Uh beauty >> that's the magic of shortcuts.
>> Yeah. So the main state on the replica is a pipeline.
This is a short suffix of the log which contains prepares which are currently being replicated. They are not committed yet, and we do not yet know whether those prepares will or will not be committed.
That's kind of the server state. Then there is a log view, and log view tells which view this pipeline is consistent with. I wonder, do we have that... okay, we don't have that hash chain illustrated. Well, we could do it again. If you recall, the way the log grows in consensus is in this kind of "kindergartner draws a tree" pattern, where at some point during a view change the log can fork momentarily and we have two versions, like v1 and v2. The distance over which we could have divergence is bounded exactly by the length of our pipeline.
So every pipeline belongs to a particular view and log view tells like you know hey which view this pipeline is from and within a single view uh we have a very consistent uh understanding of what is the order of prepares because it is literally the order of prepares physically on the primary for this view.
Okay, then we have view. View is the view in which the replica currently is, and it tells us which prepares it can accept. Under normal circumstances view and log view match. When we enter a view change, we increment view but do not increment log view. So if our log view is 10 and our view is 20, that means: the last time we participated in consensus and processed some prepares we were in view 10, and we have promised to never participate in views 11, 12, 13, and so on up to 19. And we yet might participate in view 20.
So to actually participate in view 20, to advance our log view from 10 to 20, we need to receive a view message from the primary. The view message contains the canonical pipeline for the view, so the primary gets to decide what is actually included in this view. It sends that to us, and at that point we adjust our local pipeline to match what the primary says our pipeline should be, and we say, okay, this is my log view, and then we can start accepting prepares. So the rule is: we can accept a prepare which matches our current view and log view, because we know that this prepare comes from the primary of the current view, so it must be correct. If the view in the prepare is earlier, then we do not know a priori whether it belongs to our view or not, but we can check whether it is consistent with the hash chain, whether it hash-chains to the latest prepare we have, which we know is valid because we received it from the primary. If it matches, we add it. If it doesn't match, we keep it in our cache.
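A minimal Python sketch of the acceptance rule just described, not the actual replica code. The `Prepare` fields, the `classify_prepare` name, and the string results are hypothetical, and `trusted_successor` stands for a prepare at `op + 1` that the replica already knows to be valid.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prepare:
    op: int
    view: int
    checksum: str  # checksum of this prepare (hypothetical field)
    parent: str    # checksum of the prepare at op - 1 (hypothetical field)


def classify_prepare(prepare: Prepare, view: int, log_view: int,
                     trusted_successor: Optional[Prepare]) -> str:
    # Invariant from the discussion: log_view never exceeds view.
    assert log_view <= view
    # Case 1: the prepare comes from the primary of the view we have joined.
    if prepare.view == view and view == log_view:
        return "accept"
    # Case 2: an earlier-view prepare is accepted only if a prepare we already
    # trust at op + 1 hash-chains to it.
    if (prepare.view < view and trusted_successor is not None
            and trusted_successor.op == prepare.op + 1
            and trusted_successor.parent == prepare.checksum):
        return "accept"
    # Otherwise keep it around; it may be validated later.
    return "cache"
```

With a trusted head at op 101 whose parent checksum is `c100`, an op-100 prepare from view 3 is accepted only when its own checksum is `c100`.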
I have two questions.
>> Yeah.
>> So one question: that means there's strictly a requirement, basically an assert, that view is equal to or larger than log view, or the other way around, log view is smaller than or equal to view. Right? It cannot be that log view is higher than view.
>> Yes, that is correct.
And if if we get an request that is in the lower view number, as you mentioned, we don't know right if it's part of our branching tree. But do we do we know it if it's like 200 ops behind because of the pipeline length? How was this again?
Can we repeat this again? Because there was something we could do based on the pipeline length, or is it only when we are on the same view that we know it's committed?
>> So if we have a prepare which is like 200 ops before our current head, we know that this prepare is beyond the pipeline.
So let's say that that is op like 100 and like we are currently at like and we are currently at like 300. So uh we definitely know that op 100 is committed.
So the prepare which is at that position in the log is not only in the log in this view, but is also guaranteed to be at this position in every future view. So it is safe to process it and reply to the client. However, we do not a priori know if the prepare we just received is the right prepare for that op. Let's say we are currently in view number five and we receive a prepare for op 100 which says view equals 3. There are two possibilities. Either that prepare is actually what is supposed to be there in this view, that is, the primary also has this prepare, or it might be that the right prepare is op 100 from view 5, or from view 4, or maybe even from view 2. We need extra information there, which we get indirectly from hash chaining. And if we cannot get this from hash chaining, if we have some gaps, then we first go and repair those gaps by asking, like, the primary: hey, this is the latest prepare I do have, and this is the gap.
Could you please... well, let me correct that. We don't go to the primary. We go to any other replica and say: hey, this is the prepare I have, I'm sure that it belongs to the view, and this is the gap below it. Could you please tell me what is there? And we can trust them, because we can check that whatever they told us hash-chains to the data we do have.
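The reason any replica can be trusted for repair can be sketched as follows. This is an illustrative Python model, not the real wire format: the `Prepare` layout, the use of SHA-256, and the `verify_repairs` helper are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Prepare:
    op: int
    view: int
    parent: str  # checksum of the prepare at op - 1
    body: str

    def checksum(self) -> str:
        # Recomputed locally, so a peer cannot simply claim a checksum.
        data = f"{self.op}|{self.view}|{self.parent}|{self.body}".encode()
        return hashlib.sha256(data).hexdigest()


def verify_repairs(trusted: Prepare, offered: List[Prepare]) -> List[Prepare]:
    """Walk down from a trusted prepare; `offered` is ordered newest first."""
    accepted = []
    expected_checksum, expected_op = trusted.parent, trusted.op - 1
    for prepare in offered:
        if prepare.op != expected_op or prepare.checksum() != expected_checksum:
            break  # does not hash-chain to what we trust: stop here
        accepted.append(prepare)
        expected_checksum, expected_op = prepare.parent, prepare.op - 1
    return accepted
```

A forged prepare for the gap is rejected because its recomputed checksum cannot match the parent checksum stored in the prepare we already trust.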
But the other way around >> is secure, right? If it's from the same view that I am in and it's like 100 ops behind, then I should definitely know that it's committed, right? And that's the right thing. Or is that also unclear? Okay.
>> No, no, no. That that that is correct because you know that uh this is from the current primary.
>> Okay. Okay.
>> How does commitment happen? How does the pipeline move forward? The way the pipeline gets filled is by the primary accepting a request, assigning it an op number, a timestamp, and a parent checksum, and replicating it. The pipeline is drained when the earliest prepare in the pipeline is committed, and commits are driven by prepare oks. A prepare ok points at a prepare in the pipeline, redundantly contains the op (just so that we don't have to fetch the prepare to understand which op it is), and non-redundantly contains the view. In the prepare ok, the view might be larger than the view in the prepare. Smaller it cannot be, but larger it could be. So if you have a prepare ok with view 5 for a prepare from view 2, this basically means: hey, I'm a replica and I certify that in view 5 I indeed want to have this prepare from view 2 at this particular slot, because that's what the primary for view 5 told me.
So a prepare gets committed once, in a particular view, we achieve a quorum of prepare oks. And this is the semantics of the view number. What does this view number mean? It means that the replica will never ever send a prepare ok for any preceding view. We think about this as fault detection, isolation, and recovery. And by the FLP impossibility result,
we cannot guarantee that consensus moves forward. We cannot guarantee that the new primary makes progress. But we do have the ability to guarantee that the old primary cannot make progress anymore, by making a sufficiently large fraction of replicas in the cluster advance their view, thereby guaranteeing that they will not send prepare ok for the view which we want to isolate. So a view is a mechanism for isolating the old primary.
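The isolation argument can be checked with a toy model. The sketch below is in Python and makes simplifying assumptions: a three-replica cluster with a quorum of two, and a hypothetical `can_commit` helper that counts which replicas are still willing to send a prepare ok for a given primary's view.

```python
from typing import List


def can_commit(primary_view: int, replica_views: List[int], quorum: int) -> bool:
    # A replica never sends prepare ok for a view that precedes its own,
    # so only replicas with view <= primary_view can still acknowledge.
    oks = sum(1 for view in replica_views if view <= primary_view)
    return oks >= quorum


# Three replicas (including the primary itself), quorum of two: assumptions.
stable = [4, 4, 4]
after_view_change = [4, 5, 5]  # two replicas advanced to view 5

print(can_commit(4, stable, quorum=2))             # old primary still commits
print(can_commit(4, after_view_change, quorum=2))  # old primary is isolated
print(can_commit(5, after_view_change, quorum=2))  # new view can make progress
```

Note that the last line only shows that progress in view 5 is possible, not that it is guaranteed, which matches the point about FLP above.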
Okay. So once the old primary is isolated, replicas send join view messages to the prospective new primary. View and join view are more or less the same: they contain the pipeline. View is what the primary sends, so it says: this view should have this pipeline. Join view is what backups send to the primary, and it says: hey, I'm a backup, I have this pipeline, and I am in this view. Okay, so this actually should be, I think, log view. Another way to name log view is "pipeline view". Obviously every message also contains the view number, just so that the cluster quickly learns what the latest view is, because if we know that there is a later view we want to join it. But for the actual process of figuring out what is the right log for the new view, each replica sends its pipeline and its log view, that is, the view of that pipeline.
Uh so I
>> I have a good question, Alex.
>> Yeah.
>> Sorry for interrupting here, but it's precisely on this topic. The log view number is a single number, but could it be that the pipeline has mixed views, like when the branching is fast enough? I could imagine that, at least. So the highest view number I have in the pipeline is my log view?
>> No.
>> No? Okay.
>> No, it can be higher. It could be the case that the pipeline contains messages from views one to three and your log view is actually five.
This means that when the primary for view 5 was starting the view, it looked at all the join views it had, and there were some messages from views one, two, three. So it decided to take those messages into its new view, and it sent a view message for its view number five which contained those messages from views one to three. You could imagine that there are no new requests.
Uh so as a backup you received that message and you said okay so uh from primary in view number five I receive a pipeline which says I should have messages one to three. So I'm adding this messages to my pipeline.
And setting my log view to five. Some time later the primary will send a prepare for view five, and then you will add this prepare to the top of your pipeline. But until then you'll actually have only earlier views there. And if you think hard about this, you might have this absolutely brilliant idea of massively simplifying the algorithm by saying: okay, can we make the primary, when it sends a view message, prepare a fake request? You might as well pretend that there is some kind of no-operation request, and say that it is from view 5, so that you could completely get rid of this log view invariant and just say that the highest view in the pipeline is our log view. It's so, so beautiful, but it doesn't work, and the reason it doesn't work is that your pipeline is finite. Say, in view five, you add this fake prepare, but then view number five doesn't actually work out. So now replica number six wants to become a primary, but there is already this prepare for view five in there, so it adds a prepare for view six on top, and then this view change again doesn't work. That kind of means that you need to have at least as much space in your pipeline as you have replicas, so you get wasted space in the pipeline. And the way you might try to get this space back is by saying: if replica v inserted this fake message, then for replica v plus one, because the message is fake, we don't actually need to store the previous message, so we could just overwrite it. And that reduces to having this log view.
So maybe this is a nice way to think about it: log view is this extra bonus slot in the pipeline for a synthetic prepare which the primary injects on a view change. And because this is a synthetic prepare, we don't actually have to store anything there except the view number.
>> Okay.
>> Okay.
>> Okay. So how does view change work? Let me repeat the hand-waving before we get into details. The hand-waving is this: the primary receives a quorum of join views from replicas. A quorum of join views tells you two things. First, it tells you which pipeline was present on each replica, and second, crucially, because it also contains... okay, I give up. I actually need both view and log view in my join view, so I will go and add this. So it contains a pipeline and it contains a view, and the view is a commitment that this replica will never ever send a prepare ok for any preceding view. So as a primary you can look at the state of the pipelines of a quorum of replicas, and you can take into account their commitments to not send prepare oks. So you can figure out what could have been committed, and what could not have been and never will be committed, in those preceding views, and figure out a safe log to start with. And this is a subtlety: if you are trying to start view five and you have a join view from a replica which was in view 4, and you look at some slot in its pipeline, you can't actually say whether it was or was not committed in view 4. That is undetermined. But it is possible to err in only one direction. You can say: okay, this for sure cannot be committed, and this might or might not be committed, and that's why I'm taking this prepare, because if it was committed... we aren't going to enter split brain again. Details follow. Okay, now details. I want to get physical here. So for a short moment we will actually get distracted from view change.
So as I said, view is kind of a durable commitment to not send prepare ok, right? This means that you should never ever forget the view you've said you are at. Because if you tell the primary, hey primary, I am in view 5, so I will never ever accept anything from view 4, and then someone pulls the plug on you and you shut down, and then someone plugs you back in and you restart, and you're like, okay, I don't remember that I was in view 5, I receive a message from a replica in view 4 and I send prepare ok, then now you have possibly broken things. Maybe they shouldn't have plugged you back in. So we need to durably store this view number. And I want to touch a little bit on how we achieve durability here, because otherwise it's going to be hard to explain the precise semantics of our pipeline, because that depends on durability, because we also make the assumption that prepares can get corrupted on disk, for protocol-aware recovery (PAR). I don't want to first tell you a fake version which doesn't do PAR and then a real version which does PAR. I want to do the real thing. So let's look at the code.
>> Can you maybe say which file you are looking at?
>> That's replica.zig, transition to view change status.
Uh so somewhere here we should have this magical uh view equals view + one.
Uh somewhere we should have this >> in line 10,110.
>> Uh >> oh, that's somewhere else.
>> 1000 to 24. I guess I also uh which are you on? Uh is it um the latest? Okay, let yeah >> let let's both switch to uh main.
>> Okay.
>> Transition to view change status, and there's this line, 10,185.
Yeah, as you said. Okay. So this is the magical part, where we increase our view. But this is not how reality works. We actually need to make sure that this is durable, that this is stored on disk. And by the way, this is a place where we do deviate from the paper, because the paper version of VSR is a fully in-memory algorithm. It requires, I think, a majority of the cluster to be online not only to provide availability but also to provide safety: if your entire cluster shuts down and restarts, then VSR as in the paper will give you incorrect results. From the perspective of writing the paper this is actually brilliant, to show that you do not need persistence to implement consensus, that you could have, you know, hot potato consensus. But for a practical system, obviously, you do want to survive the Blade Runner style blackout of everything, and restart and not lose your data. Okay. So that means that we need to make this operation of incrementing the view atomic and persistent. And how do we do this? Okay, so this is kind of a side quest.
So this view number is somewhere on disk, and we want to implement the operation of updating it on disk. Let's say this is eight bytes of our view number, well, four bytes, right? You could imagine that you just do a write syscall and that goes to your hard drive, and if that hard drive is spinning rust, it starts moving these arms, this mechanics, to update those bytes on disk. And if the power goes out, in general you could have garbage there. You could die right in the middle of a particular bit, because nothing in physics is atomic. So we somehow need to cook atomicity out of non-atomic writes. Unless you want to make some extra assumptions about hardware, which is also fair game: SQLite, I believe, does rely on small disk writes being actually physically atomic. But the cool thing is you don't need to. So this is the idea, and the idea is very simple: basically, you write multiple times.
In TigerBeetle, the key information for consensus, like view and log view, is stored in the so-called superblock, and superblocks follow the same logic of hash chaining. When you start your cluster you read the superblock. When you want to update it, because for example you updated the view, you include a checksum of the previous superblock in your new superblock.
So that's one idea, hash chaining. The second idea is that you literally just write the superblock four times.
Uh okay, I'm running out of space. Okay.
Uh let me try to do a smart thing. I always like increase my zoom uh and then like I end up like with like 1,00 uh x zoom. Uh but actually let me try to make this thing smaller.
Okay.
And let's say, okay, this is not a smart thing. I now want to say that all of those will actually be v1. So this is our layout on disk at the start, and then when we want to update it, we prepare v2 and start overwriting those in place, one by one.
So we go and say, okay, let's do this one, and then this one, and so on until we have done four in a row. And this gives us atomicity, because the rule is essentially that we write those things one by one. So if we get a power cut in the middle, at most one of those four slots will get corrupted, and three slots will be alive. There are at most two versions in those three slots, so we can pick the version of the superblock which has the majority. So we'll always correctly read either the old superblock or the new superblock. And there is, I think, also an extra twist with the hash chain here: maybe we have a lot of corruption and only one copy of the old superblock, but in the copy of the new superblock we can verify the checksum that the previous one is correct, so we can still start. So that's basically the linchpin of correctness here: when we want to update view, we update it durably by updating our superblock and writing the superblock four times. So even if we crash in the middle, we will be able to restart either in the new superblock or in the old superblock. So that's good from the perspective of atomicity.
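The four-copy scheme can be simulated in a few lines. The following Python sketch is an illustration under stated assumptions, not the on-disk format: the JSON encoding, the `sequence` field, SHA-256, the in-memory `disk` list, and the tie-break toward the older version are all choices made for the example.

```python
import hashlib
import json
from collections import Counter
from typing import List, Optional

COPIES = 4


def encode(view: int, log_view: int, sequence: int, parent: str) -> bytes:
    # Each copy carries its own checksum plus the checksum of its predecessor.
    body = json.dumps({"view": view, "log_view": log_view,
                       "sequence": sequence, "parent": parent},
                      sort_keys=True).encode()
    return hashlib.sha256(body).hexdigest().encode() + b"\n" + body


def checksum_of(raw: bytes) -> str:
    return raw.split(b"\n", 1)[0].decode()


def decode(raw: bytes) -> Optional[dict]:
    try:
        digest, body = raw.split(b"\n", 1)
    except ValueError:
        return None
    if hashlib.sha256(body).hexdigest().encode() != digest:
        return None  # torn or corrupted copy
    return json.loads(body)


def write_superblock(disk: List[bytes], raw: bytes,
                     crash_at: Optional[int] = None) -> bool:
    # Copies are overwritten in place, strictly one after another.
    for slot in range(COPIES):
        if crash_at == slot:
            disk[slot] = raw[: len(raw) // 2]  # simulated torn write
            return False
        disk[slot] = raw  # a real system would fsync before the next slot
    return True


def recover(disk: List[bytes]) -> dict:
    valid = [sb for sb in map(decode, disk) if sb is not None]
    counts = Counter(sb["sequence"] for sb in valid)
    # Most copies wins; on a tie prefer the older version (an assumption).
    best = max(counts, key=lambda seq: (counts[seq], -seq))
    return next(sb for sb in valid if sb["sequence"] == best)
```

Usage: start with `disk = [old] * COPIES`, call `write_superblock(disk, new, crash_at=k)` for each `k`, and `recover(disk)` always yields either the complete old or the complete new superblock, never a mixture.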
The problem there is that we added atomicity, but at the cost of asynchrony: we need to do four fsyncs here, one after another. So we cannot just say, okay, update our view, and then consider that our view is updated. So we do... okay, I feel like I'm going way too fast here.
Toby, do you understand what I'm talking about, and do you see the problem
>> With asynchrony?
>> So, I understand what you what you're talking about. I understand that we need to if we want to do this, we need to do it sequentially.
>> Yeah. And that's not great. So, because that blocks the full pipeline basically.
So, you cannot do anything else until this is done.
>> Yeah. Yeah. Yeah. And I mean, the difficulty here is not even so much logical. I think if we just block and do nothing there, it's not going to be a big performance degradation for us, because you update views rarely.
The problem is just, how do you even program this thing? Because we do highly asynchronous, non-blocking IO everywhere. We don't have a primitive like "write and block", because that's not the IO API we are using here. So that's why in the code, if you look at the replica, we have view and log view, but also such things as log view durable and view durable.
So uh this is like uh confusing uh but I will explain it to you and it uh won't be uh confusing uh hopefully uh if I do a good job of it. So view is your in memory thing which you update immediately.
Once you update your view in memory, you create a new superblock and you start writing this new superblock to disk.
View durable is the value in the superblock you currently have. Let's say you are in view 10. Normally view and view durable match: they are both 10. Now you decide, okay, I'm going to update this to 11. First you change your in-memory view, synchronously, because it's in memory. Then you start writing the superblock with view 11. While you are writing the superblock with view 11, you are in a situation where view is 11 and view durable is 10.
Once that superblock finishes writing, your view durable becomes 11, because if it finished writing successfully, you know that if you crash and restart you will not revert to the previous superblock.
Yeah, so basically, and this is maybe the right way to explain it: view is your current view, right? View durable is the view you will have if you crash at this point and restart, because you restart without your memory. You read the view from disk, and that still has, or at least could have, the old view. Okay.
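The relationship between the two fields can be sketched as a tiny state machine. This Python model is illustrative only: the `Replica` class and its method names are hypothetical, and the asynchronous superblock write is reduced to an explicit completion call.

```python
class Replica:
    def __init__(self, view: int) -> None:
        self.view = view          # in-memory view, updated immediately
        self.view_durable = view  # view stored in the last fully written superblock

    def view_change(self, new_view: int) -> None:
        assert new_view > self.view
        self.view = new_view
        # A real replica would start the asynchronous superblock write here.

    def superblock_written(self, written_view: int) -> None:
        # Completion callback: the write for `written_view` is fully on disk.
        assert self.view_durable <= written_view <= self.view
        self.view_durable = written_view

    def crash_and_restart(self) -> "Replica":
        # Memory is lost; only what is durable survives.
        return Replica(self.view_durable)


replica = Replica(10)
replica.view_change(11)
print(replica.view, replica.view_durable)  # 11 10 while the write is in flight
print(replica.crash_and_restart().view)    # 10: a crash now forgets view 11
replica.superblock_written(11)
print(replica.crash_and_restart().view)    # 11: the commitment is now durable
```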
And this makes it very hard to actually write the code, because view is a commitment: you promise to not send prepare ok for old views.
Uh but if you have like these two views to keep track of like how how how do you even like program this right? uh and the answer is like super super glorious hack. So at the very edge of the system uh in the send message to replica uh we have somewhere here uh well we should have where where it is.
Okay. Yeah. So when we send a message... the way the code is written, we kind of imagine that view durable doesn't exist, that we are in this fairy-tale land where our memory is atomic and durable and persistent, and we just use our view. But then, right before we send a message, we check the message view against our view durable, and if it is the kind of message that would violate our invariant, we just say, okay, we aren't going to send this. And I'm actually
>> Do we drop it then, or do we
>> Yeah, yeah, we just drop it. I'm just confused why we have two different ifs here. So, okay, probably this is what I want to talk about: if this is a join view, view, or prepare... okay, then we avoid externalizing this.
Uh oh, right. Because we want to compare.
Well, okay. Uh let let me say it this way. Um this probably needs uh some like you know uh less handwaving and actually uh going into details.
But we'll maybe revisit this later. The basic idea is this: we update our durable view asynchronously, we write our code as if we updated our view synchronously, and we just drop dangerous messages at the edges of the system.
So okay, this is the first part of persistence: how do we make sure that we do not forget our views? That is by updating the superblock. And I want to note that, relatively speaking, updating the superblock is still an expensive operation, because we do those four writes sequentially. I'm not sure, but I think we actually could do them in parallel, and that's probably going to be... Oh, wow.
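A sketch of the "drop at the edge" idea in Python. The exact condition in the real send path is not spelled out in the discussion, so the rule used here (never externalize a message whose view is higher than the durable view) is an assumption, as are the `Message` type and the `flush_outbox` helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    command: str
    view: int


def flush_outbox(outbox: List[Message], view_durable: int) -> List[Message]:
    # Assumed rule: a message may leave the replica only if the view it
    # advertises is already durable. Anything else is silently dropped;
    # the protocol tolerates message loss, so the sender will retry later.
    return [message for message in outbox if message.view <= view_durable]


outbox = [Message("prepare_ok", 10), Message("join_view", 11)]
print(flush_outbox(outbox, view_durable=10))  # only the view-10 message leaves
print(flush_outbox(outbox, view_durable=11))  # both leave once 11 is durable
```

The benefit of this design is that the rest of the code can reason with the in-memory view alone, and only this one choke point has to know about durability.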
Huh.
Yeah, I think I think I think this actually should work. We should rather than writing them in place.
Uh, okay. This is beautiful. Uh, >> we should write them out of place. Uh yeah, maybe we should actually have like, you know, two >> that's actually called double buffer in databases and they do that.
>> It's because you always switch the buffer basically and you can make this model or something.
>> Yeah. Yeah. Yeah. I think >> that is really beautiful.
>> Yeah. Okay. I'm glad that someone else has invented this before, because that means that this idea is not completely stupid. But I think we actually don't do this. If we were to do that, then it would be correct for us to not do the fsyncs sequentially but rather do them in parallel. But anyway, it doesn't really matter that much, because we are going to update superblocks infrequently. Okay, Toby, am I correct that you have to go in 15 minutes?
>> Yes and I have a few questions. Uh >> okay that's that is perfect. That is perfect.
>> So if we would introduce this double buffering um well let me phrase it differently because we don't have that.
So we have these four versions, but to be able to recover from a torn write, only two would be enough, right? Basically we have v1 and v1, now we overwrite v1 with v2, and now we corrupt this block because we have a torn write, and so we still have one v1. But we have four because of our fault model, right? That we suspect that another sector is gone.
>> Yeah. Yeah. Yeah. So one is going to get corrupted, like, by definition, because this is just a normal torn write. Another will get corrupted because disks sometimes have sector errors: some bug might chew on your flash drive and now it contains garbage. And with two out of four, if they are both old or both new, we have very strong confirmation that this is legitimate, right? Because they match. If we have one old and one new, then what we do in this situation is recover into the old one, because we have more guarantees for the old one, because we also have a checksum in the new one which points to the old one. So that even with two corrupted blocks we still have more than one copy, just to
>> But if you recover into the old one, don't you break your promise?
>> No, because if you recover into the old one, that means that the old one wasn't overwritten by the new one, and that means that the corresponding fsync, which the previous instance of the process issued, hadn't finished, because if it had finished, the old copy would be overwritten or corrupted. So you know
>> Okay.
>> Yeah, that's that part. And for those of you following at home, the relevant source file here is... oh dear.
>> Yeah, we have a lot of questions.
>> I mean, okay, that's good, because I also saw a glimpse of a bunch of questions in the chat, but right after that I got logged out from my chat and I can no longer see it. So sorry, I'm not answering your questions, not because I'm vindictive or something, but simply because I am not seeing them. But Toby, okay, you have more questions.
So we'll hopefully cover something that people are interested in as well.
So one question I was wondering about, though I think this cannot happen because we have a bounded number of replicas: what if you view change so often, very fast, back to back, that somehow your IOPS run out? We have bounded IOPS, right, so you cannot write infinitely or have as many IOPS in flight as maybe the disk would support. So let's say we are bounded to eight and then you do a lot of view changes. Could it be that you drop some of them?
>> Probably not, right? Or maybe...
>> That is an excellent question. I actually haven't considered this, but I think it works this way. So, you have two copies of a superblock, right? Working and writing.
Honestly, okay, I don't actually remember what we do, because there could be a couple of things. One is that you could just ignore the message while your superblock is being updated, and you wait for the IOP to finish.
But another thing is that if you don't have an IOP, that means that the superblock is currently being written, and if you buffer one extra superblock, what you could do is update it in place. So let's say that you are in view 10 and now you are switching to view 11. So you start writing the superblock for view 11. Now you want to switch to view 12 while you are still writing view 11. Well, you could still go and say in memory, okay, my view is 12, and make a note that once I finish writing the superblock for view 11, I will go and write the superblock for view 12. And then you do a view change one more time, and now you say, okay, I have to wait until I write 11, and then I write 12, and then I write 13. And this creates a problem for a database with static memory allocation, because where are you going to store all those blocks, right? But now I actually do remember how it works. The idea is that you could coalesce superblock updates: if you know that your current view is already 13, you don't have to write 12, right? You could write 13 immediately. And I think that is actually what happens, so let me try to find this in replica. And it's good that you cannot see my screen very well, because even if I don't find it I could pretend that I did.
So let's actually trace this whole flow. The entry point is view durable update. We say, hey, we want to update our superblock. Here we go and say, hey, please, dear superblock, update yourself and call the view durable update callback once you're done. Okay, let me actually make this a little bit bigger. Please call this callback when you are done.
And this is the information which I want you to write, and there are a lot of different kinds of things here, but the main thing is view and log view. So we do this here and then we go to the view durable update callback, so after the superblock is written. And what we do here is we actually check whether the view in the superblock we have just written matches our current view, and the same for log view. And we say that, hey, if while I was writing the superblock my in-memory view advanced forward as well, let's do it again and go and write the superblock again. So Toby, this is a brilliant question, because I completely forgot that we had this kind of chain loading. But it's really kind of trivial how it works, because it really is a possibility that you are running out of IOPS, and in a database with static memory allocation it's not possible to buffer them all, but you could actually have this coalescing. And the way coalescing works, and that is what I was confused about, is that we don't actually prepare separately this in-memory superblock which we want to write. No, just after we have finished updating the superblock we check whether we need one more update, and if we need one more update we go and do this one more update. Okay, it's a brilliant question, thank you, Toby.
>> Does this destroy our checksums here? No, right? It shouldn't, because it's not a linear chain, because we coalesce things. But they should never have been written to disk. So we basically take the checksum of the...
>> ...oldest durable version.
>> Yeah. This is exactly what was confusing for me when I said, hey, I don't remember. Because I was thinking, if you physically have several versions of a block and physically coalesce them, then you need to do something with the checksums. But the trick here is that you don't actually materialize the superblock you want to write up until the point you have finished writing the previous superblock. So it's not like you schedule writing the superblock. You have just two operations: update the in-memory state, and sync the state with what's on disk. And I believe, actually, yeah, in this view durable update we have this early return. So basically the way you write it in the code is that you update the in-memory state and then you say, okay, sync the in-memory state with what's on disk, and inside this sync, inside view durable update, you say, okay, are we already updating something? And if we are, you say, okay, I'm not going to do anything, but I know that once that thing finishes, in a callback, it will call view durable update again.
>> Ah, so basically you create back pressure when you already have one in flight, and that's how we can make sure that we coalesce quite nicely, right? Because as long as something is in flight and we do view changes on top of that, we will basically update our in-memory state until the written view is committed.
>> Yeah. Dropping the outgoing messages in the meanwhile, but accepting incoming messages. So I think, can we actually go to transition to view change status? So actually maybe on view change those time out is a little bit on.
So previously I looked at the function transition from normal to view change to see how we update the view. The more pure way is when we are already view changing and we just update the view without updating our log view, and if you recall, that happens on the exit view messages. So I want to see what happens if we receive a quorum of exit view messages.
We go to transition to view change status, and transition to view change status works for normal status and for view change status.
Okay. And here, yeah, it is literally those two lines: we update the in-memory state and then we say, okay, now synchronize the in-memory state with the state on disk.
Or I guess a more precise way to phrase it would be: ensure that eventually the in-memory state is synchronized with the on-disk state. In the normal code path we just go and directly write a superblock, but if the superblock is already being written, we just exit early and rely on the fact that once we finish writing the superblock we will again chain-write it. Basically, you persistently repeat the operation until it succeeds.
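The coalescing flow traced above (update the in-memory view, early-return if a superblock write is already in flight, and re-check in the completion callback) can be sketched like this. It is a toy model under stated assumptions, not the replica's real code: `FakeSuperblock`, `complete_write` and the `updating` flag are hypothetical, and only the names `view_durable_update` and its callback echo what is said in the discussion.

```python
class FakeSuperblock:
    """Stand-in for the on-disk superblock: at most one write in flight."""

    def __init__(self):
        self.durable_view = 0
        self.writes = 0
        self._pending = None  # (view, callback) of the in-flight write

    def write(self, view, callback):
        assert self._pending is None, "only one superblock write at a time"
        self._pending = (view, callback)

    def complete_write(self):
        """Simulate the disk finishing the in-flight write."""
        view, callback = self._pending
        self._pending = None
        self.durable_view = view
        self.writes += 1
        callback(view)


class Replica:
    def __init__(self, superblock):
        self.superblock = superblock
        self.view = 0          # in-memory view
        self.updating = False  # is a superblock write in flight?

    def view_change(self, new_view):
        assert new_view > self.view
        self.view = new_view        # 1. update the in-memory state
        self.view_durable_update()  # 2. make sure the disk eventually catches up

    def view_durable_update(self):
        if self.updating:
            return  # early return: the callback re-checks once the write ends
        if self.view == self.superblock.durable_view:
            return  # nothing new to persist
        self.updating = True
        # The superblock to write is only materialized here, from current state.
        self.superblock.write(self.view, self.view_durable_update_callback)

    def view_durable_update_callback(self, written_view):
        self.updating = False
        if written_view < self.view:
            self.view_durable_update()  # the view advanced meanwhile: write again


superblock = FakeSuperblock()
replica = Replica(superblock)
replica.view_change(11)       # starts writing view 11
replica.view_change(12)       # write in flight: only the in-memory view moves
replica.view_change(13)       # same again
superblock.complete_write()   # view 11 is durable; the callback starts writing 13
superblock.complete_write()   # view 13 is durable
print(superblock.durable_view, superblock.writes)  # 13 2
```

View 12 is never written, and no queue of pending superblocks is needed, which is the property that matters under static memory allocation.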
Uh okay. Uh this is brilliant. Uh because I think I covered everything I wanted to cover in the first hour of our today's episode. I mean I prepared like five hours but it's fine to cut here.
>> You should answer the question in the chat maybe if you can still find out uh find a way to see them.
>> I mean I I I I don't see them. Uh, do you see them?
>> Um, I think you can see them if you go to YouTube like to our real live stream then it should batch them, right?
Because I need to leave, but I can send them to you.
>> Okay, let me actually >> Okay, that's that's that's perfect.
Right. I I don't I don't need you to answer questions uh in the chat. Uh >> uh.
All right. So, um, let me send it in Slack because that's maybe a bit nicer.
>> Okay. Okay. Okay. Okay. I actually actually actually actually see it on YouTube. Okay.
>> Okay.
>> So, Toby, you can leave, and I'm not sure, you will probably have to turn your camera off yourself, because, I mean, I can try, but this will probably end badly.
>> No, no, that's fine. I will do it.
>> Thank you so much. Don't start changing into your other clothes right on camera.
>> Okay.
>> Byebye. Thank you.
>> Okay. Bye, Toby. Uh see you in a week.
Uh okay. Uh okay, Toby. Oh, wow. Uh we even have uh more space um on our screen. Okay. Questions?
Question: why do we need four superblock copies? So the honest answer is I don't know, I need to check the source, but I think we actually could use a little bit less. But the principal thing here is that we also assume that the hard drive might return errors.
So what I mean is that we issue a write syscall and then we issue an fsync, and the write returns successfully, the fsync returns successfully, and then due to some hardware fault in the hard drive, when we read the data back it is corrupted. And this is kind of an unlikely thing to happen,
but it does happen, and we do want to handle those kinds of situations, because we already have multiple copies of data for availability, so we might as well use them to deal with a nearly Byzantine disk. And basically, if you have only two copies, that's definitely not enough, because let's say you overwrite two copies: you overwrote the first one successfully, and then you overwrite the second one, and then you have a power cycle. And when you restart, it turns out that the second copy is corrupted because you were powered off in the middle of it, and the first copy is corrupted because there is a one-in-a-million hardware fault, and you do want to be more resilient than one in a million. And yeah, three might be... I mean, do we actually have this in the superblock format? I think there were some comments. Or how much... well, okay, maybe we actually have more than four. I'm now wondering what the actual value of the parameter here is. Copies, cluster superblock copies... well, no, we have four. But the high-level answer here is that updating the superblock is costly and you just want to do this as rarely as possible. So it's not really worth it to super-optimize it. But maybe three is enough. That's a good question.
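To make the counting argument concrete, here is a small enumeration under an assumed fault model taken from the answer above: the copies are overwritten one after another, power is lost while one copy is being written (that copy is torn), and one unrelated sector error hits an arbitrary copy. The function name and the model are illustrative assumptions, not something checked against the source.

```python
def worst_case_valid_copies(copies: int) -> int:
    """Fewest readable copies left after one torn write plus one sector error."""
    worst = copies
    for torn in range(copies):        # copy being written when power was lost
        for faulty in range(copies):  # copy hit by an unrelated sector error
            corrupt = {torn, faulty}  # both faults may land on the same copy
            worst = min(worst, copies - len(corrupt))
    return worst


for n in (2, 3, 4):
    print(n, worst_case_valid_copies(n))
```

Under this model two copies can leave nothing readable, three always leave at least one readable copy, and four always leave at least two, so there is still a second copy to compare against.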
But then if it's three, it's not a power of two and it's just ugly. Next question: are we using io_uring or fsync? We are not using fsync as a separate syscall. Actually, this is kind of a simplification: the semantics we currently rely on is that we treat every single write as durable. So basically, when we write something to disk in our storage, write sectors, the contract of write sectors is that by the time we call this callback, every write is durable.
So it is not like we schedule a bunch of writes and then we do an fsync, and we do this mostly for simplicity. Potentially you could achieve faster performance if you batch your writes and then do one fsync,
but you could actually achieve similar results by being more careful with scheduling your writes in user space before you hit the kernel. The way we actually achieve this in io_uring, I don't remember exactly, you need to pass certain flags for opening your data file and for individual writes.
One is to disable using the operating system page cache, and another one is to instruct the disk itself to actually flush data. But the details are in our source code, you could find them.
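As a rough illustration of the two kinds of flags described above, the sketch below opens a file with the POSIX `O_DSYNC` flag (each write returns only once the data is on stable storage) and, on Linux, `O_DIRECT` (bypass the page cache). Mapping the answer onto exactly these two flags is an assumption here, the helper names and the 4096-byte sector size are made up for the example, and it uses plain `pwrite` rather than io_uring.

```python
import mmap
import os

SECTOR_SIZE = 4096  # assumed sector size for this example


def durable_open_flags(use_direct: bool = True) -> int:
    """Flags for a data file where every completed write should be durable."""
    flags = os.O_RDWR | os.O_CREAT
    flags |= getattr(os, "O_DSYNC", 0)       # write completes only when durable
    if use_direct:
        flags |= getattr(os, "O_DIRECT", 0)  # Linux only: skip the page cache
    return flags


def write_sector(fd: int, offset: int, data: bytes) -> None:
    """Write one zero-padded sector; O_DIRECT needs aligned offset and buffer."""
    assert offset % SECTOR_SIZE == 0 and len(data) <= SECTOR_SIZE
    buffer = mmap.mmap(-1, SECTOR_SIZE)  # anonymous mapping: page-aligned, zeroed
    buffer[: len(data)] = data
    written = os.pwrite(fd, buffer, offset)
    buffer.close()
    assert written == SECTOR_SIZE
```

A caller would do something like `fd = os.open(path, durable_open_flags(), 0o644)` followed by `write_sector(fd, 0, payload)`. Some filesystems reject `O_DIRECT` at open time, so passing `use_direct=False` is the fallback in this sketch.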
Okay. Next question: you can also just buffer messages until the durable view increases beyond the prepare. Yeah.
That's true. When we do that thing where we drop messages if the view doesn't match, we could buffer them. The problem for us is that we do static memory allocation. So if we buffer this message, that means that we need to bump our limit of messages, we need to allocate more memory, and it's not necessarily true that buffering is the best use of memory. But I mean, it's not necessarily true that it is the worst use of memory either; maybe this is a smart thing to do. Honestly, we haven't checked. A lot of questions, this is super helpful. Okay, but maybe that's it. So you don't do that? Yes, exactly. Okay. So sorry for not answering the questions immediately, because my streaming tools failed me. Okay. O_DIRECT, O_DSYNC.
Okay, one more question I forgot to answer: you could tolerate f full disk failures without data loss? I mean, you could tolerate f full disk failures by virtue of just tolerating f whole machine failures.
That's not the interesting part.
The interesting part is that we could tolerate more failures as long as they are not full disk failures. So, okay, this is what you can do: you could corrupt one sector on every replica in a cluster and we will survive it. We'll remain available, we'll remain safe, we'll remain correct. And, well, I'm not going to stand by my words right now, because I need to think very carefully about the actual invariant which we guarantee there, but basically it's roughly: if a particular sector on disk is found on at least some replicas, we are guaranteed to repair this thing. So you could take a hard drive and shoot at it with a shotgun, and you do this for every replica, and we recover as long as we are not super unlucky such that there is one location where we got a shotgun hit on every single replica.
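The rough invariant stated above (a sector can be repaired as long as at least one replica still has an intact copy of it) can be written down as a tiny check. This is only a model of that informal statement, with sectors reduced to integer indices and a hypothetical function name; the speaker explicitly does not commit to the exact invariant.

```python
def unrepairable_sectors(corrupt_by_replica):
    """Sectors corrupt on every replica, so no peer can supply a good copy."""
    return set.intersection(*(set(sectors) for sectors in corrupt_by_replica))


# One corrupt sector per replica, all at different locations: fully repairable.
print(unrepairable_sectors([{0}, {1}, {2}, {3}, {4}, {5}]))

# The unlucky case: sector 7 was hit on every single replica.
print(unrepairable_sectors([{7, 1}, {7}, {7, 3}, {7}, {2, 7}, {7}]))
```

The first call yields the empty set and the second yields only sector 7, which matches the shotgun example: damage everywhere is survivable unless the same location is lost on all replicas.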
Okay. The pointer question was in reference to the double-buffered superblock. Okay, yeah. The question is, when we double buffer, how do we figure out which buffer is the double? I think when we read, we don't need this: we essentially just read eight copies and pick whatever the latest view is there. After that, we have recovered either in view v or v plus one, but once we know the view which we recovered from, the other one becomes the double buffer. So again, depending on where exactly the crash happens and depending on faults, you might restart into the newer or the older view, and one of them might be double buffered. And there is even some fundamental ambiguity here: imagine that you crashed while you were writing a new superblock. In this situation it is indeterminate whether you are actually in the new view or in the old view. In this situation it is valid to start either in the old view or in the new view. And maybe we could even detect this, and, I mean, this is going to be a stupid idea, but just as a principle, it could be the case that you realize that you crashed in the middle, so either one is fine, and you literally toss a coin and go to the new view or the old view. And then the other one is the double buffer. More questions? Okay, I love the questions.
Okay. Can we tolerate this: we have f crashes without data loss and we have another f full disk failures.
Can we tolerate this? So essentially you have six replicas. Well, first of all, it's not really 2f plus one for us because we use flexible quorums, so with six replicas you need four replicas to do a view change but you only need three to do a prepare, which is probably something we'll cover next. But the question is, let's say we have three replicas completely absent from the cluster. We have four, so we can view change, but then three replicas lose their disks. So I think if you fully lose the disk, that actually won't work, we won't survive that, because there is this very... well, mostly because of the log. So let me say it that way: if the log is intact, okay, if the first message in the log on each replica is not corrupted, then it's fine. Everything else we can repair as long as there is one copy. The problem there is that if absolutely everything is corrupted on a replica, then it actually needs... well, okay, I feel like this might be getting a little bit too in the weeds, and maybe it's better to postpone those more detailed questions for the next episode. I mean, I love the questions, but it's kind of shortguarded for me. So let's do this: let me try to answer this one and it's going to be the last, also because we are out of time. But thank you. Please keep the questions coming.
Keep the questions coming in the meantime, and I mean, you could even email them to me. I would be happy to prepare and cover them in the episode. But okay.
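The six-replica numbers mentioned above (four replicas for a view change, three for a prepare) work because every view-change quorum still overlaps every replication quorum, which is the usual condition for flexible quorums. The brute-force check below verifies that overlap; the function and parameter names are chosen for the example and are not taken from the source.

```python
from itertools import combinations


def quorums_always_intersect(replica_count, quorum_replication, quorum_view_change):
    """True if every replication quorum shares a replica with every view-change quorum."""
    replicas = range(replica_count)
    return all(
        set(a) & set(b)
        for a in combinations(replicas, quorum_replication)
        for b in combinations(replicas, quorum_view_change)
    )


print(quorums_always_intersect(6, 3, 4))  # True: 3 + 4 > 6
print(quorums_always_intersect(6, 3, 3))  # False: two disjoint halves exist
```

The overlap is what lets a view change of four replicas always find at least one replica that took part in any prepare acknowledged by three.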
So, what we can and what we cannot survive. Superblock: if you lose all four copies of a superblock, then the replica will not start, because we have this invariant that you cannot forget your view. So that kind of failure, in the superblock zone, we need to have a valid superblock zone. If a replica doesn't have a superblock, then it kind of permanently leaves the cluster; you really need operator involvement to manually make a good job. But okay, assume that you do have superblocks. What else can you not lose? Another thing which might be problematic to lose is the topmost message in your log, and really for this topic we need a whole separate episode with Caitana, because there is a subtlety there, and this is something we'll touch on next episode. But basically the superblock recovery is easy because we just copy stuff four times. For the write-ahead log, you don't actually want to write a prepare four times, because a prepare is large: a prepare is like megabytes, a superblock is just kilobytes. So we only write it once, and there we usually can figure out when we have corruption in the log, but there are cases where it's genuinely ambiguous: you look at your log and you say, okay, I might have this and I might not have it, and you don't know what was the latest message you accepted. So the hash chain kind of breaks, and if that happens, then the cluster needs to have a primary. So in a situation where you have a six-replica cluster, three are permanently partitioned,
two have their disks absolutely corrupted, except for the superblock zone, where they do have at least two copies, and then there is one replica with an intact disk, and that replica with the intact disk is currently the primary, then it's fine. It will help everyone else to recover, slowly; it will take a lot of time to transfer all the data, but it will be done. But you need to find a primary. So if you are in this situation and then you also shut down and restart all four machines, then your cluster will be unavailable, because it won't be able to elect a primary: although four replicas will be in the cluster, if they also have all their logs corrupted, they will not be able to figure out what was the head message, and that makes view change unsafe. And I mean, actually I think we could maybe improve that somewhat, but the general thing will stand. Otherwise, if on those four replicas you have a superblock and you have just a tiny bit of the log available which allows the replica to confidently say, hey, this was the last message I accepted, that's again going to be fine; even if you power cycle them, we'll recover. I'm very hand-wavy here with just how much of a log you need to have, and for that I really need to do a deep dive into our journal to actually understand what we are doing here, because that's very subtle logic there. But it's not like if there is a corruption in the log it's game over.
You need to have very particular patterns of corruption in very particular slots of the log to actually make it impossible for us to safely understand our head message, to safely participate in a view change. Okay, again, thank you for the questions, please keep them coming, but I will apply back pressure here so that you could more evenly spread them across all the episodes. Again, thank you very much, and see you all, even if you don't ask questions, but please ask them, next week at 5 p.m., live. Goodbye.