The bus factor measures how many people could be removed from a project before the entire system collapses; a bus factor of one means a single person holds all critical knowledge, creating severe vulnerability. To achieve resilience, organizations should eliminate single points of failure through distributed architecture, implement comprehensive observability for system monitoring, use broadcasting data communication for portability, establish granular access controls, maintain thorough documentation, and employ version control with snapshotting for rapid recovery. These practices transform a bus factor of one into a much higher number, enabling teams to rapidly respond to unexpected situations like the 24-hour deployment crisis described.
Applying The Bus Factor - How FlowFuse Deployed Stack Resilience in Real Life #busfactor #resilience
Hello everyone. My name is Christopher Sandal. I'm the developer relations advocate here at FlowFuse. Thank you so much for joining us today. We're talking about applying the bus factor: how FlowFuse deployed stack resilience in real life. This is kind of a wild topic. There's a lot to go over, and it's sort of an unreal story, actually. A little bit about me: I already introduced myself as the developer relations advocate at FlowFuse, but that's not all that I do. In fact, I joked yesterday that I kind of do a little bit of everything: anything I can to help developers understand the product, to understand FlowFuse, to understand what we can do. As part of that, when we do trade shows and demonstrations, I'm the person who builds them, which is going to come into play relatively shortly.
So, I do a little bit of everything.
The first thing we're going to talk about today is what the bus factor actually is. The bus factor is interesting because it originated in software development but has now found its way into other industries, and it's quite relevant to a lot of the work people do in non-software engineering spaces. In the industrial space it's actually weirdly relevant. We'll get into why in a minute. So we'll talk about what the bus factor is, and then we'll talk about what I am terming "the event". It sounds very ominous. It was very ominous for me, and for a lot of people. We'll talk about the event and what happened there, our response to it, and what the demonstration looked like both before and after the event. Then we'll get into some of the factors for resilience, and we'll go through these point by point.
I'm going to give you some demonstration of the code and a look at how this was actually deployed.
So, really quickly, what is the bus factor? As I said, this originated with software development. I think the best way to explain it is this very famous XKCD comic. I reference XKCD pretty much daily, and I feel like that's pretty common in this industry. "All modern digital infrastructure": that's all these stacked blocks. And then way, way at the bottom, "a project some random person in Nebraska has been thanklessly maintaining since 2003." That is the software story of it. You have this massive stack, this really complicated set of systems and software, and it's all basically being held up by maybe two or three pieces.
This is very software-centric, but when it comes to industrial and to development, the bus factor actually gets a little bit simpler than this. It's a pithy joke, said quite often, sort of tongue in cheek. The question is: how many people could get hit by a bus before your entire stack just collapses? Sorry to anybody who has been hit by a bus, but it's just the pithy joke. On the industrial side, the question is how many people could be removed from your project: removed because they have left the team, because they're not available, or because the deployment is not local to that person. Any number of things could result in this critical person not being able to execute in the way that they need to.
And when we talk about the bus factor, the bus factor is actually a number. Again, it is the number of people that could be pulled away from that stack before it collapses. If one person holds the keys, if they hold everything (all the hardware, the security, the logins), you have a bus factor of one. And a bus factor of one is really scary. It basically means that person is critical to the entire system being alive, being workable, being deployable. Great for job security for that one person, terrible for the actual stack, terrible for the deployment. When you have a higher bus factor, that means more people have the distributed knowledge. They have an understanding, things are documented, things are recoverable. It is not just one person in control of the keys to the kingdom; it is a whole team of people.
Now, one thing that I want to call out very explicitly here is that the bus factor isn't a measure of headcount; it's a measure of resilience. When you get into the industrial space specifically, the resilience piece of this is extremely important, and it is its own standalone thing. So when we say that there is a bus factor of one, what we're really saying is that there's one OT engineer, one team manager, one person whom the entire stack relies on being present, being accessible, being able to be part of the deployment. For that reason, in OT environments especially, this bus factor problem is often hiding in plain sight. It's something that you don't really know about until the problem rears its head.
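To make "the bus factor is a number" concrete, here is a minimal sketch that computes it from a map of critical components to the people who can operate each one. The component and people names are hypothetical, and the definition used (the size of the smallest group whose absence leaves some component with nobody) is one common reading of the term, not something taken from the talk or from FlowFuse.

```python
from typing import Dict, Set


def bus_factor(coverage: Dict[str, Set[str]]) -> int:
    """Smallest number of people whose absence leaves some component uncovered.

    That is simply the size of the smallest owner set: removing everyone
    who can operate the least-covered component is enough to stall it.
    """
    if not coverage:
        raise ValueError("coverage map is empty")
    return min(len(owners) for owners in coverage.values())


def single_points_of_failure(coverage: Dict[str, Set[str]]) -> Dict[str, str]:
    """Components that exactly one person can operate, mapped to that person."""
    return {
        component: next(iter(owners))
        for component, owners in coverage.items()
        if len(owners) == 1
    }


# Hypothetical coverage maps (names are illustrative only).
before = {
    "hardware rack": {"chris"},
    "broker config": {"chris"},
    "hosted instances": {"chris", "sam"},
}
after = {
    "hardware rack": {"chris", "sam"},
    "broker config": {"chris", "sam", "lee"},
    "hosted instances": {"chris", "sam"},
}

print(bus_factor(before), single_points_of_failure(before))
print(bus_factor(after), single_points_of_failure(after))
```

Note that adding people does not change the result unless they cover the least-covered component, which is one way to see why the number tracks resilience rather than headcount.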
You don't think it's a problem, and then there's some sort of accessibility problem, or the person is sick today, or we didn't know that they have a YubiKey or something for some system that we weren't aware this person had control of. Then when they duck out for whatever reason, it becomes a critical problem.
So let's talk about the event. What actually caused this bus factor thing to be exposed? I'll give you the short story. I was in Japan. I had a couple of other things to do beforehand, so the plan was that I was going to fly to Japan, spend a week or so there, then fly home, which for me is San Francisco, and then fly from San Francisco to Germany. It was a tight turnaround, but the layover was fair, the systems were fair, everything made sense. Because I was still doing some work on the servers, and because I had the entire stack that I wanted to deploy, I carried all of those devices with me. So not only did I have the server stack, which I'll show you in a moment, I also had things like groov RIO devices, buttons, andon systems, so many different connected pieces. The plan was: fly home, fly real quick to Germany, and get it set up.
Of course, anyone in Europe who is familiar with Germany over the past couple of months knows that there was a massive strike. And when I say massive strike, it was like a nationwide strike. The strike didn't just affect simple transit. It was everything. Lufthansa had a strike.
Lufthansa City Airlines, I think it's called, had its own. There were multiple groups that basically all struck at the same time, and it meant that I was not flying to Germany. That flight got cancelled, and I woke up at 4 in the morning to this email, heart slamming through my chest, as you can imagine. And it started getting worse. It was not just those flights that stopped; then it was flights from Japan to the Middle East, and then an additional chain of events that meant there was no way I was getting to Germany. There was kind of no way I was getting to San Francisco, either. So I'm stranded. But not only was I stranded, all the hardware was stranded as well. This is like the ultimate version of the bus factor. It wasn't just that I wasn't present to do the deployment. It was that I had the deployment, which is pretty intense. Coming out of that, we had to talk about how to meet this need to deploy the server.
And when I say deploy the server, I'm talking about a series of local LLM systems, computer vision, the groov RIOs, multiple server stacks, virtualized databases, bare-metal databases: systems that required a person to actually be there working on the thing.
So we had to quickly figure out: we have this thing, how do we actually deploy it? And how do we deploy it in a way that is sensible, that is maintainable, and that lets us actually show FlowFuse at Hanover?
So the answer was: we need to rapidly deploy these new devices. We need to run the servers from Japan. We need to spin up multiple servers that took months and months to build, and we need to deploy them in about 24 hours. And we actually managed to do that. A lot of today's conversation is about the lessons we learned from that and how they level up to your stack resilience.
I will say this is highly dependent on a few things. The first is the architectural choices that were made early in the process. I'll get into those architectural choices because I think they are actually the most impactful of the things we're going to talk about today. The second piece was the FlowFuse platform benefits: the things FlowFuse has that make it capable of facilitating something like this. I made this joke in a meeting right before this one: it sounds like a FlowFuse sales pitch, but this is actually critical platform functionality, and you really need to think about how it levels up to your own architecture and your own systems. The last piece that made this work was the team expertise and the breaking of tribal knowledge.
I put this at the bottom here, and I want to double down on it and point out really clearly that siloing is the death of your architecture. When I say siloing, I'm not just talking about team silos, although team silos are also quite bad. I'm talking about knowledge silos. Consider the people who were on site: we had salespeople, we had a few marketing people, we had one or two product people, and, like, one person who was borderline an engineer. If we had been completely siloed, to the point where the marketing team was the only one who understood this, and if we had only ever developed the code internally and never shared anything out, this would have been a very different story. But the ability to communicate across teams and then make that data portable, make the systems portable, is really what saved this. It was the key to unlocking the entire thing.
So with that said, let's talk about the factors for resilience. Before I get into the specific factors, I do want to show you the actual stack that we're working with. We'll start over here with the actual FlowFuse platform. I'll give you a brief rundown of what our stack looked like, and then I'll show you the physical devices behind all of this.
We had the hosted instances. The hosted instances were the easiest part of this to get working and to get stable. We had a handful of things going on here. We have a top-level instance, which is basically the top-level agentic stuff, the conversational pieces, the things that you would experience as the front end of the entire stack. From here we have a couple of other systems. We had a couple of test benches that needed to be created so that we could get in, troubleshoot, and do iterative development. We also had a couple of bridges to other systems, so if we want to do ERP/MES stuff, we have a bridge for that.
We have a booth generation instance, which was just doing things like "hey, walk up, give us your information if you'd like to know more," and then that gets stored locally. We also had, as I said, the site management, and site management really was the multi-site management of the entire factory stack.
As for what this stack looks like (I'll show you the actual simulation in a moment), it was all based around a story of multiple data types. The storyline was: if you're doing something like gene therapy, you have to have a bunch of systems working together to make it work, and all of those systems have different data types. In the Fabashi site (let me close this back; there you go) we were handling completely different things versus what we were handling in, say, Hanover. Hanover was a biomed system. It was things like the cold chain temperature, so doing cryofreezing. It was medical mixing, temperature regulation, humidity regulation, a lot of different things. This is a completely different data stack compared to something like the San Francisco site. The San Francisco and Fabashi sites were working in concert to make centrifuge devices. One was doing wafer cutting for memory modules that would then be assembled into RAM, and the RAM assembly is part of the entire centrifuge device. So we had a lot of systems at play here. Those were all, as I said, the hosted pieces, the cloud stuff.
In order to make all this work, we also had a bunch of remote devices, and this is where the story gets very complicated. Not only did we have some of the handler things, so bridging to Odoo or bridging to QuestDB, which I'll show you in a minute, we also had things like a camera server.
The camera server was basically taking a live webcam, taking snapshots of that image, and then uploading them through FlowFuse for agentic processing: what am I looking at, what's the object? The idea is that if you have a biomedical service, you might need somebody to constantly be next to a certain machine before the process can begin, and the camera server is a perfect way of checking that. Then we have things like the factory sim that handled all the data. You can see we just have a whole bunch of different things going on.
These devices were all local to me. So you start to get the sense of it: the hosted stuff is not impacted by this at all, right? It doesn't require me to be on site. But when I'm holding on to the physical hardware, that all of a sudden becomes a much more critical problem.
That being said, I'm going to point out some of the hardware here. Right behind me is a 10-inch rack (let me get my thumb oriented), and inside it we have some specific devices. We have a 2018 Mac Mini and then another 2018 Mac Mini, one with 64 gigs of RAM, the other with 32. One of them was doing all of the local LLM and AI processing, all that kind of stuff. The other was doing all of our database management and a bunch of our FlowFuse stuff. Then we also had this device, a Raspberry Pi, that was hosting a lot of the computer vision conversion from plain video feed over to image generation. And then we had a whole host of other devices. Underneath this lid that I'm pointing to right now we had some OnLogic devices, which were meant to drive the displays of the demo. Below this we also had two groov RIOs with a start/stop button, a standard green/red button, and an andon stack.
So you can see that although the cloud stack is really easy to manage, something we can just spin up and be good to go, the remote stack is much more complicated. And that's what was stuck with me. So the question was: is there a way that we can get all this spun up very, very quickly, and what does that look like? The answer was yes, we could get that spun up really quickly, and there are a few factors that made this possible.
The first factor is eliminating your single points of failure. When we were developing this, there was some sort of spidey sense in the back of my head that said, "Hey, it's a little concerning to fly with all of this." There was something about it that was worrisome. So as we were building all this out, the question that I kept coming back to was: what happens if one of these devices breaks?
In my planning, the thing I was most worried about was the groov RIO. If I'm going to put two groov RIO devices in my suitcase, what happens if something crushes my bag and the groov RIOs are gone? How do we recover from that? To get out of that problem, one of the things I did really proactively (I'll share what this looks like over here) was to make sure that everything in this stack was isolated down to individual instances that could be packaged, rebuilt, and put back out in some way.
As you can see, this computer vision Docker stack was the alternative to running it on the Raspberry Pi. The idea was that if something happens, there should be the exact same code that I have, the exact same broker, all the same systems, and I should be able to deploy that immediately. If I don't have this and for whatever reason the Raspberry Pi gets destroyed, we're in a world of hurt. If you look back on this side, you can see that each of these factory devices is again its own container, its own stack, its own data system. The entire point was that if something goes wrong with one of the devices, we should be able to rapidly say: the RIO is broken, let's deploy the exact same code that we had before, and let's already have it running. Let's have it running on a second MQTT topic, and if the first one breaks, we have the second one that we can rapidly pivot towards. That's critically important, especially when you're talking about factory resilience. I think that is the top-level understanding of what that actually looks like.
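The "already running on a second MQTT topic" idea can be sketched as a small selector that prefers the primary source and pivots to the standby when the primary stops sending heartbeats. This is an illustrative sketch only: the topic names and the five-second timeout are assumptions, no real MQTT client is involved, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FailoverSelector:
    """Chooses which of two mirrored sources to trust, based on heartbeats."""

    primary: str
    standby: str
    timeout_s: float = 5.0  # assumed threshold; tune for the real deployment
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, topic: str, now: float) -> None:
        """Record that a message arrived on `topic` at time `now` (seconds)."""
        self.last_seen[topic] = now

    def is_alive(self, topic: str, now: float) -> bool:
        seen = self.last_seen.get(topic)
        return seen is not None and now - seen <= self.timeout_s

    def active(self, now: float) -> Optional[str]:
        """Primary if it is alive, otherwise the standby, otherwise None."""
        for topic in (self.primary, self.standby):
            if self.is_alive(topic, now):
                return topic
        return None


# Hypothetical topic names for a physical device and its containerised twin.
selector = FailoverSelector("factory/rio/physical", "factory/rio/virtual")
selector.heartbeat("factory/rio/physical", now=0.0)
selector.heartbeat("factory/rio/virtual", now=0.0)
print(selector.active(now=1.0))   # both alive, so the primary is chosen

selector.heartbeat("factory/rio/virtual", now=8.0)
print(selector.active(now=10.0))  # primary has gone quiet, standby takes over
```

The design choice worth noting is that the standby is running and reporting before anything breaks, so the pivot is a selection rather than a fresh deployment.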
On the factory side of things, the OT side of things, I'm sure you have all seen this: you have a factory line that does a bunch of production, and you may have a secondary line that could be spun up, or you might have a point of contact you can call if someone is sick. In my mind, the base level of what made this work was having fewer single points of failure, because if there are fewer single points of failure, then fewer things depend on that one person. That is step one here. Again, eliminating single points of failure is literally the basement of what you should be doing to resolve this bus factor problem.
The next thing that changed the game was observability. Now, I want to be really clear that when I talk about observability, I'm not necessarily talking about the data generation. What I'm talking about is the context of that data generation.
When you're working with a whole host of devices (I think for this deployment we had something like 26 machines, some hosted, some remote), when all those machines are working in concert and you have to rapidly iterate and rapidly deploy because that one person dropped out, what you need to know is not whether those devices are running. That is the simplest, easiest thing. You don't even really need to know about the data they're generating, although that's helpful. What you really need to know is: are those devices communicating with each other? Is the dropout between those devices significant enough to affect how you code, how you develop? And fundamentally, can you redeploy in a circumstance that is different from the original deployment, but in a way that is as close as possible to a mirror of it?
Let me show you what I mean by that. I can get into any of these devices. We'll look at the QuestDB handler, because that's one that is currently running with quite a lot coming through.
Take the performance side of things. This was a local device. It was supposed to be hosted right next to the buttons, right next to the factory generation, and it was supposed to merge those things together. Now all of a sudden we have to roll these out on a virtual basis. They have to be taken from one instance to a brand new instance. I need to understand a lot of things. I need to understand the CPU utilization. I need to understand the memory utilization when I get into the brokerage of this data.
When I actually open up the data itself, I need to know whether that data... let me get out of this. Actually, let me get over to a hosted instance so I can show you another piece of this. I need to know, when that data is sent back and forth, am I actually able to trace that request back and forth? Am I able to say, "I have made this request to my local LLM; has it processed in an efficient enough way that I don't need to move it closer to the generation source?" That starts getting important when you start adding in latency, delays, blockers.
Here's a good example, our MCP buildout. When all of this was local, that's one thing, right? That is easy enough to deal with and to manage. But take a request for, say, this Fabashi site data, which depends on a server that I'm holding next to me in Japan. When that request is going to servers in the United States, then to servers in, say, Ireland, and then finally to servers in Germany, that multi-step hop is significantly different from the original state, which was just local provisioning and local deployment. You need to be able to measure that and say, "This request has come through and it is tracking the way we thought it would, or it is not." You need to be able to deploy even something like a simple debug to pull out that, you know, this is an API call.
The API call is being triggered in Germany, handled in Japan, and routed back to Germany. I need to know: was that handling introducing so much latency that the request is failing? Because if I see on the debug side that the request hasn't gone through successfully, the person who requested it is not getting the information, and I'm the sole point of failure. I have the whole stack deployed here. If I don't have any observability into the stack, I can't tell whether the problem is the device, the code, or the API being down. Giving yourself as much contextual understanding as possible resolved this problem handily.
So when we rolled all of this out to Hanover, the very first thing I did was collect every single point of observability we had: every piece of RAM information, CPU information, network latency. I ran a whole bunch of different requests. Because we had built out all of this observability as a base requirement of the code, I was able to say, okay, we have 2 milliseconds of additional latency, and that's good enough.
But that also allowed us to make decisions like this: for the buttons, if you push the button and there's 10 milliseconds of latency, or God forbid 50 milliseconds or 2 seconds, that starts to get to the point where it's noticeable. It starts to affect the experience. So let's redeploy the buttons, but in a physical sense. And that's actually what we did with this data. We saw that the remote handling was not enough, that we had to move it locally, and we spun up a whole bunch of local devices. The only reason we were able to make that decision is because of the observability piece. So that's highly critical to this process.
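The keep-it-remote versus move-it-local decision described above can be sketched as a check of measured round-trip latency against a per-component budget. The sample values and the 10 ms budget below are illustrative assumptions, not measurements from the deployment, and the 95th percentile is just one reasonable statistic to compare against.

```python
import math
from statistics import median
from typing import Dict, List


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Median, nearest-rank 95th percentile, and max of round-trip samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile
    return {"median": median(ordered), "p95": ordered[rank - 1], "max": ordered[-1]}


def decide_placement(samples_ms: List[float], budget_ms: float) -> str:
    """Return 'keep-remote', 'move-local', or 'unknown' when nothing was measured."""
    if not samples_ms:
        return "unknown"  # without observability there is nothing to decide on
    return "keep-remote" if summarize(samples_ms)["p95"] <= budget_ms else "move-local"


# Hypothetical round-trip measurements in milliseconds.
database_handler = [2.0, 2.4, 1.9, 2.2, 2.1]
button_handler = [12.0, 55.0, 48.0, 2000.0, 30.0]

print(decide_placement(database_handler, budget_ms=10.0))
print(decide_placement(button_handler, budget_ms=10.0))
print(decide_placement([], budget_ms=10.0))
```

The `unknown` branch is the point of the section: with no measurements, the only honest answer is that the decision cannot be made.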
So let's get into the next side of this, which is portability. When we talk about portability, there's often this idea that it is specifically about code, or specifically about the deployment of the thing. While that is largely true, there's a bit of a caveat.
The caveat is that it's not just where the data is sourced or where the code is operating that levels up into portability.
Portability is also the modality of data sharing. In our case, we had made the architectural decision not to simply send a request directionally but instead to broadcast. Let me explain that a little. If you're familiar with the IT side of things, you might already get the idea of directional versus broadcast. For those who aren't, the idea of sending information, requests, or processing directionally is basically this: I have one instance that generates my requests and my data, and once I get the information from that instance, I send it to a single other instance that collects all my data. While that is efficient in terms of point-to-point resource handling, the problem is that it depends on both of those points existing at all times.
The second you have dropout, the second you have any, you know, latency issues or you have your lead architect stuck in another continent, you run into an issue where that thing fundamentally breaks.
As we were developing this, knowing, or fearing, that I could lose a device or that a system could be broken, the question was: how do we get around that? What is the structural, architectural way around it? The answer for me was: don't just send it to a single device, broadcast it. Let me show you what that looks like on FlowFuse. Take something like this processing for factory events. Anytime someone triggers something like an e-stop, that generates a message written out to MQTT as a factory event saying an e-stop has been triggered, so we can go look at why it was triggered, what was being produced, what the loss was, and so on. When this is generated, you can see that we're actually broadcasting the message. We have the option to send the message to a specific instance, but instead we choose to broadcast, and we broadcast with a specific topic that can be subscribed to by anybody.
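The difference between sending to one instance and broadcasting on a topic can be shown with a tiny in-memory publish/subscribe bus. This is a sketch of the pattern only: it is not the FlowFuse or MQTT API, and the topic name and handlers are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class TopicBus:
    """Minimal in-memory pub/sub: publishers never name a receiver."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def unsubscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].remove(handler)

    def broadcast(self, topic: str, payload: Any) -> int:
        """Deliver to every current subscriber; return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


physical_log: List[dict] = []
virtual_log: List[dict] = []


def physical_device(event: dict) -> None:
    physical_log.append(event)


def virtual_device(event: dict) -> None:
    virtual_log.append(event)


bus = TopicBus()
bus.subscribe("factory/events", physical_device)
bus.subscribe("factory/events", virtual_device)
first = bus.broadcast("factory/events", {"type": "estop", "line": 1})

# The physical device disappears (stranded, crushed in a suitcase, and so on).
bus.unsubscribe("factory/events", physical_device)
second = bus.broadcast("factory/events", {"type": "estop", "line": 2})

print(first, second, len(physical_log), len(virtual_log))
```

The publisher's code is identical before and after the physical device drops out, which is the property that lets a virtual replacement pick up the stream just by listening to the same topic. The returned subscriber count also gives a cheap signal when nobody is listening at all.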
That does a few things. The first is that it changes the nature of that data communication. It breaks the silo of the data itself. Now, this isn't appropriate for all kinds of data.
Certainly you don't want to broadcast everything at all times, but something like this is almost infrastructural in nature. Just to give you a bit of an overview of how this process happens: when we want to query this QuestDB event, we generate the payload.
We cache it locally and then we send the message out to QuestDB. This then goes out to whatever the remote device is.
Once the remote device processes the QuestDB call, that call comes back to us through this node, which marries it with the cache (part of the observability I was talking about), gives us an idea of what instance processed it, and then releases it to the MCP response for a reply. The problem, again, is that if this were just sending to, say, Raspberry Pi 20001 or something like that, and that device went away, then this would be screaming into the void, and there would be no understanding that that is the problem until we went looking through each device.
If you're getting rid of those single points of failure, if you're getting to the point where this data can be handled with failover, and if you instead choose to broadcast something infrastructural like this, then you can broadcast to any device, which in this case was critical because it let us move from the physical device to the virtual one. As long as the instance is listening to that same topic, you know exactly what is happening. You're able to pick that data back up, you get seamless handling, and you can very quickly say, "The original device has collapsed. No problem. Let's go to this other device." The other portable element here is choosing multiple places for that data to live, and then choosing where you pull it from depending on your need.
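The cache-then-marry flow described here (cache the outgoing payload, match the reply back to it, record which instance answered) can be sketched as a pending-request table. Field names such as `id`, `instance`, and `rows` are assumptions made for illustration, not the actual message format used in the demo.

```python
import itertools
from typing import Any, Dict, List, Optional


class PendingRequests:
    """Caches outgoing queries and marries replies back to them by id."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._pending: Dict[int, Dict[str, Any]] = {}
        self._ids = itertools.count(1)

    def send(self, query: str, now: float) -> Dict[str, Any]:
        """Cache the query locally and return the message to broadcast."""
        request_id = next(self._ids)
        self._pending[request_id] = {"query": query, "sent_at": now}
        return {"id": request_id, "query": query}

    def receive(self, response: Dict[str, Any], now: float) -> Optional[Dict[str, Any]]:
        """Join a reply with its cached request; None if unknown or a duplicate."""
        cached = self._pending.pop(response["id"], None)
        if cached is None:
            return None
        return {
            "query": cached["query"],
            "rows": response["rows"],
            "handled_by": response["instance"],
            "round_trip_s": now - cached["sent_at"],
        }

    def overdue(self, now: float) -> List[int]:
        """Requests nobody has answered in time: the 'screaming into the void' case."""
        return [
            request_id
            for request_id, request in self._pending.items()
            if now - request["sent_at"] > self.timeout_s
        ]


requests = PendingRequests(timeout_s=5.0)
message = requests.send("SELECT * FROM factory_events", now=0.0)
reply = {"id": message["id"], "rows": [["estop", 1]], "instance": "virtual-handler"}
print(requests.receive(reply, now=0.25))

requests.send("SELECT * FROM staff", now=1.0)
print(requests.overdue(now=10.0))
```

Because replies are matched by id and popped on first arrival, a second listener answering the same broadcast is simply ignored, and an unanswered request surfaces as overdue instead of failing silently.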
So, for instance, these "write to staff table" and "query staff table" systems are actually using FlowFuse tables. Take a get-staff-data request: we build the staff query, whatever it is. If we say, "Hey, who's the manager of the San Francisco site?", it'll build that query and send it out to the query table, and what we get back, which is basically just the result of a SQL query, comes back down through the MCP response. Importantly, however, on the other handling for QuestDB, that staff data is also ported out and written to the database.
What that means is that if we ever had to get rid of the FlowFuse tables, if they became corrupt, or if we were using an external partner and they just stopped offering the product, none of that would be an issue for us, because the data itself is a first-class object that is portable across multiple sources. We can say, "Our first source of truth has collapsed. Move to the second source of truth." So you can start to see that this isn't necessarily portability of data; it's portability of paradigm. That was extremely helpful for this process as well.
The next piece is, I feel, a seldom understood benefit of FlowFuse, and really a seldom understood risk of not using FlowFuse: granular rights and access controls. When we talk about granularity, it's a balancing game, right? You want to give people as much access as possible. You want to say, "The stack has collapsed. Isn't it great to have a hundred people all able to fix this thing?" The problem is that as much as you're distributing the ability to build, the ability to respond, and the ability to iterate, you're also distributing risk.
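The "second source of truth" idea can be sketched as writing each record to every available store and reading from the first one that can answer. The `Store` class below is an in-memory stand-in and the store names are hypothetical; it does not use the FlowFuse tables or QuestDB client APIs.

```python
from typing import Any, Dict, List, Tuple


class Store:
    """In-memory stand-in for a table or database that can become unavailable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._rows: Dict[str, Any] = {}

    def write(self, key: str, value: Any) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._rows[key] = value

    def read(self, key: str) -> Any:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._rows[key]  # raises KeyError if the record is missing


def write_all(stores: List[Store], key: str, value: Any) -> List[str]:
    """Write to every reachable store; return the names that accepted the write."""
    written = []
    for store in stores:
        try:
            store.write(key, value)
            written.append(store.name)
        except ConnectionError:
            continue
    return written


def read_first(stores: List[Store], key: str) -> Tuple[Any, str]:
    """Read from the first store, in priority order, that can answer."""
    for store in stores:
        try:
            return store.read(key), store.name
        except (ConnectionError, KeyError):
            continue
    raise LookupError(f"no store could answer for {key!r}")


tables = Store("tables")      # hypothetical first source of truth
database = Store("database")  # hypothetical second source of truth
sources = [tables, database]

write_all(sources, "manager:san-francisco", "Alex")
tables.available = False      # the first source of truth collapses
print(read_first(sources, "manager:san-francisco"))
```

The caller asks for the data rather than for a particular store, which is the sense in which the data is portable across sources.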
And so, as I said earlier, we have QuestDB that's handling all of this data. By design, it's pulling into one instance that has a failover instance elsewhere. If I were to give people write access to that, and they reset the system or erased old snapshots or did any of these additional things, all of the sort of house of cards that I've built, of the layered security, of the layered portability, of the layered reactivity, all that essentially goes away, right? It becomes something where I'm not just distributing the iteration of the thing, I'm distributing the risk. And so the question was: if I want granular rights and granular access control, how do I distribute capability without distributing risk? On FlowFuse, the easiest way that we did that was assigning teams. When you're in FlowFuse, everything is aligned around teams, applications, groups, these sort of things. This team is comprised of a bunch of people. There are some people that were local to the site. There are some people that are remote partners. And each of these people have their own permissions. In some cases, if I want Sherman to have access to any of these applications, I can just give him ownership. It's whatever, right? If I have someone locally who just really needs to be able to see a thing and work on a thing, but not necessarily be able to go in and make huge changes, like adding new people to the team or changing the way that we're pushing data forward, then I can assign them a member role. If I want even tighter controls, if I want someone to be able only to view something critical, like the test benches where we were iterating on the rapid fixes that need to be deployed, I can then implement an even stricter limitation. Now, one thing that I want to make really, really clear here is that this is as much a thinking-ahead-of-time problem as it is an on-site problem.
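As a rough sketch of the owner, member, and view-only tiers just described, the idea of distributing capability without distributing risk can be written as a small permission check. Everything here is an assumption made for illustration: the action names, the `GRANTS` table, and the users are hypothetical, and this is not how FlowFuse implements its roles.

```python
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical role levels; a higher value includes the lower ones."""
    VIEWER = 1
    MEMBER = 2
    OWNER = 3


# Assumed mapping of actions to the minimum role they need (illustrative only).
REQUIRED_ROLE = {
    "view": Role.VIEWER,
    "edit_flows": Role.MEMBER,
    "add_team_member": Role.OWNER,
    "delete_snapshot": Role.OWNER,
}

# Per-application grants: a user with no entry cannot even see the application.
GRANTS = {
    "dashboard": {"alice": Role.OWNER, "bob": Role.MEMBER},
    "test_bench": {"alice": Role.OWNER, "bob": Role.VIEWER},
    "database_writer": {"alice": Role.OWNER},
}


def is_allowed(user: str, application: str, action: str) -> bool:
    """Deny by default: unknown users, applications, and actions all fail."""
    role = GRANTS.get(application, {}).get(user)
    required = REQUIRED_ROLE.get(action)
    if role is None or required is None:
        return False
    return role >= required


def visible_applications(user: str) -> list[str]:
    """Applications the user has any role on; everything else stays hidden."""
    return sorted(app for app, members in GRANTS.items() if user in members)


print(visible_applications("bob"))
print(is_allowed("bob", "dashboard", "edit_flows"))
print(is_allowed("bob", "test_bench", "edit_flows"))
```

In this sketch, `bob` can iterate on the dashboard, can only look at the test bench, and has no way to touch (or even list) the application that writes to the database.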
You obviously can't know. Like in my circumstance, I was stranded. I didn't know who was going to be going out there to fix it. We had three or four people who we weren't planning on being in Germany who all of a sudden had to go to Germany. And so that is something that we can plan for ahead of time. You can plan some of it, but you can't plan all of it, right? And so if you can't plan for the people who are coming out, you can at least rapidly respond to that. But ahead of time, you have to give some thought as to how you would actually implement this if the worst-case scenario happened. In my case, I knew, hey, if the worst-case scenario happens, there are some applications that I want people to access. There are some applications I want them to not even be aware exist.
Because if I'm handling all of the data, I'm totally fine with them iterating on the dashboard while I'm sleeping. I'm not fine with them iterating on something that could do an errant write to the database, drop the entire table, and everything gets broken. I don't want that. And so having early planning for it is not just good for security, although it's very good for security. This should be sort of a core function. It's also really, really good for resilience, because then you're not fighting other people in your team to get the thing fixed. You're fighting the problem as the problem currently exists.
Another critical piece of this is documentation, documentation, documentation, documentation.
Again, when you're having a resilient deployment, when you're having a system that you weren't expecting to be broken that is all of a sudden broken, for whatever that reason is, you need to have ample documentation. And this is again something that is sort of the pre-thought: before the problem happens, that is the most important time to do this. So for all of the stack that we developed, there was internal documentation. We were sharing with the team ahead of time, sharing build states, sharing sort of daily, "Hey, this is what this looks like," for the people who were on the ground. The second that we knew that there was a problem, we were getting everybody in the same room and saying, "Hey, this is a critical problem. We're going to have to pivot very quickly, and here's the documentation for how this works." That in and of itself is a huge, huge value multiplier and a huge solver of that bus factor problem.
Again, the bus factor sort of assumes a handful of things. It assumes that the person being removed either has the tribal knowledge, the siloing that I talked about at the beginning, or they have the physical limitation, the physical devices and all that. You can solve half of that. You can't solve the physical problem of it: having documentation doesn't make it so you can teleport to Germany with the full server stack. What it does do is mean that the fixes to all those problems are something that can be implemented by anybody, and that "by anybody" takes you from a bus factor of one to a bus factor of 20 or 100, right? In theory, anybody who was on the site could hop into a call with me and understand what was going on over the course of like 10 minutes. And that means that your bus factor almost becomes infinite, although that is sort of a weird way to say it. But documentation is a huge piece of this.
The next part of that, related both to the documentation and to eliminating single points of failure, is this idea of version control and snapshotting. Very quickly: version control and snapshotting are often talked about in the same sentence, but they're slightly different things. Version control is really having an understanding and a mechanism by which you can say, okay, this is a breaking change, this is a major iteration, this is a small iteration. This is something that we understand. And so I can look at the version of the thing that I have and I can understand what has happened.
Snapshotting is part and parcel of that, but it's a little different. It is taking a slice of understanding, taking a literal snapshot of your deployment, and having that be sort of your fallback.
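The distinction can be sketched in a few lines: a version label says what kind of change happened, while a snapshot is a full copy of the deployment state that you can fall back to. This is a minimal sketch under stated assumptions, not FlowFuse's snapshot mechanism. `SnapshotHistory` and `bump` are hypothetical names, and mapping "breaking", "feature", and "fix" onto a three-part version number is my assumption, borrowed from semantic versioning.

```python
import copy
from dataclasses import dataclass


@dataclass
class Snapshot:
    version: str
    description: str
    state: dict


class SnapshotHistory:
    """Hypothetical in-memory snapshot store for a deployment's state."""

    def __init__(self) -> None:
        self._snapshots: list[Snapshot] = []

    def take(self, version: str, description: str, state: dict) -> Snapshot:
        # Deep copy so later edits to the live state cannot alter the snapshot.
        snap = Snapshot(version, description, copy.deepcopy(state))
        self._snapshots.append(snap)
        return snap

    def restore(self, version: str) -> dict:
        """Return a fresh copy of the most recent snapshot with this version."""
        for snap in reversed(self._snapshots):
            if snap.version == version:
                return copy.deepcopy(snap.state)
        raise KeyError(f"no snapshot for version {version}")


def bump(version: str, change: str) -> str:
    """Version control part: encode how big the change was."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


# Illustrative state; the node names are made up.
flow = {"nodes": ["mqtt in", "dashboard"]}
history = SnapshotHistory()
history.take("1.0.0", "last known good state", flow)

flow["nodes"].append("experimental node")
next_version = bump("1.0.0", "feature")
history.take(next_version, "overnight iteration", flow)

restored = history.restore("1.0.0")  # roll back to the known good state
print(next_version, restored)
```

The description field matters as much as the data: a snapshot labeled "last known good state" tells whoever picks this up which fallback is safe without having to ask the person who made it.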
And when it comes to the problem that we had on site, this honestly was one of the biggest things that we actually used. I'll show you hosted instances and I'll show you the remote instance, because they're both relevant to this conversation. If I look at something like site management, when you get into the version history of this, you can see there are all sorts of versions, every time I did something. There's a very clear thing that was generated. There's also an auto snapshot. In addition to that, there were also some purposeful snapshots. So if I go over here, you can see that there was a very purposeful backup that said, hey, this was a critical change, this was a critical functional change. Let's get away from auto snapshots, because if we have to rapidly roll back to a thing, if we have to rapidly iterate on the thing, I just want someone to know that this was the last time this looked good. And when you look at what the snapshot looks like, we have a whole host of changes, a whole host of systems. Now, what is important about this is making sure that you're actually being clear about the context of it: that this was the backup that was happening, and I alerted the team that this is happening. This is the backup at this period of time.
Now, why does this matter for the resilience side of it? There are two sort of branches here. The first branch is related specifically to your ability to have a resilient stack, your ability to say, "Okay, an iteration has happened. I don't like this iteration. I need to roll it back." That's one piece of it. The other piece of it is the ability to take that resilience and multiplex it and distribute that solution forward.
So, real quickly before I continue, I do have another poll question for you all that is relevant to this section, which is: how resilient is your stack? Considering everything that we just talked about, all the solutions that we've talked about, is your stack actually able to do any of it? There are a couple of answers here: is it highly resilient, somewhat resilient, etc. There's also a secondary question here, which is: how do you currently manage your stack resilience? We've talked a little bit about duplication and failover, and layered ownership. What we're talking about now would be roughly version control, and also maybe some automatic rebuild.
So the the two branches that I'm talking about, the first one is really simple.
It is okay, I'm going to bed. I'll see you guys in the morning. When I wake up, does the thing still work? And if it works, great. I'll make a snapshot. we have good iteration, we'll move forward.
If it doesn't work, when can we roll back? What backup can we roll back to, to make sure that we have the thing that we want? That's one half of it. And that's resilience in the sense of preventing issues from happening. In our circumstance, we had a really interesting problem, because the issue had already happened, right? I was already stuck in a foreign country.
There was no way that my magic was going to solve this. And so the second benefit that we got from this was the ability to take these snapshots and roll them out as a fix. Let me show you what that actually functionally looks like. When I first started developing these systems, I had, as I said earlier, used some groov RIOs. Now, these RIOs were limited to me. They were physically located next to me. There wasn't any way we were going to have them on site: I wasn't going to ship them to Germany. They wouldn't make it in time.
And because I had gone through this process of version control and snapshots as a functional requirement of the codebase, everything that I did had to be documented and had to create a snapshot. Because I had these backups, I was able to take this exact current state, export it, and move it over to the new instance. And in a couple of minutes, I had the exact same data system. Because everything was being broadcast and everything was tied to our MQTT broker, I was able to take this flow and just copy it, paste it, and be done. And so if you have adequate snapshotting, the problem of "okay, our systems are down, we need to fix it" goes from a several-hour fix to a several-minute fix. This does require developing your code with this in mind: that, you know, I could get hit by a bus and then all of a sudden the stack collapses. But if you adopt that mindset and you adopt this as a process, it becomes much, much easier.
The last critical thing here is actually a pretty complicated one, almost a little bit of an esoteric one, and that is this idea of decoupling production from consumption and decoupling generation from communication.
Now, what I mean by that, because that's a very heady concept, right? What we're talking about here is that any device that produces something should not be the sole device that then pushes that data or consumes that data. So our factory simulation, our groov RIOs, all these devices, they were writing their state out, but they were not the critical ones that were ingesting that data into the overall platform. It's not like we had the groov RIO as the only way you could interact with a button, so that if that groov RIO dropped, there was no way to interact with the buttons. Instead, the groov RIO is writing to this topic, and whatever is written to that topic is then consumed by a hosted instance or a new local instance or what have you. The decoupling of that is quite significant.
Look at the code level, and I'll show you all of the stack that we have right now: how all of these systems actually functionally worked, the way that these systems interact, the way that all of the platform works together. This top level has some AI functionality. If I were to take all of that AI functionality and have it in this nice little device called AI bridge, and that AI bridge goes down, my recovery for that is not just redeploying the code. It is redeploying the code, changing how the devices talk to each other, figuring out how to ingest that data back in, figuring out how to reassign the roles, figuring out all of these additional systems that are way, way more complicated than they should be. I don't want that. If we're in a critical problem, what I really want is to be able to just say, "Okay, create a new instance and we're done."
We get that by decoupling the consumption and the production, the generation and the communication, by taking all of these devices and pushing everything over to our broker. And you can see here our broker is absolutely massive. Each one of these things is being written to constantly. By having that separation, if one piece fails, it's not like we have to rebuild the entire stack. We just have to rebuild that one node. And that's critical. That's really one of the few things that made this process as rapid as it was. When I say we rebuilt and had redeployment within 24 hours, it was literally like the same day we had, I think, 90% of what we wanted, and then from there it was all finer details for the rest of the time. And that's quite significant. That's not just a small sort of jump over. That is a huge, huge value multiplier.
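To make the decoupling concrete, here is a minimal in-memory sketch of the publish and subscribe pattern described above: the producer writes to a topic and never addresses a specific consumer, so a replacement consumer only needs to subscribe to the same topic. The `Broker` class and the topic name are hypothetical stand-ins, not an MQTT client. In particular, returning the number of deliveries from `publish` is a convenience of this sketch; a real MQTT broker does not report subscriber counts back to the publisher.

```python
from typing import Callable

Handler = Callable[[str, dict], None]


class Broker:
    """Hypothetical in-memory broker: producers and consumers only share topic names."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def unsubscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].remove(handler)

    def publish(self, topic: str, payload: dict) -> int:
        """Deliver to every current subscriber and return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


TOPIC = "factory/button/1"  # assumed topic name, for illustration only
broker = Broker()
original_log: list[dict] = []
replacement_log: list[dict] = []


def original_instance(topic: str, payload: dict) -> None:
    original_log.append(payload)


def replacement_instance(topic: str, payload: dict) -> None:
    replacement_log.append(payload)


broker.subscribe(TOPIC, original_instance)
broker.publish(TOPIC, {"pressed": True})

# The original consumer disappears; the producer code does not change at all.
broker.unsubscribe(TOPIC, original_instance)
delivered = broker.publish(TOPIC, {"pressed": False})

# Recovery is just "create a new instance" that listens to the same topic.
broker.subscribe(TOPIC, replacement_instance)
broker.publish(TOPIC, {"pressed": True})
print(delivered, original_log, replacement_log)
```

Because the producer only knows the topic, swapping the physical device for a virtual one, or one consumer for another, touches exactly one node rather than every connection between devices.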
So the story here is really one of designing your system around the worst-case scenario and being aware that this could happen. That is like 99% of the problem. It's one thing to say, okay, in the future this could be a problem. That is so abstract and not real. Knowing that it can be a problem and then planning for it being a problem from day one means that when the problem happens (and knock on wood, it'll never happen to you, right?), as long as you're applying these standards and these factors, you can rapidly respond, and then you're talking about a recovery of a few hours instead of a recovery of a few days or months or worse.
Awesome. So yeah, as I said, this was an absolutely wild event. I did not even imagine this was possible. But this is the reality of where we're moving, sort of en masse, right now. I think we all have to be aware that the world is much more complicated than it was 10 years ago or 50 years ago, and these sorts of issues are unfortunately going to become more commonplace. And as those issues become more commonplace, we need to start thinking about them as a common issue in the business and not just some random happenstance that could happen once in a blue moon. Chances are it's actually going to be more likely to happen. So, thank you all for attending this talk. If you have any questions, feel free to email me. Feel free to go to flowfuse.com and click the contact us button. Without any further ado, thank you so much for attending today's session. We will talk next time, and until the next deployment.
Thank you for coming.