A robust AI operating system requires a 5-tier architecture with failover capabilities: (1) Context layer (knowledge, memory, state) stored in portable formats like markdown files; (2) Skills and agents that are open frameworks testable across models; (3) MCP, APIs, and credentials that are universal; (4) Interface and dashboard for observability; (5) Runtime and plugins that can be distributed across platforms. To prevent vendor lock-in, developers should test skills across multiple models (e.g., Claude and Codex), implement monitoring with alerts, and create automatic failover mechanisms that redirect to backup models when primary services experience downtime.
Deep Dive
The Claude Code AIOS Failover Layer Most People Skip
So, we've been building an AI OS on this channel inside Claude Code natively for the last few months and in that time I've had people asking questions about vendor lock-in and also what happens when Claude goes down. Two very valid questions that we're going to solve in this video. We can get them knocked out in less than 15 minutes. So, when we are trying to build our AI operating system, we want to focus on having a portable design and you'll see by having these separated layers and things that it's actually very easy to not get vendor locked in at all. But, before we get into that, we need to understand just how many times Claude has actually gone down in the last 90 days. So, as you can see over here, currently while I'm filming this, there is already an issue with the Windows Claude Code IDE. So, people cannot currently use that. But, if we look over the last 90 days over here, you'll see that there's actually a fair amount of red and orange in here and I'm sure you've been affected by this because everybody starts frothing at the mouth when they can't use their AI. So, you've definitely heard about people complaining about this. There is a little bit more nuance to it there because just because Claude.ai is down, doesn't mean that you can't use the API.
It doesn't mean that you can't use Co-work. And even inside here, you'll see that certain things go down within each of these. So, it's not like if the API goes down, everything is broken.
Sometimes it's just Haiku that's not working. Sometimes it's Opus, things like that. So, we need to factor this in though because if we have workflows that are running and they absolutely can't run during this window and they're mission critical, that's a big problem for us. So, how the hell do we actually deal with this? And it's actually a pretty easy problem to solve.

So, I'm not going to go into a deep dive into every single one of these tiers. I've done that in a separate video, but you need to hear this for context so that the point is relevant. So, tier one is our context layer and I'm not lumping everything into memory here. For me, context is three separate things and memory forms a part of it. First, we have knowledge and this is just markdown files that define your voice, your ICP, what your business is, what you're trying to do, stuff like that. Second, memory is things that the AI learns about you over time itself and it extracts that from your information. What it does with this is up to you and your business needs. I always take a constraint-first approach. For me, I stash my stuff in markdown. I have nothing complex anymore. Then third, we have state and this just tracks your lead processes and whatever else is part of your workflow where we are tracking certain movements during that process. It's entirely different to the other two things, and your workflows depend on this kind of thing.
Now, when it comes to vendor lock-in on this type of thing or worrying about failure of it, we don't have such a big problem here because MD files can be accessed anywhere by any model. Same for memories, we can do what we want with them, especially if we've just got them stored in markdown. And if it was a RAG database, you could just hook that up to another model. Same thing for state. If you are using a spreadsheet, that would still be in Google Sheets, so it stays available even when your AI provider is down. Same thing if you're using Airtable or something like that.
So, catering for vendor lock-in over here is not a problem because we own this thing almost entirely or it lives in a system outside of AI. When we get onto tier two, these are your skills and your agents. So, these are the repeatable workflows that you have built out based on your business processes.
They go out there and do all of the autonomous work for you. They have references and scripts attached to them because that's how they do their job.
Sometimes they connect to MCP or sometimes they just use an API directly.
Whatever that may be, that's easily solvable as well because skills are an open framework and so are agents. So, you can easily use these interchangeably between models. Now, the caveat over here is that you need to make sure that you test them across these models because just because your output is going to look like X inside Claude, doesn't mean it's going to be the exact same thing inside Codex. So, what's very important here is that you need to make sure when you're building that you're actually catering for this during the build phase. You can easily test these by just flipping open another model.
Let's flip across to my environment here, and you can easily just have Codex set up on the panel here while you're testing your skills with Claude over here. And in no way am I suggesting that you need a Claude and a Codex subscription. That would be silly. I've got both because I do so much building and I do YouTube and all that stuff, but realistically, what you can do is choose your main provider. In my case, it's Claude. I have a Max subscription for that, and it runs all of my AI operating system. And then if I need to, I just set up Codex to run on the API. So, one of the easiest things that you can do here is instead of logging in with a subscription, if you don't have one, just sign in with an API key, top up some API credit, and then you can use this for testing inside your environment over here. Really easy to do, and it also ties in with our failover that we're going to get into in just a little bit.
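A minimal sketch of what that cross-model check could look like when scripted, rather than eyeballed in two panels. It assumes both CLIs are installed and can be driven headlessly with `claude -p` and `codex exec`; confirm those invocations against each tool's own help output. The helper names and the idea of checking for required headings are illustrative assumptions, not something shown in the video.

```python
import subprocess
from typing import Callable, Dict, List

# Assumed headless invocations for each agent CLI; verify locally before use.
HEADLESS_COMMANDS: Dict[str, List[str]] = {
    "claude": ["claude", "-p"],
    "codex": ["codex", "exec"],
}


def run_headless(agent: str, prompt: str, timeout_s: int = 600) -> str:
    """Run one prompt through the given agent CLI and return its stdout."""
    cmd = HEADLESS_COMMANDS[agent] + [prompt]
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return result.stdout


def missing_sections(output: str, required_headings: List[str]) -> List[str]:
    """Return the headings that the skill output was expected to contain but did not."""
    return [h for h in required_headings if h not in output]


def compare_agents(
    prompt: str,
    required_headings: List[str],
    runner: Callable[[str, str], str] = run_headless,
) -> Dict[str, List[str]]:
    """Run the same prompt on every agent and report what each output is missing."""
    return {
        agent: missing_sections(runner(agent, prompt), required_headings)
        for agent in HEADLESS_COMMANDS
    }
```

An empty list for an agent means its output had every expected heading. Passing a fake `runner` lets you try the comparison logic without spending API credit.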
Then we get onto the next part, which is agents, and of course these are part of our skills workflow a lot of the time, not all the time, but either way AGENTS.md exists inside Codex as well.
So, these are mostly interchangeable.
What it comes down to here is just confirming that their behavior is the same.
Then of course we have hooks, and Codex has hooks as well. There are a few differences between Claude and Codex in terms of hooks. I'll put a guide down below rather than going through each of the differences, so that you can have a look at them and see if it actually matters to you. The point here though is that there is always a different way to solve a specific problem, and realistically the majority of businesses out there aren't using hooks. It's more for a software development approach, or for doing some kind of really cool back-end thing, like if a stop hook fires, you then go and programmatically kick off a workflow.
There are several actions you can take, but the point I'm making is that most businesses will probably never need to use those things. Then we get onto tier three, easy solution, MCP is universal.
The majority of what's available on Claude is also available on Codex.
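As a rough illustration of how portable an MCP server definition is, here is a sketch that converts stdio servers from a Claude-style JSON file into TOML tables. It assumes the JSON uses a top-level `mcpServers` object and that Codex reads `[mcp_servers.<name>]` tables with `command`, `args`, and `env` keys from its `config.toml`; check both formats against the current documentation. The server name and token in the sample are made up.

```python
import json


def toml_string(value: str) -> str:
    """Quote a string for TOML; JSON escaping is close enough for simple values."""
    return json.dumps(value, ensure_ascii=False)


def mcp_json_to_codex_toml(mcp_json_text: str) -> str:
    """Convert stdio MCP servers from an mcpServers JSON document into TOML tables."""
    servers = json.loads(mcp_json_text).get("mcpServers", {})
    lines = []
    for name, spec in servers.items():
        if "command" not in spec:
            continue  # this sketch only handles servers launched as a local command
        lines.append(f"[mcp_servers.{toml_string(name)}]")
        lines.append(f"command = {toml_string(spec['command'])}")
        args = ", ".join(toml_string(a) for a in spec.get("args", []))
        lines.append(f"args = [{args}]")
        env = spec.get("env", {})
        if env:
            pairs = ", ".join(
                f"{toml_string(k)} = {toml_string(v)}" for k, v in env.items()
            )
            lines.append(f"env = {{ {pairs} }}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical server definition used only to demonstrate the conversion.
SAMPLE = json.dumps({
    "mcpServers": {
        "notes": {
            "command": "npx",
            "args": ["-y", "example-notes-mcp"],
            "env": {"NOTES_TOKEN": "changeme"},
        }
    }
})

if __name__ == "__main__":
    print(mcp_json_to_codex_toml(SAMPLE))
```

The server itself does not change between the two tools; only the file that declares it does.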
There's a few little differences here and there, but nothing that is going to drastically affect your workflow, because again it's been built to be an open standard. Same for APIs, that's not a problem at all, neither are your environment credentials, because obviously they form a part of this, and these are all portable. So, as long as we have all of this stuff set up in our environment, both of these models are going to be able to access them. Now, the difference here is that with Claude we store our MCP servers inside our .mcp.json down here, and inside Codex it works a little bit differently. If we click on Codex settings over here, you'll see that we have a little tab for MCP, and we can enable whatever we need over here. This just ends up inside our configuration file, which is a config.toml file, and helps you govern whatever Codex is going to be doing, changes a little bit of behavior. It's kind of similar to the settings.json over here, almost one-to-one identical, but this is also where you would stash MCP servers that didn't have proper configuration. So again, you would just need to cater for this sort of thing, but don't worry, I'll put it in the guide below as well, all the differences that we acknowledge between the different models.

Then we go on to tier four, and that is our interface layer. So this is how we watch what is happening inside our AI operating system. Now, depending on whether you're using Claude or Codex, there are obviously different ways that we would have to tackle this problem. For me, my command center is built primarily around Claude Code, but you can see it gives us all the observability in the world that we would ever need for this sort of thing. And this shiny dashboard will also help us understand when something has gone down and when we actually need to have failover in place amongst a whole bunch of other things. The issue here though is that this is all built around Claude Code.
So I'm grabbing all of this information from OpenTelemetry specifically for Claude, and then also some of the JSON logs as well. But you can access OTel for Codex at the time of me recording this. I think it's a little bit more limited, but you could still tailor a lot of this functionality for Codex if you actually need it. But something to note here, this is completely custom and it's not needed for an AI operating system to function.
This is just an observability layer, which is definitely still very important, but it's certainly not something that is going to keep you trapped on one vendor or another. And as long as you've got it built for the vendor that you're primarily using, it will serve its main purpose of telling you when something is about to go down before you need to make a change. Then next up, if you have a Telegram bot, that's easily changeable. Again, that's just an API key, simple as that. There is more functionality at the time of me recording this with Claude, but it really doesn't matter because you can still do the same thing regardless.
Then mission control, which is what some of you are building out there to go and run tasks and do a whole bunch of other agentic stuff. For me, in all of my videos I've said that that's probably an unnecessary layer in the first place, and all you're doing is adding complexity to a business when it doesn't need it, and extra failure points that you don't even need to look after. So for me, this is something you probably shouldn't have if you're running natively inside Claude Code, because you can just use Co-work.

Then finally, we get to tier five, and this is runtime. So this is exactly what the user is going to be using in order to run their AI operating system. Several things form a part of this and depending on the type of the user, whether they're a power user or someone really technical, or if they're just the everyday worker who doesn't care about any of that stuff, that defines what we would use. More technical people, they're obviously using VS Code like I do inside here. Less technical people or people who don't want to deal with that kind of thing, they're going to use the desktop app. More importantly, they're going to be using Co-work and have their projects built out like we do in our videos on here for the AI operating system model. And this is where you're probably thinking, now you're trapped.
Once they get you in the desktop software, there's no leaving, you're trapped in their basement. But that's not true either, because if we head on over to customize and we click on plugins, you'll see that mapped to the four pods of my business, I have every single skill that forms these.
Everything that I do for my AI operating system is stored as plugins and this is what I always recommend in my videos because it's the easiest way to distribute them from a marketplace to a business and take care of them between GitHub and the marketplace. So equally, this works exactly the same in Codex.
There's a few nuances around how the structure needs to look in your marketplace, but for the most part, it is identical. So if we flip on over to Codex over here, you'll see that we have a plugins tab, we have automations just like we would inside Claude, and we also have projects. So we can build out near identical one-to-one solutions if we ever wanted to switch between models, and we can move across all of our skills in the form of plugins quite easily.

And then finally on this, we have routines which obviously live in Anthropic's cloud at the moment. At the time of me recording this, Codex does not have a direct one-to-one competitor for that or for managed agents. But the idea here is that Anthropic would want to get you in their cloud because the more routines that you have running, the more likely you are to stay with them. It all comes down to not just the complexity of moving away, but the time that it is going to take you due to that complexity or to something else. So if you have tons of routines running in their environment, you might stop for a second and think, "Hmm, maybe I don't want to do this because of the amount of effort." And that's a totally valid point, but you're still not locked in because right now routines is still running off of a GitHub repo. For the most part, you still put all your skills in there and whatever else you need to complete the role, and routines is just acting as a scheduler plus a webhook or maybe something else that forms part of that little routine. But equally, you still own that GitHub repo that does the majority of the work. So from my perspective, vendor lock-in is absolutely not inevitable.

But now I need to deliver on the second half of this video and tell you what I do whenever Claude goes down. So first things first, you need to be testing while you build.
Like I said, I build everything with Claude. That's my daily driver, which I absolutely hate saying and I have no idea why I said that. Once I've done that, I then go and push it into Codex.
Codex does the testing, it makes sure it can read the skill. And if you flip across to my environment over here, you can see earlier today I just said run my daily news monitor with a dyslexic mind, and it still went out there and did exactly what I needed it to do. Now the thing that I noticed here is that it took a lot longer for the skill to actually run. It took 4 minutes, which was pretty long for this, but at the same time it put it out in the exact same format that Claude would do it. So for me, I know that this works.
Something to note here though is that the more important the skill is, the more hoops that it has to jump through.
If you have any skill chaining and things like that, there's obviously a lot more testing that you need to do. I wouldn't just run this thing once and be like, "Yes, this is perfect." You're going to need to do a few iterations on this type of thing to make sure that it does get picked up in the same way every time, in the way that you expect. I have a whole video on skill chaining and things like that that you can check in the description below if you want to understand how to do that. But then once I've done my cross-agent testing, the next most important part is to triage by impact. So I would look at all of the skills in my environment and work out, okay, what is the most important thing and what happens if this thing does not run within the window that it's supposed to run in. So if we take one of my skills as an example over here, let's just have a look at research lead.
I would say this is pretty important.
Again, I don't really care if this runs on time or not, but let's just say this is the most vital thing on the planet.
If this thing did not run, what would I do? Cool. So, I would make a list of any skill that shared that characteristic, and I would put it into a column. If this does not run at 7:00 a.m., what are we going to do about it? Then, I would have separate columns for the things that I can defer. Do I really care that this thing doesn't run now? No. But then, what am I going to do about that?
And that comes down to deferring it, meaning I'm just going to run it whenever Claude comes back online.
Either that will be manually, or we can set up something to make sure that anything that was scheduled and couldn't run just kicks off when it can. Then, the final part here is drop, and that's for the things that we don't care about.
Like, I don't care if my news doesn't run for the day. It's not the biggest issue in the entire world. I can run it manually if I want to. And if it doesn't run, that's okay. I can just drop that.
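The three columns described above (critical, defer, drop) can be written down as data, so the decision is made once rather than in the middle of an outage. This is a minimal sketch under assumptions: the `Tier` names, the `Skill` record, and the `flush_deferred` helper are hypothetical, and `weekly-report` is an invented skill name used only to fill the defer column.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Tier(Enum):
    CRITICAL = "critical"  # must run inside its window, so it needs failover
    DEFER = "defer"        # can wait and run once the primary is back
    DROP = "drop"          # skipping a run is acceptable


@dataclass
class Skill:
    name: str
    tier: Tier


def plan_during_outage(skills: List[Skill]) -> Dict[Tier, List[str]]:
    """Group skill names by what should happen to them while the primary is down."""
    plan: Dict[Tier, List[str]] = {tier: [] for tier in Tier}
    for skill in skills:
        plan[skill.tier].append(skill.name)
    return plan


def flush_deferred(
    deferred: List[str],
    primary_is_up: Callable[[], bool],
    run: Callable[[str], None],
) -> List[str]:
    """Run deferred skills once the primary is back; return the names still waiting."""
    if not primary_is_up():
        return list(deferred)
    for name in deferred:
        run(name)
    return []


SKILLS = [
    Skill("research-lead", Tier.CRITICAL),
    Skill("weekly-report", Tier.DEFER),  # hypothetical example skill
    Skill("daily-news", Tier.DROP),
]
```

Calling `flush_deferred` on a schedule covers the "kick off when it can" case, while running the deferred list by hand covers the manual one.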
What we're really doing over here is trying to decide what needs failover, and what type of failover does it need?
Can it be manual, or can it be automatic? For most of the stuff that gets deferred, you can probably just run that manually later. You could set up something that does the scheduling for you whenever it comes back online. But for the critical stuff, we need to figure out a plan for that as fast as possible. So, the first thing to have is obviously to have monitoring in place.
If we can't see what is happening, we're never going to know about it. And then, we can never actually fire alerts to our cell phone, or to whoever the hell is watching this thing, maybe over Slack or email. But the point is, if we don't have monitoring, like this shiny little dashboard over here, we could never know. So, what my dashboard does is it monitors Anthropic for each of the metrics that we saw on Anthropic's dashboard. And whenever one of those things goes off, if it turns out that one of my workflows is affected, it will fire me an alert. So, that gets reported on over here, and I could then set up some form of automation behind that. So, it could be something simple, like just getting a text message to my Telegram, or we could have automatic failover, where as soon as that alert fires, it takes an action.

So, let's say that it tried to run a skill, and it couldn't do that because Anthropic was down. So, then we would just have a simple if statement that if Anthropic was down, then we need to send it into Codex, and that's where it comes into our magical VS Code environment, where we have our API key set up, and this thing is ready to go. It's inside our AI operating system already. It has access to all of this. We've tested the skills. We know they work. So, all that thing needs to do is send the exact same headless command to Codex that it would normally send to Claude, and it will then go and run the skill. And we know that this thing is listening, ready to go as soon as it needs to. I think auto failover is only going to be needed for really, really important workflows.
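That "simple if statement" might look something like the sketch below. It assumes the monitoring layer hands over a mapping of component name to status string, that `"operational"` is the healthy value, and that the headless commands are `claude -p` and `codex exec`; all three are assumptions to check against your own setup, and the function names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, List

PRIMARY = ["claude", "-p"]   # assumed headless command for the main provider
BACKUP = ["codex", "exec"]   # assumed headless command for the backup provider


def affected(component_status: Dict[str, str], needed: List[str]) -> List[str]:
    """Return the needed components that are not reported as operational.

    A component missing from the status mapping is treated as unhealthy.
    """
    return [c for c in needed if component_status.get(c, "unknown") != "operational"]


def pick_command(component_status: Dict[str, str], needed: List[str]) -> List[str]:
    """Choose the backup command if anything this workflow needs is degraded."""
    return BACKUP if affected(component_status, needed) else PRIMARY


def run_skill(
    prompt: str,
    component_status: Dict[str, str],
    needed: List[str],
    alert: Callable[[str], None],
    execute: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Alert on degradation, then send the same prompt to whichever CLI is chosen."""
    down = affected(component_status, needed)
    if down:
        alert(f"Primary degraded ({', '.join(down)}); failing over to backup")
    cmd = pick_command(component_status, needed) + [prompt]
    execute(cmd, check=True)
    return cmd
```

Because only the components listed in `needed` are checked, an outage in a part of the service the workflow does not use will not trigger failover. The `alert` callable could post to Telegram, Slack, or email.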
Realistically, as soon as you get an alert, you could just come in here and run it manually. It's not the end of the world. You also have to factor in things like how long Claude is down for. Now, some of the outages are up to 8 hours. The average is probably about 45 minutes or less. And again, it's not every single model. So, a lot of the time you might not even feel this kind of turbulence. But having said that, you need to do fire drills, so you need to test this stuff. After you've tested your builds and you've checked that everything works, what you can do is actually block your network access to Claude's service specifically, and then you can easily just see if the auto failover works. So, I'll go into that concept in another video because this is going to drag on for too long. I'll definitely stash the information down in the guide below, along with prompts that you can use to set all of this up for your environment.
And any of the step-by-step information that you need to make sure that you are not vendor locked in, and that you have some form of failover whenever you need it. As long as you've been following my other AIOS builds, you should not be locked into any one vendor, and you should also have your environment set up in a way that it will function with any model. The final thing here is that instead of a CLAUDE.md like we have specifically for Claude, obviously AGENTS.md is the open structure that we would be using for Gemini or for Codex.
So, just make sure that you have a spare one of these stashed inside your environment over here, and then it will seamlessly switch between agents. You can see my AGENTS.md has the exact same stuff as my CLAUDE.md. It's an exact clone of that.
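Keeping that clone in step by hand is easy to forget, so here is a small sketch that copies one file over the other whenever they differ. It assumes the files are named `CLAUDE.md` and `AGENTS.md` and sit in the same project root; the function names are hypothetical, and `demo` only exercises the sync in a temporary folder.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def sync_agent_instructions(
    root: Path, source: str = "CLAUDE.md", target: str = "AGENTS.md"
) -> bool:
    """Copy the source instructions file over the target; return True if it changed."""
    text = (root / source).read_text(encoding="utf-8")
    dst = root / target
    if dst.exists() and dst.read_text(encoding="utf-8") == text:
        return False  # already an exact clone, nothing to do
    dst.write_text(text, encoding="utf-8")
    return True


def demo() -> Tuple[bool, bool]:
    """Run the sync twice in a throwaway folder: first call writes, second is a no-op."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "CLAUDE.md").write_text("# Rules\n", encoding="utf-8")
        return sync_agent_instructions(root), sync_agent_instructions(root)
```

Running this from a pre-commit step or a scheduled job would keep the two files identical, which is the state described above.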
Other than that, you shouldn't have any problems. If you have any comments though, put them down below and I'll get back to you as soon as possible.
Otherwise, check out the videos on the screen now. They'll definitely help you on your journey. Thanks very much for watching. See you guys later.