This analysis provides a rigorous architectural framework that demystifies the transition from simple LLM prompts to production-ready, autonomous engineering systems. It correctly identifies that the true challenge lies in stateful orchestration and secure execution environments rather than the model itself.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Design Claude Code Like a Senior Engineer
Added:Let's design an AI code execution agent like the one that powers cursor or cloud code. These tools are now part of every software engineers daily workflow and companies are also starting to include AI native systems like this one in system design interviews. So we're talking about the cloud code extension version or the CLI version and the cursors agent mode in the code editor.
What we'll do is we'll start with the design requirements first to see what we are building in this system design interview or overall for the system.
Then we'll come up with the highle design on how we are going to connect the components together. Then we'll scale it to handle the kind of traffic that cursor and cloud code are serving for the users. And lastly, we'll cover security, how to handle the guardrails and how to handle the failures in this type of system. Let's get started.
Before getting to the high level design, let's start with the design requirements to see what is the challenge that we need to build here. So, we are not building the whole code editor like VS code, cursor or any code editor that you're using. What we are building here is the agent mode panel where you can describe a task and then the agent writes, executes and reads the output and also iterates until it gets it right. What this means is the agent mode for example in cursor where you have on the right side panel and specifically the agent mode where you can select delegate a task to it. It generates the code inserts it into the codebase and tests the output or it can also be cloud codes extension or CLI where you again describe a task here and then get the output generated by the orchestrator that you have here. So this includes two type of requirements functional and non-functional. Let's start with the functional requirements which are the functionalities that this agent mode needs to have. First of all, of course, it needs to accept a task input. So as a user, you describe a task in natural language and you use the extension or CLI or the right side panel in cursor.
Then it needs to also have codebase understanding which means the agent can read files search through codebase and build context based on this for the project structure. Next part is code generation and editing. This is where you need to use the LLM to generate the code and agent can also write the new code and edit existing files in the codebase. Apart from adding and generating the code, it needs to also be able to execute it. So it can run commands like build, run tests, start the app, install dependencies, and check the outputs that it generated.
And in case it's a difficult feature or bug that needs to be fixed, it needs to have an iteration loop because it might not get it right from the first time. So if tests fail or the build breaks or the code it generated didn't work, the agent needs to be able to read the error and try another way to fix the issue. We also need to have checkpoints. So if something broke by the changes from the LLM, users like developer needs to be able to roll back to any earlier state of the session so that it recovers the state where the code was working. And an important part of this is permission control because some actions need user approval before they can be executed like deleting files, running shell commands. Of course, you don't want the agent to accidentally drop your entire database from staging or something like that. You need to have some guard rails for this to control the permissions. And that also ties to the first requirement of non-functional requirements which is the security. So when running this agent, you can run it either in local environment which is your cursor editor let's say or it can be cloud or background agents where this is running in an isolated environment in the cloud.
In case it's local the commands run directly on users machine with the users permissions. So the agent can touch the real file system and the database that you have locally and also your real credentials from the cloud provider. So here the security boundary is the policy that you need to have for running the commands and in case it's cloud environment like background agents then execution happens in an isolated sandbox with limited file system so that it has limited network and resource access in the cloud environment. The other nonfunctional requirement is the latency. So we need to be able to provide the first token of the response within one to two seconds. Of course, full tasks can take minutes and sometimes even hour, but the user must see progress the whole time and see what the orchestrator is working on. It needs to also work reliably. So if a crash happens mid task like it was a feature or bug fix, this should not leave the codebase halfedited. The changes by the agent need to be atomic recoverable and it also needs to be able to continue from where it left if it crashes in the middle of an operation and it needs to be of course cost efficient because you can design this in a way that it burns a lot of tokens and it will be a design constraint in that case. In terms of scalability, we need to design it later for cursors or cloud codes scale. So we'll try to add mechanisms of scaling the system after we have the highle design and we need to have context limits. The system must pick the right slice of code to show to the model.
Otherwise if we start dumping all the code into the LLM again we will hit the cost requirement issues that we will be burning tokens for no reason. And as a last requirement, it needs to be provider resilient, which means we shouldn't rely on one LLM only. If you again look back to how these models are working in cursors agent for example, you can select Sonet, you can select chat GBTs, OpenAI's GBT model, Gemini's pro model and many other models that you'll be able to choose between. And in case some of these fails or let's say Antropics API becomes unresponsive, we need to be able to switch to another provider so that the system keeps on working and executing the task. After clarifying the requirements, we can now start the highle design for two scenarios specifically in this case. So first one is if you're running this in a local machine and then the second will be if you're using background agents or running this in a cloud. So let's start with the first one which is if you're running this in your local machine on your laptop. The starting point here is the agent panel which is the panel in cursor or it can be the terminal version of cloud code or it can also be the extension. So it doesn't matter. This is the userfacing client that the user is interacting with. It is mostly used for capturing the task that we delegate to it to show the streaming output and render the difference in the file changes when we have it from the orchestrator. So this is a thin layer compared to others because it has no super complex logic. This is just the userfacing part of the application. Once you submit the task, this is where the user task is forwarded to the orchestrator agent. This is where majority of the logic lives of this type of system. Orchestrator is the agent runtime that owns the entire loop from the planning phase to delegating to iterating until it solves the issue.
This manages the conversation history, decides when to stop and also manages the context. So when history gets too long, it compacts all steps into a summary and starts a new chart with the LLM model. Now where this runs is a design decision. This can be either in the cloud or it can be also in your local machine. In case it's a cloud orchestrator like the cursor's backend agent, this keeps the loop server site and pushes the tool execution to the client over a persistent connection.
This gives you durability and centralized control in the cloud. But the downside is every file read crosses the network and raises privacy questions. So it's a trade-off between having durability and centralized control over the network bandwidth and privacy issues that you will have in case it's running in the cloud.
Orchestrator also uses context engine.
This context engine is what decides what the model is. the LLM model like cloud or GBT for example when you index your codebase for the first time when you drop it into cursor or cloud code it indexes it first time maps the repo like the file tree function signature scrap and lets the agent read the files on demand after that this is one of the strategies for also optimizing the tokens because we are embedding index for semantic search like where is the authentic application handled where is the order processing handled in case it's an order processing system as an example. It also has its own session store for history and checkpoints. So this is what will make sure that we have checkpoints throughout the code like a local file or DB depending on local versus cloud mode. If it's in cloud then we have it here as a session store and we can revert back to the previous version if let's say the recent changes broke something in the codebase. Another strategy for first of all saving tokens but also keeping the orchestrator context clean is to use sub aents and most of the modern tools are using sub aents. They are not working as a single orchestrator or agent. So you have your main agent which is the one that has the overall context about what you delegated, what is the planning phase, what needs to be done and then you have sub agents like there is a piece of code that needs to be generated. In this case orchestrator will just delegate it to this sub agent to generate the code.
Both the agent and sub agent need to be using some sort of AI gateway right to connect to these models like OpenAI, Antropic, Google. So to plan and generate this your agent and also sub agent connect to some sort of AI gateway that you need to set up. So one option here is you can sign up to open AAI get an API key and use it to set up the SDK or direct API calls and same for entropic for Google and so on. We are using multiple type of LLMs first of all for variety. If it's a complex task, you can delegate it to set or fabulous.
And if it's a small change, sometimes you can use small models like GBT mini models and so on. And also it gives you that resilience in case one of these fails, you can switch it to some other model to keep it running. So in that case it makes sense to have some AI gateway centralized where you're connecting to all of these LLMs in a single place. So a good option that you have here is the versels AI gateway which uses LLM calls with failover and it gives you one endpoint to access all these models. The way this works in the codebase is you install the AI SDK and then you can just switch the models like this like OpenAI's GBT 5.5 or Sonet's fable or any other model that you want to use. You just switch it easily like this. And the benefit is it doesn't add any extra fees on top of what Google Antropic or OpenAI charges. So you're still getting built the same price but you're using a centralized dashboard for accessing all these LLM agents. Now once the code is generated with sub agent and the planning delegation happened you need to be able to insert this into the codebase to try the changes. Right? So this is where you use the task executor.
You delegate this to the permission scope layer first. This is part of that security. If you're running it locally, you need to scope the access. Like you can only run commands within this repository that you're working. And if it needs to run commands outside of that, it needs to ask for user permission. And same goes for running commands for the first time, writing to the files, usually first time it asks you whether you allow to run commands, write to the files and so on. And after that you can set it to be either automatic or you can approve or reject every time. So in case you approve it can start reading files. It would use this first as well to understand the codebase with that context engine for example when it's indexing your codebase for the first time. But also after that it might read the files to see the current state. So in this case it interacts with your codebase. It might try to run tests in which case it interacts with the codebase plus the terminal to run the tests and for running commands as well. Of course, it interacts with your terminal. So now, let's say you need to insert the code.
It will use the write file function or tool that you have here. So you can have many tools here available. These are just some examples for the agent. And if you allow, what this will do is just write to the file system. It will modify the codebase where it's necessary to achieve the functionality that you described. And then it will run the commands like it will run the app. If it's a front end app, it will run the backend service and then test the endpoints. So after that, it will run the tests either your local tests or it will come up with its own tests to validate the results. The major security issues here is that all the commands run directly on the user's machine, your local machine, and which means there is no sandbox, no isolation of the environment. Because of that you need to always ask for permission scope layers that sits between the agent and the machine. So to first verify whether we can run these commands, whether we can modify the files and then only use the tools to modify things in this machine.
Apart from that of course the orchestrator will keep on iterating. So in case let's say the tests fail or it wrote to the file system but couldn't achieve the result, it will keep on iterating until it gets it right or until it hits the rate limit. And it will know about it by getting notifications from the tool calls like if it called the write file, if we couldn't write to the file, we get the error message to know what to do next.
If we ran the test, let's say, and then it either failed or succeeded, we get the output here to see if it succeeded.
If it's all good, if not, then we'll keep on iterating. Now, that's in case you're running it on local machine in the developers, engineers environment.
What if we are using the cloud version and this is more popular among VIP coders who are just delegating it to the cloud version. they don't run it on their own machine and then you need to have some sandbox to run this and validate the results. So the starting point is again the same. You can use either the background agents in cursor or cloud codes web version to delegate tasks to it. This is the client where we accept the input. After that point, it connects to the API gateway where you have the authentication service connected to the API gateway to see if this user is authenticated first of all and then what plan they are using. Are they on a pro plan on pro max plan or whatever they're using and hence we will also get the limits for the user to proceed to the orchestrator service. If they are authenticated and they can use the pro services then we proceed to connecting to orchestrator service. This is where again all the loop happens starting from the planning to delegating to sub agents to iterating until the task is complete. Again here we connect to the same AI gateway. So connecting to open AI entropic Google and any other LLM providers you want to use. Apart from that what we will have in the cloud is radius for the session cache that keeps track of active sessions and also caches the responses and you of course need also a durable storage like postgrql for the execution history.
These can be used for the previous chats. So the ones that you're actively working on they will be stored cached in the radius and the previous chats will be stored in a persistent storage. Now one thing that's majorly different from the previous version is you need to run the code and in the previous version you could run it on your local machine but here we don't have a machine anymore. So we need to create some VM to run it and ideally it will be an isolated environment like the sandbox in Versel where you have the safe code execution environment and you use the same tools that we had there like writing files, reading files, running tests and you can have AI generated code that you insert directly to that sandbox environment and then run it because it is safe. It's in an isolated environment. Even if something breaks, it only breaks in this sandbox and not the entire machine that you're using. Now, what if we need to scale this to handle millions of requests and not only single request or couple of users for the agent? So, agent will still remain the same no matter if we are using the CLI locally or we're using the cloud version. This will connect to the API gateway first where authentication happens. Then we use the token limits per user based on the authentication and the user details.
What we'll add here is first of all a message queue. This is where new session will be created for every task because especially in the free tier if you remember there were limits where you delegate a task to it and then it puts you in the queue because a lot of people are using this. So we prioritize the pro users first and then the free users who are trying to use it they get into the queue but also for the paid users if they are using a lot of most in demand models then we still might put them in a queue to make sure they get executed in order and then we'll add a service to consume these Q messages like a worker pool that consumes cute requests and then spins up an orchestrator for each of the messages. is so this is where we'll use the same orchestrator but we can start also replicating it like we're doing it with servers. You can have multiple orchestrators and the goal here should be to try to keep them stateless so that every next action can be delegated to any of these orchestrators in the pool. And the way we can achieve it is by moving the session state out of the orchestrator into a durable workflow layer. Like if we again stick to versel as a cloud provider that we will use then there is another service which is called versel workflow here each of the agent step is a durable step. So if an orchestrator instance dies let's say this one dies and it was working on some task in that case any other instance can pick up the session from the step let's say 30 where it died and it can keep on working on the request that you had continuing from the same step instead of starting from zero. Now this orchestrator will still use sub aents for having isolated context. For example, if you're using cloud models, you have let's say 100K or something token limit, right? If it's Gemini, it can be up to millions. But still, there is a limit to the context that you can use. And once this runs out of the token limit, it needs to either summarize or it just keeps on working and forgetting the previous requirements that you gave to the agent and that starts to degrade the answer qualities. So that's why we are using sub aents and sub agent gives the summary only back to the orchestrator so that we are not again overwhelming it and not burning the tokens. No matter if we are using sub agent or the orchestrator all the tool calls here we'll be going through a guardrails layer to check the type of tool call. For example, if it's just read only like reading the file system, it can be automatically allowed. If it's something reversible like running npm run build for example, if it's a react app then again it can be automatically approved based on the initial permissions that the user has set up.
But in case it can be something irreversible. There is always human approval needed like if you're trying to delete something from the database or if you're trying to force push something to the GitHub. All of these type of operations need human approval and this is where we will have guardrails layer and it can operate as a code instead of having an LLM because it can hallucinate. So here it needs to be like a code function or service that checks whether the operation contains any of the irreversible patterns that we have set up in the codebase. So in case it is approved or it is manually approved by the user again here you will have either the local machine where you need to execute or the cloud environment. Let's start from local machine. In this case we use the permissions scope layer. So user sets generic permission rules for the whole codebase like it allows reading always it allows running commands npm commands always but let's say for some other specific operations you need to ask for approval assuming it is approved or it was in the automatically allow list operations then we get access to the tools next that we need to run let's say it was a read operation then the tool of read file gets fault and we read it from the file system or if it was running command then we start running it in the terminal. So in this case, user is the sandbox, meaning the laptop or the computer is the sandbox where we're running. That's why we need to be extra careful in this case. And the other option is if you have the cloud version. So if we again stick to the same cloud, then we can use the sandbox pool here for spinning up multiple microVMs and have one sandbox per session. So in this case we can also use the fluid compute service from Versel where you get build for active CPU only because in most of the cases these instances will be running but they are not being used and you're still ending up paying for these VMs because you're running them and getting built per hour. And in both cases whether it's the local environment or the cloud environment we need to have a checkpoint layer that shadows git. And this is basically a g version history but for the agent itself. So in case it broke something in the last conversation that you did with it, you can have a checkpoint like that reverse button that you have in the cursor and you click on it and you reverse to the old version of the code where the functionalities were working. Especially after you keep on chatting with it and the context limit gets filled, it can start hallucinating, forgetting the old requirements that you gave and then start breaking the code in which case you need to have a way to go back in the checkpoint layer and this is stored in a storage service. So we need to have all the conversation logs because you need to be able to access all your previous history and also snapshots. And another thing we can use here to reduce the costs of infrastructure is some sort of call storage because not all of these conversations are important for the user. If they have 100 previous chat history with the agent, they are most likely not interested in the previous like older 90 conversations, right? They are interested in the recent 10. They might browse through these to see what they were working on recently. And then the other 90 we can just pull them as needed. And in that case we can just store the old conversation in a blob storage and only keep the last 10 available for the user to access. And the last layer here is when the orchestrator connects to the AI gateway and then it connects to all the available LLMs that you're using. You can also add a prompt cache layer here because the system prompt and early conversation are identical across the 50 steps of a task. So we can catch the prefix and cut the cost and also latency massively in this case. So this is one of the biggest cost levers here that you can also have for the entire system here. And lastly, you can expect to also talk about some failure modes or in case you're building this like you need to be able to handle the failures. So what if for example agent loops forever because it can get stuck between talking with the agent running the code and it doesn't work and then next time it again doesn't work. So it never ends up working and it keeps on working forever in this infinite loop and burns the tokens. What you can have here is a hard cup on iterations for example where you detect a repeated identical tool calls and in this case you just let the user know that the agent is stuck to try some other way and instead of keep on burning the tokens. There can also be execution failures like the command hangs forever that the tool tries to run and then that server never exits and it ends up staging in the infinite loop and then the orchestrator is waiting for the command line tool to be executed. In this case you can have for example timeouts on every command and background process management to kill this session and then run it again after the timeout.
You can also have a crash in between the session which leaves the files half edited. And in the requirements we had that it needs to either pick up where it left and continue or it needs to revert back and in this case we can use atomic patch application. So one option here you can create checkpoint before every file edit and then validate it after the edit was done and in case something crashes in the middle you have this checkpoint to go back because all the operations were atomic on the files edited and one of the biggest issues which we tried to solve with the guardrails level but still agent might decide to do something irreversible like drop a table delete files outside the workspace or force push to GitHub. In this case, you need to have strict security layers like action classification in the guardrail to ask for mandatory approval for the irreversible operations. And there needs to be a list of these irreversible operations that whenever it's detected, the model never gets to execute this directly.
Related Videos
LBF101 Creating an XML Changelog
liquibase7511
3K views•2026-06-15
Alta Labs Cloud Dashboard Real time Network & Xnet Insights!
ShinyTechThings
158 views•2026-06-17
Wait... Group Policy Not Applying? Check This First!
keeplearning_iT
144 views•2026-06-15
Leetcode Weekly Contest 506 | Life's boring these days
Pudeesht
2K views•2026-06-14
microJAM: MAKING A MICRO GAME FOR A GAME JAM IN CLOJURESCRIPT AND TOTALLY NOT C
janetacarr
156 views•2026-06-18
Partitioning vs Bucketing vs Clustering: How to Make Queries 100x Faster
thedataandaiguy
194 views•2026-06-16
Linus Torvalds: AI Won’t Replace Understanding Code
SavvyNik
140 views•2026-06-19
Programming in English
MattGodbolt
584 views•2026-06-14











