When AI agents need to work with realistic data (such as product catalogs, pricing, user data, or game economies), they require isolated database environments to experiment safely without contaminating the main system. By forking databases and giving each agent its own copy, developers can run parallel experiments, compare results, and selectively promote successful versions while discarding failed attempts. This approach transforms database management from a single-shot process into a parallel exploration workflow, enabling agents to freely explore and test different approaches within bounded, disposable environments.
Deep Dive
Ghost AI lets AI agents build disposable worlds
So, I've been experimenting with building these games slash LLM benchmarks recently. This has been something that's really fascinating to me, seeing how well AIs can learn a new game in real time, or having them populate a virtual world and seeing if they can make sense of it. As these projects grow, trying out experimental features gets a little bit scary. You want to try five different things. But you also don't want your project getting messed up beyond all recognition, especially once you have enough data stored that losing it could genuinely hurt. Giving an AI agent write access to your database should sound like a genuinely terrifying idea. And really, it is.
If an agent changes some code and gets something wrong, that's usually easily fixable. You can see which files changed. You can compare the new version to the old one. And if the AI agent just goes off on some random tangent in some weird direction, you can just throw away that version and get back to the old one. You don't lose the whole project.
Software teams already work this way.
You try your ideas in separate versions before deciding what belongs in the final product. But databases are different. A database is not just another file in your project. The database is the state of the world. It's your users, your orders, your product catalog, your game economy, your analytics events, your settings, your history, etc. It's all the stuff that makes the application feel real. So, if you tell an agent, go improve this product, the agent doesn't need to just change a button or rewrite a paragraph.
It may need to touch the data model to rewrite examples. It may need to try different pricing logic, maybe create test users, maybe make some mistakes.
And that's the part that feels dangerous unless each agent gets its own disposable copy of the database. And that's the idea that I want to explore in this video. Code already has a safe way to try different versions. Databases need the same kind of workflow, especially when AI agents are doing the work. So, here's an actual project that I've been working on. This is to test how well different large language models can learn. I call it Gravell GPT.
Basically, how it works is you have these four gravity wells, aka suns, and you have three little ships. The gravity, inertia, physics, all of that is realistically calculated. The LLM is tasked with writing a script that is going to control the bots. It doesn't control the bots directly. It writes the code and the logic for these little ships to fly around. The goal is for them to not crash into each other, not crash into the suns, and not even get too close to the suns and be pulled in. And they get points the longer they can stay within this moving disc. I'll give you an example. So here in blue we have Claude Code Opus 4.7 and in red we have Codex GPT 5.5 high. Let's click replay and we're going to accelerate it just a little bit. So notice it's controlling the three ships by writing the scripts: how they behave, how close they're allowed to get to the sun, how much fuel they're able to use. And the goal is to stay inside this circle as much as possible without crashing. Each time they stay in the circle, they get a plus one score. And the goal is not to crash into the other ships or into the suns, and also not to run out of fuel. The whole thing runs for 200 ticks and whoever has the highest score basically wins.
Now, these are pro pilots. They don't start out this way. Here's kind of the big trick about this benchmark. The models aren't just given one try. They're given 30. Each time they write code, it gets played out in that simulation and then feedback is given to the model. Here's what happened to your ships. Here's if you crashed. Here's just everything that happened. And with that information, they're able to rewrite their code. So, as you can see here, if the model is smart, this chart increases. Its score increases over time. That means it's learning from its mistakes and it's improving its code.
So, I've tested a lot of these models out. Since I was using the API, all of it costs money, especially for the more expensive models like Opus 4.7, GPT 5.5, etc. I tested them in a solo environment as well as a PvP environment where they both kind of have to fly at the same time. At the time of the testing, which was about a month ago, Claude Opus 4.7 reigned supreme. Everything was going great, and then I noticed a problem. Here's the thing. In the first iteration, the ships are flown horribly. They crash into each other. They explode. They waste too much gas. They just fly out of this solar system or whatever you want to call it.
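The attempt-and-feedback loop described above can be sketched roughly as follows. This is only an illustration, not the benchmark's actual code: `write_script` and `simulate` are hypothetical stand-ins for the model call and the physics simulation, and the scoring is simplified to one point per tick spent inside the disc, which is an assumption about how "plus one" is counted. Crashes and fuel are ignored.

```python
from typing import Callable, List

TICKS_PER_RUN = 200  # from the video: each simulation runs for 200 ticks
ATTEMPTS = 30        # from the video: each model gets 30 tries


def score_run(inside_disc: List[bool]) -> int:
    """Assumed scoring rule: +1 for every tick spent inside the moving disc."""
    return sum(1 for inside in inside_disc if inside)


def run_benchmark(
    write_script: Callable[[str], str],
    simulate: Callable[[str], List[bool]],
    attempts: int = ATTEMPTS,
) -> List[int]:
    """Ask for a script, simulate it, feed the outcome back, and repeat."""
    feedback = "first attempt: no feedback yet"
    scores: List[int] = []
    for attempt in range(1, attempts + 1):
        script = write_script(feedback)   # hypothetical model call
        inside_disc = simulate(script)    # hypothetical physics run
        score = score_run(inside_disc)
        scores.append(score)
        # The feedback string is the only thing carried between attempts.
        feedback = f"attempt {attempt}: scored {score} of {len(inside_disc)} ticks"
    return scores


def toy_model() -> Callable[[str], str]:
    """Toy stand-in for a model: gets a little better after each feedback."""
    state = {"skill": 0}

    def write_script(feedback: str) -> str:
        state["skill"] += 5
        return str(state["skill"])

    return write_script


def toy_simulate(script: str) -> List[bool]:
    """Toy stand-in for the simulator: stays inside for `skill` ticks."""
    skill = min(int(script), TICKS_PER_RUN)
    return [tick < skill for tick in range(TICKS_PER_RUN)]


scores = run_benchmark(toy_model(), toy_simulate)
print(scores[0], scores[-1])  # 5 150
```

In a loop like this, a learning curve is only meaningful if the feedback is the sole channel between attempts. Anything else that leaks into the model's instructions changes what the scores measure.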
Here's an example. This is the first attempt of Claude Code Opus 4.7 and GPT 5.5 high.
Notice something blows up right out of the gate. All of them fly out of bounds.
They come back way too high. Barely touch. They're just all over the place.
They're wasting too much gas. They're just out of control. This is iteration number one. Now, compare that to when they're like 20 or 30 iterations in.
Notice they don't try to follow the circle. They kind of wait and let it come to them. They don't use a lot of gas. They're not blowing up. They're not getting too close to the suns. They're just using small little bursts of gas to try to line themselves up. These are pro pilots that have figured out what they're doing. They've earned their stripes. They've earned their experience points and the score reflects it. They went from getting something like 20 points per round to, in this example, 100 plus. You could see the learning rate happening in real time. It was amazing. I had the AI agents running sometimes through the night just testing stuff out and updating everything, updating the database. But then one morning, I noticed something weird. All the models, including some of the smaller ones, the ones that I would assume wouldn't work as well. They all started doing really well right out of the gate. Some of them would clock in at 300 points by the time it was over, but there wasn't a gradual improvement. They just started out good and they stayed pretty good. What happened was somewhere along the line, one of the AI agents that I was using, it decided to add to some code somewhere or to some instruction that I was using for the API calls. It decided to add a hint. It decided to tell these large language models like, "Oh, by the way, here's a great place to start. Here's some kind of like code to get you started." That code was the best found code from all the experiments that we were running. It completely defeated the whole point of the benchmark. It corrupted the results.
It corrupted the leaderboards and it wasted quite a bit of money. One AI agent derping out without me realizing was extremely painful. Now, one of my earlier videos was sponsored by Ghost.
They reached out. They wanted to do a sponsorship. Their idea was simple.
Create databases for AI agents to be used almost like scratch pads. Spin up three different databases. Get three different agents working on them. Just keep the one that worked, and the other ones you can throw in the trash. And after doing that earlier sponsored video, I decided to actually do a full video that's dedicated to Ghost and what they do. This is a sponsored video. Thank you to Ghost for sponsoring this video. But I did want to showcase them because if I had known that this existed just a month ago when I was doing this, my whole life would have been different, or at least the frustration that I felt when this whole thing happened would have been a lot less. So Ghost is Postgres built for the way that AI agents actually work. The important thing to understand here is that Ghost isn't just another place to host your database. The interesting thing is that Ghost is designed around the command line and MCP, because the person using the database might not be a person at all. It might very well be an AI agent. An agent can create a database. It can inspect what's inside of it. It can run queries. It can fork that database. It can run an experiment on it. Then it can just delete that fork, or it can keep it for later review. Ghost also has a very generous free tier: unlimited databases and forks, and one terabyte of storage.
There's no wait list, and they have hard spending caps, so a forgotten experiment is not going to turn into a surprise bill at the end of the month.
So if you're building with Claude Code or Cursor or Codex or any other agentic coding environment, this is the part that matters. Ghost gives the agent a real Postgres database that the agent can manage through MCP. It's super easy to set up. You install it with one flow.
Use this curl command here, type it in, hit enter, and it's Ghost time. You can choose to configure the whole thing here. Let's walk through that now. We'll hit enter to confirm. It'll give you a link and an authorization code to get authorized. Select which MCP clients you want to install. We're going to be using Codex today, so we'll select that and hit enter. Or if you prefer, type ghost login and that will get you logged in.
And then ghost mcp install. And that'll take you to this screen where you can select what you're going to be using. Again, I already have Codex configured and ready to go. But you can do Claude Code, Cursor, Gemini CLI, Google Antigravity, Kiro CLI, VS Code, Windsurf, etc. Everything is spelled out on the website, ghost.build, and there are some simple commands for creating the databases, for listing them out, etc. But we can do all that through Codex or whatever agentic coding tool you're using. So now, moving forward, when I build stuff like this where AI agents can mess with my databases, I make sure the first step is that it actually creates multiple Ghost databases, or Postgres databases, and then does the experiments separately. So Codex created two separate Ghost databases and seeded both with the same schema but different economy conditions. As I'm creating different experiments, I make sure to tell Codex to create several different databases.
Here, for example, it creates three and spins up three separate workers. So, three sort of subagents that each work on their own site with their own database, etc. Each agent is responsible for its own version of the website, with its own database and its own assets. This, by the way, is my version of AI Village.
It's not quite complete, but this time I'm not going to lose a bunch of my efforts because of one AI agent derping out and doing something silly. Here, by the way, is kind of what that's looking like. It's not fully complete yet.
There's a number of different villagers, each going about their day. There's a time. You can play it fast, slow, whatever. And four of them are actually being controlled by different LLMs.
The big point here is that your AI agent can create databases from the terminal or through MCP. But instead of talking about Ghost as just a database management product, I want to focus a little bit on the bigger shift. What changes when your database can be copied, tested, compared, and thrown away as easily as an early draft? We have gotten used to AI editing code. Not perfectly, not without review, but the basic workflow makes sense. You ask an agent to make changes. It edits the files. You check what changed. Maybe you run tests, maybe you reject it, and maybe you keep it. This works because code is easy to separate into different attempts. But the moment that the agent needs realistic data, things get awkward. If you're building an e-commerce site, the agent needs products, prices, coupons, checkout states, abandoned carts, analytics, maybe customer segments. If you're building a game or something that looks like a game, the agent needs rooms, enemies, loot tables, player progression, play test scores, balance settings, etc. This is not just code.
That is the world that the code is operating on. And if every agent is working inside the same database, they start stepping on each other. One agent changes the pricing table, another changes the product copy, and a third changes the analytics events. Now the page looks different, but you don't know which idea caused it.
Even worse, you cannot compare the attempts cleanly. You don't have three separate versions. You have one messy version with three AI agents' fingerprints all over it. The Ghost mental model is simple. Start with one base database, fork it multiple times, give each agent its own copy, and then let them work in isolation. The code and app stay the same, but the world underneath the app is different. When agent A opens the app, it sees the world it created. When agent B opens the app, it sees its own world. Same interface, different database. That is the core idea. If AI agents are going to be doing real work, database state needs a safe way to branch. This matters because agents are not just calculators. Each one is its own search process. You often don't want to take the first answer. You want several plausible attempts. That is especially true in creative work. If you ask three people to design a game level, you don't expect all three of them to create the same map. The valuable thing here is that you can run several attempts, compare them, and just keep the best parts. In a game workflow, you can run play tests against each fork. You could score completion time, difficulty spikes, balance, replayability, or whatever matters for that specific game. Then you can keep the best fork or send it to a human for review. The key is that keeping that fork is a decision. The agent doesn't just pour changes into the main database. It works in a separate version. You inspect it. You compare it.
Then you decide what moves forward. That is the difference between chaos and experimentation. Chaos is three agents writing into the same world.
Experimentation is three agents writing into three separate worlds. "Chaos is a ladder" sounds great on paper, but in practice, not so much. I think this is the mindset shift. Disposable does not mean careless. Disposable means bounded.
It means the agent is allowed to explore because the damage is contained. If the level is bad, delete the fork. If the data structure is wrong, delete the fork. If the agent creates nonsense test data, delete the fork. But if one fork is surprisingly good, now you have something very real to inspect. You can run the app against it. You can compare it to the original. You can ask another agent to critique it. You can run tests against it. You can determine if it makes sense to move forward with it.
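To make the fork-and-delete idea concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in. It is not Ghost's API and not Postgres: the `fork` helper, the `prices` table, and the agent names are all made up for illustration. It only shows the isolation property the video describes, where one base database is copied once per agent and a bad copy is thrown away without touching the base.

```python
import sqlite3
from typing import Dict


def make_base() -> sqlite3.Connection:
    """Build a tiny base database (hypothetical schema for illustration)."""
    base = sqlite3.connect(":memory:")
    base.execute("CREATE TABLE prices (item TEXT PRIMARY KEY, cost INTEGER)")
    base.executemany(
        "INSERT INTO prices VALUES (?, ?)", [("bread", 3), ("sword", 40)]
    )
    base.commit()
    return base


def fork(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the whole database into a fresh, independent in-memory database."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)
    return copy


def prices(db: sqlite3.Connection) -> Dict[str, int]:
    """Read the prices table back as a plain dict."""
    return dict(db.execute("SELECT item, cost FROM prices ORDER BY item"))


base = make_base()
forks = {name: fork(base) for name in ("agent_a", "agent_b", "agent_c")}

# Agent A tries a new price; agent B makes a mess; agent C changes nothing.
forks["agent_a"].execute("UPDATE prices SET cost = 5 WHERE item = 'bread'")
forks["agent_a"].commit()
forks["agent_b"].execute("DELETE FROM prices")
forks["agent_b"].commit()

# The bad fork is simply discarded.
forks.pop("agent_b").close()

print(prices(base))              # {'bread': 3, 'sword': 40}
print(prices(forks["agent_a"]))  # {'bread': 5, 'sword': 40}
```

The base database never sees either change, which is why deleting a fork is a safe default.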
This is what makes this different from just letting an agent randomly change things. The agent gets freedom, creative freedom, but it's inside a safe container. And when the work is done, you either throw that container away or promote it. So, as I've created three different takes on this AI village, they have different names, different layouts, different data. I also wanted these agents to create a unique landing page for each. So, this is version number one. One agent did this with its own database based on one version of the game. This is agent two with version two and agent three with version three.
Notice how it actually has a replay of the state of the game within it. So imagine you're in a situation where you want to create a few different versions of a website or a landing page, whatever the case may be. You want to test a few different directions. Maybe one version is more direct and practical, and one is a little bit more editorial. Today, these AI agents are pretty good at making a surface-level page. They can write the copy. They can change the layouts. They can generate all the components. But real product pages aren't just static HTML. They connect to product catalogs, pricing tables, plans, checkout, copy, analytics, events, etc. So if an agent only changes the surface level of the page, it's working on a fake version of the product. The more useful version is this. Each agent gets its own copy of the product database. The different agents try different approaches. One is the simple, calm one. One is the more aggressive one. One has multiple experiments and split tests running within it. None of these is automatically the right answer. It gives each agent the chance to try stuff out.
Each agent can change the records behind the page without contaminating the other agents' work. Same app, same website shell, different database worlds. And because these different versions and different changes live in separate databases, you can evaluate them cleanly across the board. So this is related to A/B testing, but it's not exactly the same thing. A/B testing usually happens when you have real versions ready to go and you're ready to send real traffic to them. The agent fork happens earlier.
This is the messy creative stage where AI agents are creating possible versions before you decide which one of those has the potential to actually be shown to real users. Most of these will probably never reach the end stage, the production stage. And that's kind of the point. You want agents to be able to explore versions that are too messy, too speculative. They're just not right for the main database. So the loop is simple. Copy the database. Let the agents explore, score the result, then promote the useful version or delete it.
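The copy, explore, score, then promote-or-delete loop can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how Ghost works internally: the "database" is a plain dict copied with `copy.deepcopy`, and the three agents, the `balance_score` function, and the threshold are hypothetical.

```python
import copy
from typing import Callable, Dict, Optional, Tuple

World = Dict[str, int]
Agent = Callable[[World], World]


def explore(
    base: World,
    agents: Dict[str, Agent],
    score: Callable[[World], int],
    threshold: int,
) -> Tuple[Optional[str], World]:
    """Give each agent its own copy, score the results, promote the best."""
    # Copy: every agent works on a private copy of the base world.
    forks = {name: agent(copy.deepcopy(base)) for name, agent in agents.items()}
    # Score: evaluate each fork independently.
    results = {name: score(world) for name, world in forks.items()}
    best = max(results, key=lambda name: results[name])
    # Promote the best fork only if it clears the bar; otherwise keep the base.
    if results[best] >= threshold:
        return best, forks[best]
    return None, base


def cautious(world: World) -> World:
    world["difficulty"] += 5
    return world


def bold(world: World) -> World:
    world["difficulty"] += 25
    return world


def reckless(world: World) -> World:
    world["difficulty"] *= 4
    return world


def balance_score(world: World) -> int:
    """Hypothetical play-test score: best when difficulty is near 50."""
    return 100 - abs(world["difficulty"] - 50)


base = {"difficulty": 30, "rooms": 3}
agents = {"cautious": cautious, "bold": bold, "reckless": reckless}

winner, promoted = explore(base, agents, balance_score, threshold=90)
print(winner, promoted)  # bold {'difficulty': 55, 'rooms': 3}
print(base)              # {'difficulty': 30, 'rooms': 3}
```

Promotion here is an explicit return value rather than a side effect, which mirrors the point that keeping a fork is a decision.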
This becomes more important with agents because agents can work in parallel.
Notice here in Codex, the main agent, the manager agent, says "launch one asset integration worker per site." These three separate workers, one per site, worked in parallel. Altogether, it took close to an hour. So, if this was one agent, it would have taken closer to three hours to build it out. But here there were three of them working in parallel. Well, technically four, because there's one sort of overseer agent, the main Codex agent, that's kind of putting them into production and checking the results at the end of it. A human developer usually works on one version at a time. A team of developers might have several different versions happening at the same time, but each person is still moving at human speed. Agents change that. If you can run 10 agents, the bottleneck is no longer whether somebody can type the code that fast. The bottleneck becomes: can you give those agents a safe place to work, and can you rate and review the output at the end of it? Can you compare what they produce afterwards? For code, we already have a mature workflow for this. For databases, the answers have so far been kind of clunky. You might manually copy a database. You might create a staging environment. You might use a local database. You might try to keep things synced by hand. But that friction usually means that the agent gets one of two bad options. It either gets a fake environment, meaning it's not solving a real problem, or it gets a shared environment, which means it can make a real mess. Ghost is trying to make the correct thing the easy thing.
Fork the database and let the agents do the work, then delete or promote the result. In practical terms, Ghost is managed Postgres with an interface built for agents. You still get normal Postgres databases, but the workflow is built around the command line and MCP instead of a big visual dashboard. That means that the coding agent can manage more of the database lifecycle itself. It can create a new database. It can fork an existing one.
It can inspect the structure of the database. It can run SQL, and it can delete the scratch database when it's done. And because it's Postgres, it's not a toy database that just exists for demos. The point is to give agents a real relational database, but with the branching behaviors that agent workflows need. And the other big practical piece of this is cost control. Ghost offers 1 TB of free storage, unlimited databases and forks, and hard spending caps. That matters because agent workflows can get weird fast. If you're going to let agents create infrastructure, you're going to need boundaries. The promise is not "let the agent do anything." The promise is "let the agent explore within a bounded system." The bigger point is that AI software development is moving from single-shot generation to parallel exploration. It's not just "make me one button." It's "try three product strategies." It's "design five different onboarding flows," or "generate 10 different levels, play test them, and see which one's the best." These workflows are only clean if the agents have their own isolated state, because the moment that multiple agents share state, comparison gets muddy. If two different agents change the database, which one caused the lift and which one broke the data model? Which one changed the price?
Which one created the weird edge case?
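When each agent has its own fork, questions like these can be answered by diffing each fork against the base. The following is a small sketch of that idea under simplifying assumptions: a table is modeled as a dict of rows keyed by primary key, and `diff_tables` and the `plans` data are hypothetical, not part of Ghost or Postgres.

```python
from typing import Any, Dict

Row = Dict[str, Any]
Table = Dict[str, Row]  # primary key -> row


def diff_tables(base: Table, fork: Table) -> Dict[str, Dict[str, Any]]:
    """Report rows a fork added, removed, or changed relative to the base."""
    added = {key: fork[key] for key in fork.keys() - base.keys()}
    removed = {key: base[key] for key in base.keys() - fork.keys()}
    changed: Dict[str, Any] = {}
    for key in base.keys() & fork.keys():
        columns = base[key].keys() | fork[key].keys()
        # Record (old, new) for every column whose value differs.
        delta = {
            col: (base[key].get(col), fork[key].get(col))
            for col in columns
            if base[key].get(col) != fork[key].get(col)
        }
        if delta:
            changed[key] = delta
    return {"added": added, "removed": removed, "changed": changed}


base_plans: Table = {
    "basic": {"price": 10, "seats": 1},
    "pro": {"price": 30, "seats": 5},
}
# Agent A's fork: lowered the pro price.
fork_a: Table = {
    "basic": {"price": 10, "seats": 1},
    "pro": {"price": 25, "seats": 5},
}
# Agent B's fork: added a new plan and left prices alone.
fork_b: Table = {
    "basic": {"price": 10, "seats": 1},
    "pro": {"price": 30, "seats": 5},
    "team": {"price": 80, "seats": 20},
}

report_a = diff_tables(base_plans, fork_a)
report_b = diff_tables(base_plans, fork_b)
print(report_a["changed"])  # {'pro': {'price': (30, 25)}}
print(report_b["added"])    # {'team': {'price': 80, 'seats': 20}}
```

Because each report covers exactly one agent's fork, a price change or a new row is attributable to a single agent, which is not possible once several agents have written into one shared database.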
Separate database branches preserve the experiment. This is why I think this is the missing piece for agentic code development. We already know code needs safe versions. The next step is realizing that the world behind code needs its own safe versions as well. If you want to try ghost, go to ghost.build.
There's no wait list. You can install the CLI and connect it to your agent with this one command. Type in this curl command, then ghost login, then ghost mcp install. It's as simple as that. You get 1 TB free, unlimited databases and forks, and hard spending caps so agent experiments stay bounded. For me, the interesting part is not just that the agent can create its own database. It's that the agent can create its own copy of the world and try something real and interesting inside of it, and then you can decide what survives. This is the workflow that I think we're going to be using a lot more of as agents become less of an autocomplete and more like parallel builders. Code has safe versions. Database state needs safe versions, and Ghost is building directly for that. Thanks to Ghost for sponsoring this video. Link is in the description.
If you made it this far, thank you so much for watching.