AI models consistently hallucinate regardless of pricing tier, but deterministic task orchestration systems with quality gates and approval processes can significantly improve reliability. The Babysitter project demonstrates this by implementing structured workflows where AI agents must first understand the user and project context, then follow predefined methodologies like TDD (Test-Driven Development) with explicit approval gates before executing code, ensuring deterministic outcomes rather than unpredictable AI behavior.
Deep Dive
I'm NEVER Letting AI Work Without a Babysitter Again
I don't know where AI is going. And let me tell you a little secret. No one does. But there's one thing I do know.
It misses its target a lot and often.
Now, I'm a $200-a-month paying customer to OpenAI. I have been for Anthropic.
I've tried every model out there, from OpenRouter quirks through OpenCode Zen, Go, and Black. And regardless of your tier or how much money you're throwing at it, AI hallucinates all the time.
Now, sure, we all set guardrails and rules, but uh >> ignore that. AIs, they don't really follow the rules.
>> Well, what if you had a system? Not a magic wand, not a trick, an actual way to orchestrate tasks with deterministic outcomes. I don't know about you, but if someone's selling that, I'm buying. And here's the good part. No one's selling.
It's open source. It's there to help us enforce obedience and aim for hallucination-free output. Too good to be true, right? I thought so, too. So, I tested it because I'm your guinea pig.
Let's get into it.
So, A5C.AI, while presenting strong vibes of a vibe-coded landing page, sounds like they're all in on that open source project. It's by developers for developers, which is something I haven't heard in a really long time.
The premise is that everything's a black box, so there's no real audit. The context window is strangling you. And you know who says when something's done? That's right, not you. So if you took quality gates, made sure there's a loop until we meet a certain criteria, and definition of done is a real thing and not a metaphor, your chances of success change dramatically.
Now, knowing its audience, the project comes with four different options. Essentially, there's call, you can force a plan only, and then you have the ever-popular yolo and forever modes. More on that later. Babysitter gets to know you, then the project, then and only then it starts working. And just so we're on the same page, there's no data sent. It's all open source and model agnostic. You tell it what you want, approve, and you let it do its thing. Not groundbreaking on a first look, and that's why we're zooming in.
The project is on GitHub. I wanted to follow by saying on GitHub, of course, but that's not a popular stance these days. Anyway, it's made by a mysterious team called A5C, who seem like they're dedicated to this process and to supporting large organizations should they want paid assistance. I think the first three words here are enough to understand the premise of the entire thing. Babysitter enforces obedience. Taken out of context, that would sound really bad.
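The loop-until-criteria idea from the premise can be sketched in a few lines. This is a minimal illustration, not Babysitter's actual implementation: `runUntilDone`, `GateResult`, and the toy worker and gate are all hypothetical names invented here, and the only assumption is that a gate returns pass/fail plus feedback that gets fed into the next attempt.

```typescript
// Hypothetical sketch of a quality-gate loop. None of these names come
// from Babysitter's real API; they only illustrate the idea.
type GateResult = { passed: boolean; score: number; feedback: string };

interface Attempt {
  iteration: number;
  result: GateResult;
}

function runUntilDone(
  doWork: (feedback: string | null) => string,
  gate: (output: string) => GateResult,
  maxIterations: number,
): { output: string; done: boolean; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  let feedback: string | null = null;
  let output = "";

  for (let i = 1; i <= maxIterations; i++) {
    output = doWork(feedback); // the agent produces something
    const result = gate(output); // the gate, not the agent, decides "done"
    attempts.push({ iteration: i, result });
    if (result.passed) {
      return { output, done: true, attempts };
    }
    feedback = result.feedback; // failed: loop again with the gate's feedback
  }
  // Budget exhausted without meeting the definition of done.
  return { output, done: false, attempts };
}

// Toy example: the worker adds one test per attempt, the gate wants three.
let testCount = 0;
const outcome = runUntilDone(
  () => {
    testCount += 1;
    return `suite with ${testCount} tests`;
  },
  (out) => {
    const count = Number(out.match(/\d+/)?.[0] ?? 0);
    return { passed: count >= 3, score: count, feedback: `need 3 tests, have ${count}` };
  },
  5,
);

console.log(outcome.done, outcome.attempts.length); // true 3
```

The design point is that the exit condition lives outside the worker: the loop ends only when the gate passes or the iteration budget runs out, and every attempt is recorded so there is something to audit afterwards.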
So, it supports your Codex, Cursor, Gemini, Copilot, OpenCode, and yes, yes, our Pi is here as well.
Everything's tagged experimental, which is kind of funny, but hasn't everything been experimental for the past couple of years? Babysitter comes either wrapped as a package or extension with your favorite agent, or as a standalone harness. The reason is that oftentimes these can and should run on ephemeral environments like CI workers or GitHub automation bots. So you can have Babysitter do its thing without being dependent on a running agent around it.
To get the harness, you'd npm install @a5c-ai/babysitter-sdk. Once installed, you can call it directly by addressing the harness and call, which is one of the four modes we'll soon cover.
Internal is the harness we'll be using, but you can always use Claude or other options as orchestrators. Then there's the process, which is usually JavaScript. Yeah, not Markdown, actual code. And then the workspace to run in. We don't have any processes at the moment. So, let's start properly. Let's make our agent run Babysitter for us. Pi, the agent of choice and a popular video on the channel if you're interested, has it listed as one of the optional extension packages. Pi install, and we're ready to go. Fire it up, which now shows us a bunch of new options. There are too many options here. Just run Babysitter help on the CLI and you'll see what I mean.
I'm intentionally focusing here on the essentials, but we can cover the more advanced options in a future video.
Comment below if you're interested.
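To make "a process is actual code, not Markdown" a little more concrete, here is a rough sketch of what a coded process with an approval step could look like. It is only an illustration under stated assumptions: `runProcess`, `StepName`, and `RunLog` are hypothetical names, and Babysitter's real process format is not shown in this video, so none of this should be read as its API.

```typescript
// Hypothetical sketch: a process written as code, with an approval gate
// between planning and execution. Not Babysitter's real process format.
type StepName = "plan" | "approval" | "execute" | "report";

interface RunLog {
  steps: StepName[];
  approved: boolean;
}

function runProcess(request: string, approve: (plan: string) => boolean): RunLog {
  const steps: StepName[] = [];

  steps.push("plan");
  const plan = `Plan for: ${request}`;

  // Nothing runs until the caller approves the plan.
  steps.push("approval");
  const approved = approve(plan);
  if (!approved) {
    return { steps, approved };
  }

  steps.push("execute", "report");
  return { steps, approved };
}

const approvedRun = runProcess("build a task API", () => true);
const rejectedRun = runProcess("build a task API", () => false);

console.log(approvedRun.steps.join(" -> ")); // plan -> approval -> execute -> report
console.log(rejectedRun.steps.join(" -> ")); // plan -> approval
```

Because the order of steps is fixed in code, the same request always walks the same path, which is the sense in which a coded process is more predictable than instructions an agent may or may not follow.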
Taking a quick break for a word from today's sponsor, Keeper Security. Many people know Keeper Security for password management, but for developers and DevOps teams, it goes way beyond that.
Keeper is a zero-trust and zero-knowledge identity security and privileged access management platform. It lets you eliminate hard-coded secrets, stop passing credentials around in Slack, and get .env files off your local machine. It does that by embedding security directly into your existing workflow. Keeper Secrets Manager gives you SDKs for popular languages, a CLI for working with vault secrets from the terminal, integrations for tools like Terraform and CI/CD pipelines, and it supports secure secret injection for AI agents through an MCP. There's also Keeper Commander, a powerful command line tool with a new SuperShell terminal user interface for automating vault operations, managing credentials, and working with back-end systems at scale.
Because Keeper is built on zero-knowledge encryption, your data is encrypted at device level. Not even Keeper can access it. So if you're managing passwords, infrastructure secrets, remote access, or developer workflows, Keeper is worth checking out.
Go to the link in the description to request a demo and see how it fits into your existing workflow. Now back to the video.
We said it first needs to know you and then your project. So this is your first command: Babysitter user install. I picked the option to set my own details here so we can go through what kind of depth this thing goes into.
Once ready, we can start creating our Markdown files for profile onboarding.
I actually went ahead and asked Pi to boilerplate this one for me. You don't have to have everything here. Remember, we want the context quite short. So, name, specialties, and experience are welcome to level up the responses, along with preferences of verbosity. Keep it short, specific, and for the love of God, emoji free. Not sure how my profile helps here, but you know what? If by scanning my GitHub the agent can tell whether I'm leaning towards Go and Shell rather than JavaScript and Perl, we won. Now, this thing took a long time to run. Yes, I'm running GPT-5.5 high or xhigh because I figured I'd use my tokens, but this also means slow work, and with Babysitter on top, you don't really get the Usain Bolt of AI agents here. So, I did drop the reasoning to medium later just for things to move a bit quicker. But if there's one thing I think it's important for you to leave this video with, it's that we're not here for speed or quantity.
And just like the cliche, we want a proper result. So expect a bit of waiting time.
With that in mind, it's time for Babysitter to get to know our project. With every new project, you want to go through the project install process, which in this case will let it explore on its own. It'll take its time to read the code, dependencies, git history, and everything it can find to build a mental model of the project and place it under A5C. It'll ask some important questions, like what you are actually optimizing for, whether security, production, reliability, or other stuff it detected as areas of concern, like my data integrations here, for example. It'll ask about your workflow, whether PRs are critical, and it already knows about me working with Worktrunk, so it asks. Worktrunk is another great video on the channel. Go check it out.
Once you're done with the questionnaires, you'll notice more than a few interesting things come up while it's working. One being the methodologies cemented into Babysitter's brain, like debugging and testing, but wrapped in GSD, which is getting things done, a popular productivity method. There's a long list of processes. We'll take a look at them soon. After what seems like eternity, and by eternity I mean about 15 minutes, minus a few seconds here and there for me to answer its questions.
It's quite a process. Look at my context window. It's a busy one. If you want to know how things went and you're patient, it's advisable to run the doctor, which after a good few minutes will produce its own report of what went wrong and what needs another look.
Now you can start working. This can be done via ad hoc sessions like pi -p and providing the prompt inline, where we'll ask a big request. And for that, instead of call, I'm going to use yolo and see what happens. While it's doing its thing, let's build a short example to understand the mental model. We'll build a small REST API. The standard method of work is using call and asking for whatever it is you'd like built. In this instance, a task API and all the methods, followed by a request to use TDD and a quality target to pass. I want to double-click on TDD here. And while I don't have proof, whenever I asked my agent, with any model or project, to run while TDD-ing its way through, I got better results.
Now, sure, it's an N equals 1 observation, but it's not just me. And also, while I was always taught to use TDD, I've never actually done it properly.
But now, we have little coding minions and they can't refuse yet. So, let's see this. It'll start by calling the orchestrator and the skill instructions.
That's standard. It'll then print out the plan through request, endpoints, and quality gates. Then it does something interesting by producing its own orchestration code. I was a bit puzzled by this at first glance because it's TypeScript, which I did not expect, and then figured it's the agent building its own process. Now remember those processes it's using for quality and code. There's a very long list in the docs, which starts with stuff you'd expect, like TDD, building web and mobile applications, DevOps, security, and then it goes further into business areas and science. Of course, I didn't read any of these, and to be fair, I'm even quite worried about the vastness of options.
But I guess that's another discussion to have with the authors.
Meanwhile, our agent built a TDD plan and asks for approval before moving forward, which is one of its approval gates. From there, literally the first piece of code you'll see generated is main tests. It starts building the tests and only then the code that passes them. If we go through the report when it's done, we have endpoints in place. Tests run with coverage as instructed, which I actually hoped for more of, to be honest, given the leanness of the project, but that's what we gave it. There's spec, code, and yeah, tests. It's so direct with your instructions. There are zero logs when you run it. Sure, we didn't ask for any, but just to show you the level of precision, which can either be good or problematic and requires verbosity from your side when planning. Anyway, it's live. If we GET /tasks: empty array.
That's good. Let's POST empty data, and that gives a proper response, which means one test did its job already. With a proper JSON, we get it listed in the response. Now, let's DELETE, and that seems to be working as well. So, getting the tasks now: empty array. Perfect.
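For readers who want to picture the behavior being poked at here, this is a small in-memory sketch of a task API that responds the same way to that sequence of requests. It is not the code the agent generated in the video: `createTaskApi`, the `title` field, and the specific status codes (400, 201, 204) are assumptions made for illustration, and there is no real HTTP server, just a handler function.

```typescript
// Hypothetical sketch of the task API behavior walked through above.
// Field names and status codes are assumptions, not the generated code.
interface Task {
  id: number;
  title: string;
}

interface ApiResponse {
  status: number;
  body: unknown;
}

function createTaskApi() {
  let tasks: Task[] = [];
  let nextId = 1;

  return function handle(method: string, path: string, body?: unknown): ApiResponse {
    if (method === "GET" && path === "/tasks") {
      return { status: 200, body: tasks.slice() };
    }
    if (method === "POST" && path === "/tasks") {
      const title = (body as { title?: unknown } | undefined)?.title;
      // Empty or malformed payloads are rejected instead of stored.
      if (typeof title !== "string" || title.trim() === "") {
        return { status: 400, body: { error: "title is required" } };
      }
      const task: Task = { id: nextId++, title };
      tasks.push(task);
      return { status: 201, body: task };
    }
    const match = /^\/tasks\/(\d+)$/.exec(path);
    if (method === "DELETE" && match) {
      const id = Number(match[1]);
      const before = tasks.length;
      tasks = tasks.filter((t) => t.id !== id);
      return tasks.length < before
        ? { status: 204, body: null }
        : { status: 404, body: { error: "task not found" } };
    }
    return { status: 404, body: { error: "route not found" } };
  };
}

// The same sequence as in the walkthrough: list, bad POST, good POST, delete, list.
const api = createTaskApi();
console.log(JSON.stringify(api("GET", "/tasks").body)); // []
console.log(api("POST", "/tasks", {}).status); // 400
console.log(api("POST", "/tasks", { title: "write tests first" }).status); // 201
console.log(api("DELETE", "/tasks/1").status); // 204
console.log(JSON.stringify(api("GET", "/tasks").body)); // []
```

In a TDD flow, each of those five calls would exist as a failing test before the handler is written, which is why the rejected empty POST is described as a test having done its job.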
Okay, that's actually great. Extremely lean, precise, did what we told it to do, didn't sidestep or build a monster no one asked for. Let's see what the other crazy guy, which we basically sent into oblivion with a generic task, had done.
We have something. There's login, auth, and actions. No real structure, but fine.
Running it works eventually, and we get something, but uh, whatever this error means, I still don't know what's up here, and I don't want to spend time now debugging. I hope this is enough to demo the process, but also what's required from the user. I honestly meant this, and I think I'm actually starting to like the idea.
Babysitter, as I've started by saying, is not a magician. It's an orchestrator, but it's not your OpenClaw that's running around drinking tokens and chatting to other agents for fun. If you're a pro, it'll be a pro employee, but the user is a huge part of the process. I'll keep running this for the next few weeks and months and report back. But I hope others here can share their experience in the comments. As you've noticed, it's not worth a lot without a proper agent wrapping around it. So, I highly recommend checking the Pi agent in this video next. The leanest and meanest open source agent you'll ever find. Thank you for watching. I'll see you on the next