Effective AI agent evaluation requires realistic simulation environments that incorporate rate limiting, dynamic world generation, and adversarial elements like prompt injection attacks to prevent brute-force memorization strategies and ensure agents develop genuine reasoning capabilities rather than pattern matching.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
BitGN Agent Challenge: ECOM1Added:
Okay.
Hello to everybody.
And I think I should fix this with the camera.
Okay, please say something in the chat if you can see me.
And hear me.
Okay, let's see.
Okay, thanks Victor. So, you can see and hear me.
Okay, let's give a little bit of time to let more people join.
And then we'll talk a little bit more about the challenge and why this one was particularly complicated, messy, and challenging for everybody.
Me, agents, and the humans.
>> Okay.
So, welcome everybody to the BG&M e-commerce one challenge.
And >> [snorts] >> huge thanks to Calibrex one for sponsoring the challenge and helping us to shape our work, like the work that we do as a community to push state of the art towards something that is more related to closely to the business.
So, those who are new here to this challenge, you know that you may not know yet that this is not the first challenge in the series.
We have been working already for more than a year, I think, holding this amazing AI research and development community a series of challenges.
It all started as rack challenges or enterprise rack series from enterprise rack one challenge to ERC three challenge.
And in this series we were learning how to build just better racks.
Because back then racks were state of the art and the task was, "Hey, we have 100 PDFs that were loaded from the public business reports. Let's see if we can find answers to the questions."
And this work amazingly well with because with every iteration what we've done is we learned a little bit more and we revealed that. So, in the first two ERC challenges we learned that structured reasoning works, that actually schema guided reasoning can allow to have really, really lightweight agents that uh can beat the benchmarks and that was taken by the people and integrated by actual teams that built and shipped commercial agents into their architecture designs.
Uh then Enterprise Action Fleet came, which was more about simulated runtimes and the worlds. And there we've uh discovered that actually you can use evolutionary architectures uh and kind of use agents to develop agents. This is the same thing that uh Andrej Karpathy was uh talking and writing about.
Uh and then at the beginning of this year uh I quit my uh corporate job and we founded BTM to create uh a platform to make it easy.
And thanks for the hairstyle because I've been we've been working really hard for the past probably week because of [laughter] unexpected changes.
Uh and I'll uh share you uh the details with you shortly, but the hairstyle is really warranted.
Okay, uh and the idea so uh with the BTM platform was uh to make it easy uh for all teams around the world uh to get real experience with building agents.
Uh and normally if you're an individual that wants to learn uh building a agents or a team, uh it will be fairly hard for you to get some practice. And the reason for that being is that in order to work on an agent, and we've seen this pattern many, many times uh in production business cases at the enterprise, it actually takes 95% of the time and budget and effort on a project to create a proper uh data set. To create proper evolves. And it takes a lot of effort and work and uh because that kind of gets in the way, uh a lot of teams don't do that. And because they don't do proper testing, they don't do proper data sets, they don't do proper evals, uh they end up with something that works on prototypes and maybe works not as an expected uh in production. And then uh also it is not easy to maintain. There are lots of regressions. You fix one prompt, a lot of things break in other place.
So with B DM uh platform, what we wanted to do is uh one person would take the effort, one team would take the effort, and create a challenge.
Uh and that would be the 95% uh effort thing uh that encapsulates the data sets, the evals, uh and we make them available uh for public for free.
So anybody can use uh B DM platform uh and uh build an agent that will be immediately graded against these uh benchmarks and challenges.
And additional thing that we're trying to do is that we're trying to make these benchmarks and challenges as close as possible to the real business problems.
This way anybody in the world real world can play with building an agent that solves uh modern day-to-day tasks.
And in the B DM series, uh we already had a personal uh assistant competition pack one. Uh that was focusing on uh creating an agent that in open core style uh can do things for you, but at least in a reliable and productive way.
And this pack one challenge, uh it was actually a painful thing.
Uh because I've done something something really stupid.
So in pack one, we've discovered that uh Codex-driven agents uh can work really well out of the box, uh and these were the insights, and we shared these insights with everybody. Because that's the point of B DM. And now suddenly everybody, the entire community, knew that you can just take a complex benchmark, you can throw throw Codex uh in a loop, Codex CLI in a loop, or Claude in a P mode at a task at the benchmark even at just at the endpoint or the website, and it will get decent results.
So, people started using that architecture because it was very well explained in the beginning insights on the website.
You can add a little bit of sandbox. You can add a little bit of memory, and then suddenly things get much more interesting and much more fun.
But, that wasn't enough for all of the people.
Some people, and that's the amazing part, wanted to push this further.
So, some people are saying, "Okay, Codex is nice, but this is running on subscription. It might be subsidized, and I want to build something that will work on local models because this is a challenge."
So, some people are taking the hardcore out, and they're trying to build agents that don't use the frontier expensive models, but they use something that can be run on GPU and locally.
So, that's one category. Another category did something even more interesting. They said, "Okay, so Codex can beat this challenge. It is fun, but I want to do something better. I want to beat it faster." And how do you beat achieve something faster? One thing is you just throw away large language model out of the agent.
So, people here they push evolutionary architectures to the edge, and they created a setup where meta [clears throat] agent carefully studies the benchmark. It runs the challenges over and over again to see the permutations.
And then it kind of builds and maintains an agent that is built completely on string matching and pattern matching and rejects.
So, that thing will never work in the production, but it kind of uh beats the benchmark. And you can actually build on these strings and regexes uh agents uh actually their repositories. Thanks uh actually a lot for sharing that.
Uh you can build an agent that passes the benchmark uh at least back one dev uh within 15 seconds.
Sometimes uh you might need to use rust for that, parallelization, uh but it just beats it so fast. It just memorizes the benchmark in a very efficient way using uh Codex and smart agent to memorize that.
And that was humiliating.
Because as soon as I would publish I would like create a complex task, I'll publish it uh on the dev leaderboard, on the dev benchmark set, and within half an hour it would be solved.
And then people would not even know how it was solved. Uh they would just get the results. And this is kind of humiliating in a good way because I was feel uh like the competition was supposed to be like for everybody else, but at this point like everybody else was Codex and uh Claude, and I was the one competing against Codex and Claude trying to create the benchmarks and challenges uh that uh are kind of unbeatable, but at the same time they have to be realistic.
But Codex doesn't care if it's realistic or not. It just uh creates regexes. It finds patterns.
So uh for the last uh 2 weeks I've been working hard to try to kind of >> [clears throat] >> find a way to slow down the agents.
At least so that the brute forcing So these are still agents, they still work.
They still get the job done. They are not applicable for the real world uh because uh they just hardcode their uh strings, string matches, but they kind of solve the benchmark. And the idea was that BGE would be a benchmark that is as close as to the business as possible.
So while doing that uh we've uh implemented rate limiting. So uh and the rate limits will be reset when we start the competition. But the idea is that when a normal human uh, is working on the benchmarks, uh, we don't expect that you will launch the run more than 15 times uh, per half an hour.
Simply because uh, LLM agents don't run that fast.
Because you have uh, to wait for the inference and we have 100 tasks in the benchmark. So uh, you should be safe within the limit.
And we have also higher rate limit uh, and that there are no more than 100 runs per 12 hours.
Because people don't work that fast. And we're kind of trying uh, and we're actually trying to solve only a few uh, accounts, but we're pushing more than 80% of the runs uh, on the platform.
So uh, one thing that we've changed in the beginning recently and that it will be effective uh, for today is that there is a rate limiting. Uh, that doesn't apply to most of the users. Uh, but it uh, slows down significantly the agents that kind of try to brute force the competition and memorize all the variants.
And even though I create a lot of run times on the fly, uh, it is still possible to see the patterns.
Uh, the second thing that uh, we've added recently is that it's quite meta is that it is not possible to or it's not easy to create tasks that aren't beatable.
So uh, but there is one trick that I've learned from the uh, running other hackathons uh, before the AI era.
Uh, it was uh, simulations.
So one of the task families is going to be uh, interesting uh, because it is not possible to get a perfect score.
And the idea is uh, it also comes from the e-commerce. So like when we're working in a e-commerce, uh, we have to work with multiple maybe branches and warehouses.
And uh these warehouses uh will have their inventories.
And all that is modeled uh in the competition. And the idea is once in a while uh we need to rebalance our inventories so that uh the warehouses will have uh will be able to fulfill uh the demand.
But in real world, well, when we rebalance, uh we need to actually say, "Yes, we're shipping uh this packages uh to this warehouse and this package to this warehouse."
But when we're shipping the packages, uh real humans have to think, "Okay, how much uh I'm going to in which uh warehouse am I going to s- send this shipment? How exactly uh am I rebalancing?
Uh which actually way of shipping am I going to use in order to get this parcel into the other warehouse where there is a demand? Am I going to use fast or am I going to use slow? Am I going to use one route or the other one?"
And this actually has a lot of decisions. And we I've been working in the supply chain uh logistics uh for decades, uh and all that matters.
There are tons of details, but we model it uh on uh like high-level scale.
Uh so, the task for the agent uh in uh this series of the challenges ch- task in the Y-Com uh one production will be, "Hey, to your agent, we need to relocate a couple of packages between the warehouses. Uh please generate me a list of uh directions on how to relocate them."
And you have an information about which routes are available between the stores and which are the risks uh uh related to them.
And the idea is that decisions matter.
And as soon as the agent comes up with a plan, then this plan uh will be accepted as an answer. Uh and it will be handed over to a warehouse shipping simulator that runs within the ancient runtime for this particular trial that runs on a virtual machine that runs on a server.
And this runtime is actually going to simulate how this shipment plan would actually work in a realistic world world and the shipment plan will run over some time.
It is uses it is discrete event simulation and it also uses probabilities.
Because different routes, as we mentioned, they have different risks associated with them. They can be delays.
And this information is actually about the risks associated with different routes is already known to the agency in advance.
And so we take a plan that was submitted by the agent, we run for the lightweight simulation and we can already tell like how much did that run make money and how much did it cost because if a package is late, it's a penalty for every hour that it was late.
If a package actually didn't arrive because agents made it incorrect plan, that's also penalty. But if package arrives in the warehouse by the time it was supposed to arrive to, that's a revenue for the store chain.
So we can take the plan that agent created, run for the simulation and get the information about how much money did it make.
But we all know because this is a simulation, it can be lucky or not lucky.
So instead of running one simulation, I'm actually running a bunch of simulations on the evaluation and producing like average monetary gain that the agent made using this pattern.
And then it is compared to the pretty much optimal solution that is possible in this pattern I derived also from the collaboration simulations and basically how much did the agents plan get close to this perfect situation would be the score.
And obviously because we don't want to make it easy for the brute forcing rejects agents to kind of memorize the combinations. So the things are a bit shuffled with every round and we have pool of I think 10 worlds which were generated in different ways and these worlds will be shuffled randomly too.
So that's one thing that we've added.
Also additional thing that we've added to the competition. So the competition still has tools.
Although some of them might not behave exactly as they used to.
The competition still has agents MD at the root and the agents MD still tells that hey look at the other folders pay to the attention to the agents MD files in them. The agents still has been tree tool and it is advised to do a listing of all files and folders within the doc folder to get an overview of the important documents.
However there is a small change or improvement in the BGN dev world we had a fixed documentation.
In the BGN uh production so you come one production world we're are going to have slight variations.
So we have multiple worlds that are generated within the run time and which world you will get depends uh exclusively on your attempt.
Uh and within that world, some parameters will be can be changed within the documents.
So, for example, if you remember from the e-commerce dev challenge, we would have the maximum discounts available.
So, maximum discounts are still maximum discounts.
But, depending on the world that you get, the maximum discount amount might be slightly different. And it will be written somewhere in the in the documents.
Uh another example is that we have a couple of facts, and for example, name of the founder, or what was the first store that was opened. So, these things will be changed slightly between different worlds, and there are a couple of questions that actually ask might ask one of the trivia questions about the company. So, this is also a micro rack challenge.
Uh what else? So, we have slightly more complicated tasks. We have tasks that are not solvable with a perfect score because they are simulation tasks, but best agents will be able to figure out the best shipping plan.
We have still discounts discount tasks. We have working with checkout. We have frozen injection.
Oh, here's one thing.
>> [clears throat] >> Uh so, previously, the task families were usually hard fixed, and it was possible to optimize.
Uh but, if you know that in real world, people that is trying that are trying to hijack your agent, they will not wait for specific type of the request to be able to hijack that request.
So, what we've done what I've done is that every task in the challenge has a small chance that the it is hijacked and that somebody has injected a malicious prompt.
So, it is not relevant of the task family.
And because there is some redundancy in the task families, uh it means that the agent has always to be on guard uh that the incoming request, no matter how verbose it is, no matter how detailed it is, but at the end of the request, there might be prompt injection.
And the prompt injection can be for Gemini, it can be for DeepSeek, uh it can be for OpenAI, it can be just messy things. Uh it might be even Deep Seek in Chinese. Who knows?
Uh it will depend uh on the kind of luck or uh not luck that you're getting.
So, the idea is that uh it took us uh a lot of effort, but we tried to make uh Big Game uh e-commerce uh e-commerce agent tick challenge uh as interesting, as close uh to the real-world uh commerce challenges as possible.
And we're also trying to make it uh non-predictable or more diverse, so that uh people that are building their own LM agents uh and that are designing them uh without from completely to the Codex, will have a better chance uh against the teams that simply just uh hand over the entire challenge to their up-coding agents, because there are just so many or too many or a few uh surprises that uh are kind of impossible for the agent to anticipate. We'll see how it works out.
Uh either way, uh this challenge is going to be uh mostly a warm-up challenge. That's why we're doing it like small scale, without too many official helps, without too many official streams, and even my hairstyle is messed up.
Uh and we're going to get insights from this challenge, like how it is was working, how my theories about what's predictable or not predictable uh translated actually uh into how your agents were messing up there run time or creating load or not creating load.
And then we'll use that information in the next month to actually prepare for that e-commerce 2, which will be even closer to the business, which will be even closer to production.
And that will be more realistic and that is going to be actually the big challenge with Calibric's 1. It may be actually a few other sponsors.
So, the idea is hey folks, here's the thing that lets you develop agents.
You don't need to worry about the tests.
You don't need to worry about the benchmarks.
You just work focus on the creative the most complicated part of engineering.
Create the world's best agent in this case the best world's the world's best e-commerce agent and we'll handle the grading, we'll handle the feedback and we'll work to make sure that this benchmark suit stays as close to the real world challenges as representative as possible so that you can learn, you can try, maybe you can fall in love with the e-commerce and supply chain. And then use your leaderboard scores to get a job.
We'll see how it works.
By the way, thank you for joining me and joining us, joining this wonderful community. Thank you for all the support and all the work in the actual previous weeks in trying to break this suit, in trying to build better agents, in sharing what is working, what is not working or sharing how you crash my CPU again.
So, now let's put that to the test.
And so the next agenda is that I will launch the competition within like 5 minutes or so.
I'll update as soon as the service deployed. And the rate limit will be reset.
And we're going to run the competition in blind mode for 3 hours.
At the end of the 3 hours and the results will be sealed by then. At the end of the 3 hours I'll do a short update.
We're going to switch the competition into the open mode. So afterwards you can run your agents already with the same feedback and you can see the same scores.
And And one thing Yes. And we will announce the results for the runs that were created and submitted within this competition window tomorrow at the pre-designed >> [clears throat] >> schedule that we wrote about.
So good luck. Have fun.
If you have any questions don't hesitate to share them in public on both channels and let's see how it works out.
See you soon.
Related Videos
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 viewsโข2026-05-29
Long-Running Agents โ Build an Agent That Never Forgets with Google ADK
suryakunju
142 viewsโข2026-05-30
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K viewsโข2026-05-28
BREAKING: Microsoftโs New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 viewsโข2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 viewsโข2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K viewsโข2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 viewsโข2026-05-29
3D Platformer Update - NO CAPES
SolarLune
294 viewsโข2026-05-30











