This approach tackles the credit assignment problem by decoupling reasoning (a VLM actor) from evaluation (a lightweight CNN critic), which makes long-horizon RL for VLM agents much cheaper to run. The laptop-scale recreation shown here improves a sampled policy on an 8 GB consumer GPU, though it does not yet produce a reliable greedy controller.
Deep Dive
Training a Tiny VLM to Play Kirby with a New Long-Horizon RL Recipe
So, in this video, I'm going to be talking a little bit about VLMs, vision language models, and I'm going to be showing how I was able to train one myself, a small 500 million parameter model, to play this Kirby game that you're watching right now.
And you'll see how it goes from not being able to play at all to being able to do decently. Got over that.
So, watch to the end of the video to see how I do this and what the full results are. I saw this paper the other day. It recently came out of Princeton, and I thought it was pretty interesting in the way that they approached AI agent training.
So, it's called Odysseus, scaling VLMs to 100+ turn decision-making in games via reinforcement learning.
So, it's about basically how to teach vision models to play video games, not by showing it how, but by letting it learn from experience.
So, this headline itself sounds pretty simple, right? AI just plays Mario.
But that's not really what makes it interesting. What's interesting is the training recipe that they figured out, because it turns out getting an AI to make 100 decisions in a row where each one matters is really hard.
So, they actually cracked this with a pretty interesting trick. So, in this video, I'm going to break down exactly what this means, what they did, and what their results were. And then, in the latter half, we're actually going to do an experiment ourselves, training this not using Super Mario, but using Kirby's Dream Land, one of the classics from the Game Boy era.
So, let's get started.
And if you like this video, please subscribe to my free weekly newsletter, where I give my honest thoughts about the week's AI news that I can't share here, as well as interesting research and papers that I found, and what projects I've been building behind the scenes.
There's a link in the description, or go to onchainaigarage.com.
Every word is written by me, mistakes and all. So, if you're sick of AI slop articles or all hype with no substance, subscribe and give AI Garage a weekly shot. Now, back to the video.
So, a quick background, cuz I think a lot of people might be thrown by this. Most people are familiar with ChatGPT, Claude, that sort of thing. So, those are LLMs, large language models. I'm sure you're familiar with the term.
And this is for text, right? Text goes in, text comes out.
You ask it a question, it answers.
So, VLMs are a little bit different, right?
This is the next step. This is a vision language model. Same idea, but now we can also look at images.
You can give it a screenshot and ask, "What's happening here?"
and it'll describe what it sees.
Think of it like this. An LLM is like talking to somebody on the phone.
You can only describe things with words.
A VLM is like a video call. They can actually see what they're looking at.
So, VLMs can see.
That's great, but can they do things?
So, if I show a VLM a game screen and ask, "What button should I press?" It can answer. But, here's the problem. A game isn't one decision, it's hundreds.
And the consequences of each decision might not show up until way later.
So, imagine you're playing Mario and you skip a power-up on turn three.
Everything seems fine, but then 80 turns later you hit a boss and you don't have the fire power you need to beat it. You die.
But, which turn was the mistake?
Turn three or turn 80? So, that's the hard part in all this.
So, how did people try to solve this issue before? Option one was supervised imitation.
You record an expert playing the game, and then you show the AI those recordings and say, "Copy this."
So, the problem with this is you need a ton of expert demonstrations. It doesn't scale well, and the AI becomes brittle. It only knows what to do in situations it's seen before. In other words, it's overfit. So, you throw something new at it and it completely falls apart. This is a common issue with training models, this issue of overfitting. It doesn't generalize.
So, option two is short horizon RL.
Reinforcement learning, which is where the AI learns by trial and error.
But, previous work only did this for maybe 20 or 30 steps. So, it's not really enough to handle long horizon tasks.
And neither of these can actually prove an AI can handle long complex decision-making.
So, this is the core difficulty, and it has a name: the credit assignment problem. So, think of it like coaching a basketball team.
Your team loses by two points. Was it the missed free throw in the third quarter that caused it? The bad pass in the first? The lineup choice at halftime?
When the outcome is the result of hundreds of decisions stacked together, figuring out which one actually mattered is incredibly hard.
In a game like Mario, the AI might die on turn 80.
But, the real mistake might have been turn three or turn 15, turn 42.
The longer the chain of decisions, the harder it is to assign credit or blame to any individual choice.
And this is what the Odysseus paper proposes.
So, instead of trying to solve everything at once, you split the problem into two phases.
Phase one.
SFT.
Supervised fine-tuning.
You teach the AI to see.
But, critically, they're not teaching it to copy the gameplay. They're teaching it perception.
That's an enemy. That's a gap. That's a power-up.
Kirby is in the air. The path goes right.
They took around 5,000 game frames and had a stronger AI model write a structured description of what's happening in each one.
Then they train their model on those descriptions.
For phase two, they teach the AI to act.
So, this is the reinforcement learning phase. Now that the model understands what it's looking at,
you let it play. You let it fail. You let it learn from the consequences.
So, think of it like this. SFT teaches the vocabulary. RL teaches the street smarts.
You need to know what a stop sign is before you can learn when to run one.
Let's get into this a little bit more.
What is reinforcement learning, concretely? I've touched on this in a couple of other videos, but this is kind of core to this paper and to this experiment.
And the loop is fairly simple. The agent looks at the screen, picks an action, the game responds, and the agent gets a reward signal.
Maybe it's positive if it made progress, negative if it took damage. And it updates its strategy slightly.
You repeat this thousands of times, and it gradually gets better.
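Here's a minimal sketch of that look, act, reward, update loop. Everything in it is an assumption for illustration: `ToyLevel` is a made-up one-dimensional "walk right" game, not the Kirby emulator, and the update is a simple running-average preference rather than PPO.

```python
import random


class ToyLevel:
    """Hypothetical stand-in for a game: moving right is progress."""

    def __init__(self, length=10):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        old = self.pos
        if action == "right":
            self.pos += 1
        else:
            self.pos = max(0, self.pos - 1)
        reward = self.pos - old  # positive for progress, negative for going back
        done = self.pos >= self.length
        return self.pos, reward, done


def train(episodes=200, max_turns=50, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {"left": 0.0, "right": 0.0}  # the agent's "strategy"
    env = ToyLevel()
    for _ in range(episodes):
        env.reset()
        for _ in range(max_turns):
            # Explore 20% of the time, otherwise take the preferred action
            if rng.random() < 0.2:
                action = rng.choice(list(prefs))
            else:
                action = max(prefs, key=prefs.get)
            _, reward, done = env.step(action)
            # Update the strategy slightly toward the observed reward
            prefs[action] += lr * (reward - prefs[action])
            if done:
                break
    return prefs


prefs = train()
```

After training, `prefs["right"]` ends up higher than `prefs["left"]`, which is the toy version of "it gradually gets better."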
Now, the specific algorithm usually used is called PPO, proximal policy optimization. And the key idea is that you don't change too much at once. If the AI discovers that jumping works well in one situation, PPO makes sure it doesn't overcorrect and start jumping everywhere.
Small, stable improvements.
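The "don't change too much at once" idea is PPO's clipped objective. Here's a small sketch of it for a single action. The clip range of 0.2 is a common default and an assumption here, not a number from the video, and the probabilities are made up.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-action PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the action became
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Jumping worked well (advantage +1.0), and the new policy already makes it
# 1.5x as likely as the old one. The objective stops rewarding the extra push
# past 1 + clip_eps.
obj = ppo_clipped_objective(math.log(0.6), math.log(0.4), advantage=1.0)
```

Because the objective is capped once the ratio leaves the clip range, there is no gradient incentive to keep pushing the same action further in one update.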
So, here's where it kind of gets interesting.
In reinforcement learning, you usually need two things, an actor and a critic.
The actor makes the decisions, and that's our VLM. That's the model. The critic evaluates situations. It says, "This state looks promising." or "You're in trouble."
The obvious approach is to use another big language model as a critic.
Have it read the game state and judge it. But that's extremely expensive. A lot of the previous papers have been using frontier models, and you need to run two massive models, and it's actually kind of unstable. It tends to break during training.
So, what the team found out is that the critic doesn't need to be smart. It doesn't need language. It doesn't need to reason.
It just needs to look at the pixels and output a single number. How good is the situation on a scale?
And crucially, it only needs to do this once per game turn.
Not once per word the VLM generates.
That's the turn level versus token level distinction.
And it's what makes the whole thing really affordable.
Cheap, fast, and stable.
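To make the turn-level versus token-level difference concrete, here's a small sketch. One way to use a turn-level signal is to give every token the VLM wrote on a turn the same advantage. That sharing rule is my assumption for illustration, and the numbers are made up.

```python
def broadcast_turn_advantages(turn_advantages, tokens_per_turn):
    """Give every generated token the advantage of the turn it belongs to."""
    per_token = []
    for adv, n_tokens in zip(turn_advantages, tokens_per_turn):
        per_token.extend([adv] * n_tokens)
    return per_token


turn_advantages = [0.5, -0.2, 1.0]  # one critic-derived number per game turn
tokens_per_turn = [3, 2, 4]         # tokens the VLM wrote on each turn (made up)

per_token = broadcast_turn_advantages(turn_advantages, tokens_per_turn)

critic_calls_turn_level = len(turn_advantages)   # one call per turn
critic_calls_token_level = sum(tokens_per_turn)  # one call per generated token
```

With real reasoning traces running to dozens or hundreds of tokens per turn, the gap between those two counts is where the savings come from.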
So, before we go into this, let's break down what the cheap critic is. It's called a CNN.
It's a convolutional neural network.
I'll be mispronouncing that word.
And this is actually one of the oldest tools in deep learning, and it's perfectly suited for this job.
The CNN works by sliding small filters across an image.
Kind of like looking through a tiny window that you move around the picture.
The first layer of filters detects simple things.
Edges, lines. The next layer combines those into shapes.
Circles, rectangles.
The next layer recognizes objects.
Enemies, platforms, the player character.
At the end, all that gets compressed into a single number.
How promising does this game state look?
So, that's it.
That's the entire critic's job.
No language, no reasoning, just pixels in and a number out. And it's tiny compared to the VLM. Fast to run, fast to train. That's the whole trick behind this paper.
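Here's a toy version of "pixels in, one number out" in plain Python, just to show the sliding-filter mechanics. It's a sketch under heavy assumptions: one hand-picked filter instead of learned ones, a single layer instead of a stack, and a made-up 4x4 frame. A real critic would be a trained multi-layer network.

```python
def conv2d(image, kernel):
    """Slide a small filter over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


def tiny_critic(image, kernel, weight=1.0, bias=0.0):
    """Conv -> ReLU -> global average -> linear: compresses a frame to one number."""
    feature_map = relu(conv2d(image, kernel))
    cells = [v for row in feature_map for v in row]
    pooled = sum(cells) / len(cells)
    return weight * pooled + bias


# 4x4 "frame": dark on the left, bright on the right (a vertical edge)
frame = [[0, 0, 1, 1]] * 4
edge_filter = [[-1, 1], [-1, 1]]  # fires where brightness increases left to right

value = tiny_critic(frame, edge_filter)
```

The filter only responds in the column where the edge sits, and the averaging step collapses the whole feature map into the single value the critic reports.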
So, let me put the whole picture together with an analogy.
The VLM is like a race car driver. It's expensive, it's smart, it can read the track, think about strategy, can make complex decisions. That's your actor.
The CNN critic is like a cheap driving instructor sitting in the passenger seat. They don't know how to drive as well as the driver. They can't explain racing theory, but they can look out the window and say, "Good line." or "You're about to hit a wall."
You don't need a second world-class driver to judge every micro decision the first driver makes. You just need someone who can look at the road and give you an obvious thumbs up or thumbs down.
So, that's the separation. Expensive reasoning for decisions, cheap vision for evaluation. And that's what makes the 100-plus turn reinforcement learning actually work.
So, you might ask, "Why do we even need a critic, right?" Some recent methods, like GRPO and REINFORCE++, skip the critic entirely. It's been kind of a trendy move.
They just look at the final outcome and try to figure out things from there. And for simple short tasks, that works. It's fine. But for 100-plus turns, it breaks down. The reward signal is too spread out. Like trying to figure out which ingredient ruined a recipe by only tasting the final dish.
When a recipe has 80 ingredients, you need somebody tasting along the way.
And the paper shows this directly.
Critic-free methods are unstable and weak at this scale. You need value estimation.
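Here's a sketch of what "tasting along the way" buys you. It uses generalized advantage estimation (GAE), which is the usual way PPO turns critic values into per-turn credit. I'm assuming that pairing here, and the rewards and critic values below are made up.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one episode.

    values has one more entry than rewards: the value of the state after
    the final turn (0.0 if the episode ended there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # How much better or worse this turn went than the critic expected
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Three turns; the agent only loses a point on the last one.
rewards = [0.0, 0.0, -1.0]
# The critic thought things looked fine until the action on turn index 1
# moved the game into a bad state.
values = [0.5, 0.5, -0.9, 0.0]

# gamma=1, lam=0 reduces GAE to the plain one-step surprise per turn
adv = gae(rewards, values, gamma=1.0, lam=0.0)
blamed_turn = adv.index(min(adv))
```

Even though the penalty only arrives on the last turn, the most negative advantage lands on turn index 1, the move that took the game from a good-looking state to a bad one. An outcome-only method sees just the final -1.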
They also add one more trick, which was positive advantage filtering.
Instead of heavily punishing the model for bad moves, which can make VLMs unstable and weird, they mostly just reinforce the good moves. Actions that work better than expected get amplified.
Actions that work worse mostly get ignored rather than punished. It's just a cleaner learning signal.
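As a rough sketch of positive advantage filtering: keep the positive advantages and shrink the negative ones. The video doesn't give the exact rule from the paper, so the `negative_scale` knob is a hypothetical parameter of mine, with 0.0 meaning "ignore bad moves entirely."

```python
def filter_positive_advantages(advantages, negative_scale=0.0):
    """Keep positive advantages; scale down (by default zero out) negative ones."""
    return [a if a > 0 else a * negative_scale for a in advantages]


raw = [1.2, -0.7, 0.3, -2.0]
filtered = filter_positive_advantages(raw)
softened = filter_positive_advantages(raw, negative_scale=0.1)
```

With the default, only the moves that beat expectations contribute to the update. A small non-zero scale is one way to get "mostly ignored rather than punished."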
So, did it work?
Yes, pretty dramatically, actually. So, these are the results they report in the paper for this task. Odysseus achieves over 3x the average game progress of frontier models like GPT-4, Sonnet 4.6, and Gemini 3 Flash, models that are massively larger and a lot more expensive, obviously.
And compared to critic-free RL methods like GRPO and REINFORCE++, it's not even close.
PPO with the CNN critic is far more stable and reaches much higher performance. So, the takeaway: for long-horizon embodied agents, you probably need value estimation. The current hype around critic-free LLM training doesn't transfer to this kind of setting.
So, here's kind of the full recipe and the actual contribution of this paper.
Perception-focused SFT on around 5,000 frames teaches the model to see.
A VLM actor then picks the action with structured reasoning. A cheap CNN turn-level critic estimates state value.
PPO for stable policy updates.
And then positive advantage filtering to avoid destabilizing the VLM.
And auto curriculum to balance learning across different difficulty levels. So, none of these ideas are individually brand new, but wired together in a specific way, they make something work that didn't work before.
Long horizon VLM agents trained on interaction.
So, this was the original experiment. And this is what we're going to try to recreate. Playing a baby Odysseus.
It's the same recipe, just doing it on my laptop. Originally, they used Super Mario Land, the classic. And the model they were using was Qwen3-VL-8B.
An 8 billion parameter model. And they were using research GPUs that I don't have access to. What we're going to be doing is Kirby's Dream Land, an equally good classic Game Boy game from 1992.
We're going to try to use SmolVLM2 for this one.
It's a much smaller vision model than the Qwen 3 model they were using. But it's what I have to use, because the GPU we're using is my RTX 3070 with 8 GB, and this is a laptop GPU. I'm actually going to be running this on my laptop, not my usual PC. I'm going to use the PyBoy emulator.
And the perception training uses the same structured annotation approach, and the critic is the same lightweight CNN structure.
Everything is faithful to the paper's design, just shrunk down to fit on my hardware, which you may own yourself.
So now we're going to be moving into the terminal a little bit and I'm going to be doing this in Codex actually, I decided.
So let's go over there now.
Okay, and this is the spec script that I handed to Codex.
You can see the goal: to build a laptop-scale recreation of the core idea from the Odysseus Mario paper.
A vision language actor learns a long-horizon game through perception-focused SFT, reinforcement learning, and a lightweight CNN critic.
So this is the reference paper like I said.
For Super Mario Land, they used this Qwen 8 billion instruct model.
So this is what we're going to be trying to do.
And this is going to be the core architecture. We're going to start with Kirby emulator frames.
Then the SmolVLM model produces structured output: perception, reasoning, action. An action parser reads that, and then PyBoy executes the action. PyBoy is a Game Boy emulator that is run through Python.
Then you have the frame reward and the CNN critic, which estimates the value. PPO updates the actor policy, and then it's just a continued training loop from there, basically.
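The action parser step in that pipeline could look something like this. It's a sketch, not the code Codex generated: the `perception / reasoning / action` line format, the `VALID_ACTIONS` set, and the `noop` fallback are all assumptions of mine, and the emulator call is left out entirely.

```python
import re

# Hypothetical action vocabulary; the real project's button set may differ
VALID_ACTIONS = {"left", "right", "up", "down", "a", "b", "noop"}


def parse_action(model_output, default="noop"):
    """Pull the action out of a 'perception / reasoning / action' style reply."""
    match = re.search(
        r"^action\s*:\s*(.+)$", model_output, re.IGNORECASE | re.MULTILINE
    )
    if not match:
        return default  # malformed reply: fall back to a safe action
    action = match.group(1).strip().lower()
    return action if action in VALID_ACTIONS else default


reply = """perception: Kirby is on the ground, enemy ahead, path goes right.
reasoning: The gap is small, so keep moving.
action: right"""

action = parse_action(reply)
```

Falling back to a default instead of raising keeps a long rollout alive when a small model occasionally produces a malformed reply, which matters when one episode is 100+ turns.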
Well, I handed this to Codex.
So, I thought this would be a good opportunity to try Codex's new goal function.
So, I set approvals to be auto review, which basically means Codex will review whether it actually needs my approval, whether something has high risk or low risk.
Seems like most of what it would have asked me to do is low risk.
And then you see "pursuing goal" down here. It's been running now for an hour and a half.
And once I gave it the data, the ROM for Kirby's Dream Land, and the spec script, it was able to run it, and so far it has downloaded a bunch of dependencies that it needs for the SFT.
It hit some kind of bottleneck, which it had to make these kinds of changes for, as you can see.
It trained a format adapter, ran a strict live rollout, and built this mixed-format in-gameplay SFT curriculum. So, now it's gotten through that. That took up a lot of the hour and a half that we've been running. Right now it's just trying to troubleshoot the training curriculum and get a proper SFT run in. And that's nice, because I was away from the computer for most of this hour and a half. So, I didn't have to sit here and live through it, which is what I usually have to do with these experiments. You can see it's running the SFT right now. So, let's just see how it runs.
What we get.
Okay, so that run finished, and you can see the loss is now down to 0.012. So, that's pretty good.
Now it's evaluating strict and lenient action success and action diversity on held-out frames.
So, we'll see if it can handle different types of frames.
Okay, so it made some progress on that run. Let's see.
But we didn't quite have gameplay-level results yet.
Still running different training runs.
The next targeted SFT step is this new data set.
It's going to be using this kind of validated right, right, right-A, right-B progress cycle.
So, it's going to be running that. Um but I'm going to just let this run overnight, I think.
So, I don't know how long it's going to go.
We're at an hour and 47 minutes.
And should be okay if I just let it run overnight, right?
Well, we're going to find out. So, I will see you guys in the morning.
Okay, good morning, everyone.
Uh we're back at the desk.
And you can see I was like 95% sure this would be stopped at some point.
But I just got to my computer and you can see 12 hours, 12 minutes.
It's just been running non-stop.
I think I should stop it.
See what it's been doing for the 11 hours, or however long, since I was actually working on this.
Ding ding. It seems like it's working on the PPO phase at least. I think it did finish the SFT.
That's crazy. I'm amazed I didn't run into my limits here. I'm currently only on the $100 plan for Codex.
And it's been running non-stop for over 12 hours now.
So, I said: take a second to review all the work done. Are you still on the path to recreating the core process described in the paper spec file, creating a mini VLM that can generalize and play Kirby better than the base models? Consider better possible alternatives.
I'm going to try to audit the current reports.
Just want to see if it's on the right path.
It's been running for 12 hours. It shouldn't I don't know. That seems long for something like this.
Okay, short answer: we are aligned with the engineering pipeline of the spec, but we are drifting away from the core scientific target if we keep optimizing scripted routed cues policies.
The spec's core spirit is not "find the best Kirby action script." It is closer to: build a VLM-style actor that sees frames,
writes a structured action, and with SFT improves its gameplay.
We have done a lot of the pipeline correctly, but the strongest results are coming from search assisted action schedules.
Not from the mini VLM that understands the screen and generalizes. We went a little bit off track. It was chasing these tiny gains from PPO, and you can try to optimize that as much as you like, apparently for 12 hours, but that's not necessarily the best use of our time here.
Or of our tokens. The next phase is visual policy generalization.
That is more faithful to the spec's core goal than more PPO variants on memorized scripted cues. Yeah, we don't want it to memorize.
Okay, so let's let's take that little pivot.
Okay, so I told it to make that pivot.
Remember the core goal is not necessarily to get the best Kirby playing model.
Um it is to train the VLM based on the process of the paper and to see what the outcome is there.
This was my first time using goal, and maybe my prompt was not the best.
I think I need to be a little bit more explicit with what I want it to do and when it should give up.
Okay, so it's going to be making that move now.
And we'll see what we get from here.
Yeah, I'm still learning how to use goal. I'm amazed it ran for 12 hours.
Um but I don't know if that was the best use of our time, although I was sleeping, so.
I guess we got some some data out of it.
Yeah, I think I need to play with goal a little bit more to find the right prompting.
I need to think a little bit more about how to structure the goal.
Because it really will just run forever.
So, okay, let's see what we get out of this.
Okay, so digging into this a little bit deeper, I think the issue previously was that it went off track, and it had a limited data set.
We need thousands of stills. The paper had, I think, over 5,000.
Not the hundreds that we have here.
And then there was some issue with the labels. A single still may not be enough.
You need to use either a two-to-four-frame history, a frame plus compact RAM state, or both.
So, it's still the same principle. We just need clearer labels.
And balance actions intentionally.
They need a little bit of tweaking, but the core innovation is not memorizing an optimal Kirby script. It is to SFT a VLM into a game-playing actor, then improve it with RL using the visual value feedback that we talked about.
I think that's the pivot we're going to take.
My concern was that the small VLM was just too small to properly execute this, but Codex says it's okay as long as we implement these changes.
So, let's do that and see what we get.
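The two-to-four-frame history idea mentioned above can be sketched with a small rolling buffer. This is my own illustration, not what Codex built: `FrameHistory` is a hypothetical helper, and the frames are just string labels standing in for images.

```python
from collections import deque


class FrameHistory:
    """Keep the last k frames so the model sees motion, not a single still."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)  # oldest frame drops off automatically

    def push(self, frame):
        self.frames.append(frame)

    def observation(self):
        # At the start of an episode, pad by repeating the oldest frame
        frames = list(self.frames)
        while frames and len(frames) < self.frames.maxlen:
            frames.insert(0, frames[0])
        return frames


history = FrameHistory(k=3)
for frame_id in ["f0", "f1"]:
    history.push(frame_id)
obs_early = history.observation()

for frame_id in ["f2", "f3"]:
    history.push(frame_id)
obs_late = history.observation()
```

With a single still, "Kirby is in the air" looks the same whether he's rising or falling. A short history lets the labels and the policy tell those apart.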
Okay, so we finished the larger SFT on the 5,000-still corpus, and we did get better results so far.
You see the assessment was that the larger data set improved the VLM from one-action collapse to limited action diversity, but it's still not a competent live controller yet. The next paper-aligned step is to use this SFT adapter as a stage one VLM artifact and move on to the RL distillation stage with the CNN actor-critic path. So, it's not quite up to the standard we need right now, but that's okay.
But we're making progress. So, we're going to move on to this RL stage that we talked about.
This is one of my first ML experiments in Codex, and while it's doing a good job actually executing it, I feel like it's not very good at explaining what it's doing in clear terms for me. I feel like Claude is always a better explainer of exactly what the issues are.
It feels more clear and concise in the way it describes what it's doing.
Even though, generally, I think Codex is handling the actual work better.
This is my observation from using this today. Okay, so we did the first run of the CNN policy, and these are the results that we got. So, some of the terms that come up here: PPO is the reinforcement learning algorithm used to improve the policy from emulator rewards. The two things you need to know here are sampled policy and greedy policy.
So, a greedy policy always takes the single highest scoring action.
And a sampled policy, instead of always taking the single highest scoring action, randomly samples from the model's action probabilities. So, higher probability actions are more likely, but there's a certain randomness involved.
And the policy itself, the CNN policy, is a small neural network that looks at the game frames and chooses controller actions.
That's what some of these terms are.
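Here's the greedy versus sampled distinction in a few lines of Python. The action names and probabilities are made up for illustration.

```python
import random


def greedy_action(probs):
    """Always take the single highest-probability action."""
    return max(probs, key=probs.get)


def sampled_action(probs, rng):
    """Draw an action at random, weighted by its probability."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


probs = {"right": 0.5, "jump": 0.3, "left": 0.2}  # made-up policy output
rng = random.Random(0)

greedy = [greedy_action(probs) for _ in range(1000)]
sampled = [sampled_action(probs, rng) for _ in range(1000)]
```

The greedy list is `"right"` every single time, while the sampled list mixes in `"jump"` and `"left"` roughly in proportion to their probabilities. That occasional off-policy jump is one plausible reason a sampled policy can get unstuck where a greedy one keeps walking into the same wall.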
So, before PPO, the greedy CNN did nothing useful. The sampled CNN did some somewhat useful things.
But after PPO, you can see the greedy CNN was still weak, but it definitely did improve, from progress at zero to progress at 518.
But the real improvement was the sampled CNN, which went from 3,117 to 4,808 in terms of game progress.
So, the concrete improvement was that reward improved from 154 to 248. And like I said, the progress improved up to 4,808.
So, that's about a 60% improvement in reward and a 54% improvement in progress. The caveat, obviously, is that this improvement only showed up when sampling actions.
The deterministic greedy version is still somewhat poor.
So, PPO improved the policy distribution, but it did not yet produce a reliable single best action at each frame. Let's see what the next stage will be for this.
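For anyone checking the percentages, here's the arithmetic on the sampled-policy numbers quoted above.

```python
def percent_improvement(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


reward_gain = percent_improvement(154, 248)      # reward: 154 -> 248
progress_gain = percent_improvement(3117, 4808)  # progress: 3,117 -> 4,808
```

That works out to roughly 61% on reward and roughly 54% on progress, which matches the rounded figures in the report.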
Okay, so the next steps. I keep having to kind of reinforce Codex that we're just trying to recreate the paper, and it's okay if we fail.
It keeps trying to add extra things in order to get a better model, which makes sense if you're actually trying to create an ultimate Kirby playing model.
Maybe we'll do that, but right now we're just trying to recreate the paper. So, we want to try to stay as close to that as possible, even if it means failing at the end. But we're having some improvement. The next move is to run a more tuned CNN PPO from the SFT prior checkpoint, evaluate the sampled CNN policy as the RL policy, and then document whether we see any improvements.
So, we'll move on from there, and we'll see what we get.
Okay, so it completed that training run, and you can see the key results here.
The pre-RL CNN SFT prior sampled baseline got us to a max progress of 3,117.
In the strict longer PPO best sampled checkpoint, we got a max progress of 4,690.
So, this is the cleanest faithful result so far.
The SFT prior plus CNN PPO improved sampled gameplay.
So, there are some limitations. The greedy best checkpoint still failed.
So, it's not perfect, but the conclusion is that the paper-style RL stage does improve the stochastic CNN policy, but not into a stable deterministic controller using the setup that I have, which, like I said, is a mini version of this setup using a different model and different GPUs than they were using. So, just to show you what this looks like in the actual game, this is actually rendered with the agent playing Kirby.
And this is going to be the sample prior to any of the training, prior to our SFT or the PPO training with the CNN. So, I'll show you what it looks like.
So, it might be difficult to see. It's a little bit choppy.
That's just the nature of the recording.
But you see he gets stuck there.
But in terms of progress, he gets to around 2,400 in the stage. And now look at our trained sample.
So, this one will go 4,400 plus in terms of progress.
It might not always be clear, but you can see he's moving along a little bit more.
And that's as far as we get. For both of them, he just went 180 steps, and they counted how far he got into the level in 180 steps.
So, I asked it to remove that cap to see how far we can get with Kirby still surviving. And it saw that the average max progress increased to 5,358.
The best observed run overall was 8,760.
And the best run went to 388 steps.
That's quite a bit larger than the original cap that we had just for training purposes. And now I'm going to try to record a video of this longer run so you can see what happens.
Okay, so this is the video of the extended run.
And let's see how far Kirby gets here.
So, this is just letting it run until he dies.
This is our agent after the full training on a long horizon. This one went 388 steps.
Quite a long run. Doing pretty good. Got stuck there.
See, can he get over this?
Uh, he did it. Good job.
Doing good.
Yes.
Uh, he fell down. Okay, so that's the end.
But he did pretty good. He got pretty far into the level considering it's just an agent.
And this is a very small model. Like I said, this is SmolVLM, very small.
0.5 billion parameters, not even a billion parameter model. So, a very lightweight model, but it was able to get a pretty good result after the training we ran through.
So, there you have it. That's our recreation of this Odysseus paper.
Decent results.
A little bit of a tricky experiment. Not totally clean, but I think we got decent results in the end, and it showed how their novel approach to training VLMs for long-horizon tasks, using the fun example of a Mario game or a Kirby game, really made it possible to improve on those long-horizon tasks.
So, I hope you found this video interesting.
If you'd like, please leave a comment.
Please subscribe. If you like this video, please leave a like.
And then I'll see you in the next experiment. Thank you for watching.