Rhoda AI sidesteps the robotics data bottleneck by reframing physical control as a visual prediction task, treating the internet's video library as pre-training data for physical intuition. The approach rests on the claim that an efficient way to teach a robot to act is to first teach it to accurately imagine the future.
Deep Dive
Ep#79: Rhoda AI - Causal Video Models Are Data-Efficient Robot Policy Learners
Hey guys, welcome to another episode of RoboPapers. Super glad today to have three of the researchers from Rhoda AI with us: Tongjo, Eric, and Chen. Thank you so much for making the time. Maybe as a start, each of you can share a little bit about yourselves, and then you can go through the very impressive work that you have done and put on the website.
>> Sure. Maybe I can start. Hi, I'm Tongjo. I obtained my PhD from UC San Diego a few years ago, where I worked on robot simulation, which is basically a human-written world model. Now I'm at Rhoda AI working on learned video world models for robotics.
>> My name is Eric. Before Rhoda I was a PhD student at Stanford. My focus was on generative models and computer vision. I was fortunate to do a lot of work with Nvidia, and I was also a founding researcher at World Labs. Here I'm chief scientist.
>> Hello everyone, my name is Chen. Before this I did my postdoc at Stanford, working on multimodal language models, and before that I got my PhD from UT Austin, where I did a lot of work on multimodal embodied AI, much of it involving vision and audio. Here I do a lot of model training.
>> Okay, perfect. I know you guys have put a lot of details on this one single web page.
So yeah, take it away. We'll listen and ask questions along the way.
>> Maybe I'll give a very brief overview, and then we can go through it.
>> Okay.
>> Yeah. Cool. I think a good place to start is our motivation. At Rhoda we are interested in general-purpose robot control. For decades we have been very good at making highly specialized robots that do repetitive factory tasks, but a truly general-purpose robot generalist is still an open problem. So how can we solve this kind of general-purpose intelligence problem? I think we already know a verified path: large-scale pre-training with web-scale data, followed by task-specific post-training. We have seen that this recipe works really well in the language and vision domains. So a very natural thought is that maybe we can just apply the same recipe to robotics and the problem will be solved, right?
However, and I think many robotics researchers will agree with me here, the data is a bottleneck. To train those vision and language foundation models, we use all that data from the internet. But for robotics, we simply don't have that data. Even if we ask many people to do teleoperation to collect robot interaction data in the real world, it will still be a tiny fraction of the web data out there. So at Rhoda AI we believe that web video might be the most scalable data source capturing the dynamic physical world, and we also believe video generation might be the most effective objective for the model to learn the physical intuition that the robot needs to make decisions. Basically, what we did is reformulate robot control as a real-time video generation problem, and we call it DVA, a direct video action model.

Maybe we can dive into how it works. We have a pretty nice animation here. In our system we have two separate models. The first one is a causal video generation model. It takes the past video frames from the real world and generates a video of what should happen next. Then a separate, much smaller inverse dynamics model takes the generated video as input and translates it into real robot actions. Those actions get executed on the real robot, and then we add the latest observations from the real world to the video context of our video model. This loop runs multiple times per second, so our system is fully closed loop. This is roughly how it works, and this design unlocks a few significant advantages.
First, our video model can directly leverage large-scale pre-training. This is because during the pre-training stage the video model is trying to predict videos, and in the post-training stage it is still predicting videos. So the training objective stays fully consistent.
The second advantage is in the video-to-action translation step. We find that this is actually a very constrained problem, just like language-to-language translation: as long as we have the right data, it's pretty easy to solve. Empirically, we find that with only tens of hours of robot data we can learn a very good inverse dynamics model for a specific robot. With such a design our system unlocks a lot of interesting capabilities, for example data-efficient task learning, long-context visual memory, one-shot imitation learning, and interpretability. All of these are very interesting, and we can dive into them later in this conversation. But mostly, as we mention here, our approach offers a clear path to scaling, because there is so much video data out there on the internet, and this enables us to further scale up our model. This is a key benefit of our system. Maybe we can take a look at some of the demo videos we show.
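Editor's sketch: the closed loop described in this turn (video model, then inverse dynamics, then execution, then the new observation back into the context) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Rhoda AI's implementation: `run_closed_loop`, `ToyRobot`, `toy_video_model`, and `toy_inverse_dynamics` are hypothetical names, frames are single numbers rather than images or latents, and the "models" are hand-written functions.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]   # stand-in for an image or latent frame
Action = List[float]  # stand-in for a robot command


def run_closed_loop(
    video_model: Callable[[Sequence[Frame]], List[Frame]],
    inverse_dynamics: Callable[[Frame, List[Frame]], List[Action]],
    execute: Callable[[Action], Frame],
    first_observation: Frame,
    steps: int,
    context_len: int = 8,
) -> List[Action]:
    """Hypothetical control loop: imagine, translate, act, observe, repeat."""
    context: Deque[Frame] = deque([first_observation], maxlen=context_len)
    executed: List[Action] = []
    for _ in range(steps):
        # 1. Predict a short clip of what should happen next.
        future = video_model(list(context))
        # 2. Translate the predicted clip into robot actions.
        actions = inverse_dynamics(context[-1], future)
        for action in actions:
            # 3. Act, then 4. feed the real observation back into the context.
            context.append(execute(action))
            executed.append(action)
    return executed


class ToyRobot:
    """1-D robot whose observation is just its position."""

    def __init__(self) -> None:
        self.position = 0.0

    def execute(self, action: Action) -> Frame:
        self.position += action[0]
        return [self.position]


def toy_video_model(context: Sequence[Frame]) -> List[Frame]:
    """Pretend prediction: move toward position 1.0 in steps of 0.25."""
    position = context[-1][0]
    return [[min(position + 0.25 * (i + 1), 1.0)] for i in range(2)]


def toy_inverse_dynamics(current: Frame, future: List[Frame]) -> List[Action]:
    """Recover actions as displacements between consecutive frames."""
    frames = [current] + future
    return [[b[0] - a[0]] for a, b in zip(frames, frames[1:])]
```

Calling `run_closed_loop(toy_video_model, toy_inverse_dynamics, ToyRobot().execute, [0.0], steps=3)` drives the toy robot to the goal position. The point of the sketch is only the data flow: all decisions live in the video model, and the inverse dynamics step is a mechanical translation.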
>> Sure.
>> Sorry, can I ask a question? It's really interesting, the direct video action model, and I want to understand a bit more here, because there are usually two lines of doing this, right? One is the unified world model approach, where you predict future states jointly with the actions. And then there's another line of work, like DreamGen from Nvidia, where they train a video generative model and then have a separate model that does inverse dynamics to predict the action.
Is this, instead of a separation between the two, an end-to-end combination of predicting video and then doing the inverse dynamics in one single training? Am I getting that right?
>> I think that's an awesome question.
Let's take the two papers you mentioned as examples.
>> DreamGen and DreamZero. First, at a high level, DreamGen shares a similar idea with us. We both generate future video and use an inverse dynamics model to predict actions. But the key distinction is that DreamGen uses this approach to generate training data for robot policies. Then they train another model that takes in images and outputs actions. So the video model is used as a data generation engine. In our approach, the video model itself is the policy.
It generates videos, which are translated back into actions. So there is no separate policy training stage that uses data generated by the video model. This is a key distinction. The other one, DreamZero, is also pretty close to our method. DreamZero actually does closed-loop control, using the video model to output actions. But in their approach they do joint video and action denoising.
Basically, that means they use a unified model that generates video and actions at the same time. In our approach we generate the video first and translate it to actions in a second step. As I mentioned, there are some intuitive advantages to our design. First, our video model is pre-trained to predict video and also post-trained to predict video, so we never change the training objective. All the advantages we get from pre-training can be directly utilized in post-training. Another advantage is that we verified that video-to-action translation is very easy, so we can do it with a much smaller model. We do not need to use the entire huge video model for this task.
>> I see.
>> First of all, I think we might have had this exact discussion on social media at some point. It might have been you. But also, you say that it's advantageous that you don't need to change the pre-training objective. What concretely comes out of that? Would you say your approach is better at generalization, or is it more data efficient, or what exactly?
>> Maybe I could comment on this. The high-level question that we spent a lot of time thinking about, and I'm sure all of you have too, is this: if you decide to use a video-based approach for robotics, there are many different ways you can use it. You can use it for generating training data, or as a simulator to run your policies in, or for direct policy prediction like we do, where we roll out the video and convert that to actions in a closed loop.
There are a few different ways you could use that. You could use video purely for its representation learning abilities: you use video generation as a pre-training objective and then train an action head on top of that to predict the future actions. Or you could predict the video and the actions jointly. Or you can do what we do, which is predict the video into the future and convert that into actions in a non-causal manner. When we started Rhoda we spent a lot of time thinking about this, running experiments, and trying to validate what the correct approach was. We've actually done a fair amount of research not even related to robotics, and one of the very interesting conclusions from this type of work is that the order in which you predict makes a huge difference in terms of your generative model's capabilities. Let me give you an example. There's a recent paper that some of us are on called Latent Forcing, and what it explores is different paths in a diffusion trajectory when you are jointly predicting latent features, like DINO features, as well as pixels. Maybe you could scroll down a bit to that figure. Perfect.
Right. What this diagram shows very clearly is that the path along which you generate can have a pretty profound effect on your quality. In this diagram, the yellow paths mean that you generate DINO features first and then predict pixels. The blue paths on the bottom mean that you generate pixels first and then DINO features. And the orange path through the middle means that you generate both pixels and DINO features simultaneously. And of course everything is held constant in terms of the total amount of compute you use at inference.
So one of the conclusions from this paper is that the path you take through your space can make a pretty big difference, even if you only care about predicting pixels or, in the robotics case, only about predicting actions: whether you predict video first, action first, or both simultaneously matters. Of course we see this in many other domains as well, such as language models, where the order in which you output tokens to solve, say, a complex math problem can make a big difference. Usually predicting just the answer in one token is not going to be as good as building out a reasoning trace. With this perspective, maybe we can see why there's a big benefit to doing the video prediction first and then the video-to-action translation. If you predict the action alone and don't actually predict the video, then you're essentially relying only on your action training to make that prediction. You're getting some useful features from the video branch, but you're not taking advantage of sampling in video space, which the model became good at during pre-training. If you generate your video and actions simultaneously, so you jointly denoise them, then you're somewhat in between, because by definition, on the first prediction you make, you also don't have much information about the conditional distribution of your video. Whereas predicting video first and then action is beneficial if you expect to be more effective at making video predictions, possibly thanks to large-scale pre-training, than at predicting actions directly. We've spent a lot of time thinking about this, and based on our analysis as well as research we've done, including related research not even in robotics, we're pretty confident that we're making decent design choices here.
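Editor's note: the three orderings discussed above can be written out explicitly. The notation here is an assumption introduced for illustration and does not come from the speakers: $o$ is the observed video context, $v$ the future video, $a$ the action chunk, and $p_\theta$, $q_\phi$ denote learned models.

```latex
\begin{align*}
\text{action only:}  \quad & a \sim p_\theta(a \mid o)
  && \text{(video used only for features)} \\
\text{joint:}        \quad & (v, a) \sim p_\theta(v, a \mid o)
  && \text{(video and action denoised together)} \\
\text{video first:}  \quad & v \sim p_\theta(v \mid o), \quad a \sim q_\phi(a \mid o, v)
  && \text{(video model, then inverse dynamics)}
\end{align*}

\begin{align*}
p(v, a \mid o) &= p(v \mid o)\, p(a \mid o, v), \\
p(a \mid o)    &= \int p(v \mid o)\, p(a \mid o, v)\, \mathrm{d}v .
\end{align*}
```

The identities show that all three target the same action distribution in principle. What changes is which conditional has to be learned and sampled: in the video-first ordering, the hard term $p(v \mid o)$ is exactly the pre-training objective, and the remaining term $q_\phi(a \mid o, v)$ is the constrained translation step described in the conversation.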
>> This is really interesting. Let me just make sure I understand this. The argument here is that video is almost analogous to a reasoning trace: it's the correct intermediate representation, and it lets you compute all the right information about the world so that you can compute the right robot actions, just because we're so good at training video models now. Is that kind of the argument here?
>> I would say that's one way to think about it. By separating the problem into future video prediction and then non-causal video-to-action translation, we're doing a few things. One is that we're making the video-to-action translation very simple. Because we're making a translation, we're not predicting future actions. We're predicting actions that are covered by the video context. That means there's no behavior or decision-making involved in the action extraction. It's pretty much just tracking points in 3D space, tracking points in the video.
Or, given our wrist cameras, it's very akin to SLAM. So that problem is highly constrained. We train a multi-embodiment, multitask inverse dynamics model, and this inverse dynamics model does not really need to reason about what the right decision is.
It's essentially doing a point-tracking problem. All right.
>> And all the behavior is then baked into the video model.
If you expect that video pre-training gives you better video predictions, which we do and which has been shown before, then this is the scalable path, or one of the scalable paths, toward robot decision making.
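Editor's sketch: to make the "translation, not decision-making" point concrete, here is a toy version of video-to-action translation as point tracking. It is only an illustration under heavy assumptions: the real inverse dynamics model is described as a learned, multi-embodiment network, whereas this sketch uses hand-written tracking on tiny binary frames, and `make_frame`, `track_point`, and `video_to_actions` are hypothetical names.

```python
from typing import List, Tuple

Frame = List[List[int]]   # toy binary image: 1 marks the tracked point
Point = Tuple[int, int]   # (row, col)


def make_frame(row: int, col: int, size: int = 4) -> Frame:
    """Build a size x size frame with a single bright pixel."""
    frame = [[0] * size for _ in range(size)]
    frame[row][col] = 1
    return frame


def track_point(frame: Frame) -> Point:
    """Locate the single bright pixel in a toy frame."""
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            if value:
                return (r, c)
    raise ValueError("no tracked point in frame")


def video_to_actions(frames: List[Frame]) -> List[Point]:
    """Read actions off an already-given clip as frame-to-frame displacements.

    No decision is made here: the clip fully determines the actions,
    which is what makes the problem constrained.
    """
    points = [track_point(frame) for frame in frames]
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(points, points[1:])]
```

For a clip in which the point moves from (0, 0) to (0, 1) to (1, 1) to (1, 3), `video_to_actions` returns the displacements (0, 1), (1, 0), (0, 2). Every action is recovered from frames that already exist in the clip, which is the sense in which the translation is non-causal.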
>> I have a question, just to understand this figure a bit more. You are sort of predicting with a video prediction objective, but are you actually generating the video? Are you just using the latents to condition the inverse dynamics model, or are you actually outputting a video?
>> Practically, we operate in a latent space.
>> Yeah, okay.
>> A lot about shaping latent spaces and things like that is very important to get good efficiency with your models.
>> Okay.
Yeah. So I guess to some degree this is close to what Chris was mentioning, about framing it as a form of latent reasoning trace, in the sense that video models are really good at generating these future states, and you hope that you can bridge that with learning good actions. Maybe I'm not getting this correct, but that's the sense I have. Whereas the other, decoupled approach is more, I don't know if you guys have seen it, HAMSTER-style, where it's hierarchical: you have a high-level VLM, then an action policy at the end, and some intermediate interface between them. I guess in this case the video, or learning to generate video, is that intermediate interface that bridges the video model with learning good actions, if I'm not wrong. Or maybe I'm understanding this wrong.
>> Yeah, right. I think one way to view what we've done is as a visual chain of thought, perhaps: the purpose of the video generation is to make the video-to-action translation as simple as possible. Think about how far you need to predict into the future. If you predict a minute into the future, then video-to-action translation for the next second is very simple. If you predict 10 seconds into the future, it's still very simple. If you predict only one frame into the future, then it's pretty difficult, because you still need to make decisions for most of your action chunk. But if you predict just a reasonably short amount of time into the future, then you have made your video-to-action translation almost trivial. It's a very constrained problem, and you have a good balance between compute efficiency and the difficulty of this problem.
>> I see. I'm sure we have lots more questions, and we can keep asking them here, but would you guys like to move on?
>> I think you guys should interrupt me anytime you have questions.
>> So maybe we can go through the rest of the overview then. As we've described, the direct video action model is composed of these two models: a causal video model, which predicts a short time into the future in the form of video, and the inverse dynamics model, which does a non-causal video-to-action translation. This is executed in a streaming closed loop, so it runs continuously.
This causal video model is actually trained in-house from scratch. We developed the architecture, and we do that in-house because there are simply no video models out there that are strong enough and fast enough to do what we need. So I'd say we're one of the few robotics companies that truly focuses on pre-training, and that's where a lot of our effort goes. But training natively on video like this gives a lot of advantages. One is scalability: as you scale up your compute and your data, you get better video predictions. And maybe one of the beauties of the system is that if you have a perfect video model, an oracle video model, then you also have basically a perfect robot policy, in that it's the video model that's making all the decisions. If you can make a good video prediction for any task in any situation, then you also have basically physical AGI. But we don't need to get there for it to be useful, because every time we make our video model a bit better, with each subsequent round of pre-training, we get more data efficient. It's a steady ramp from where we are now, where we can run these models closed loop, to the true zero-shot, do-any-task regime. So each time we improve the video model, we feel like we're improving its capabilities, its data efficiency, and its robustness.
Training natively on video has other advantages as well. For example, it makes it very easy to debug the policies. Maybe you can scroll to the bottom.
>> Actually, that's one of my favorite parts.
>> Yeah. What these are is autoregressive video predictions. We just generate a few frames at a time, then feed that prediction back into the model and continue generating. And these are four- or five-minute-long video predictions in some cases. So the stability here, we feel, is pretty impressive. You're visualizing end to end how the model is going to complete a long-context task.
And this is very useful because we can get a good sense of whether a model is good enough, how it is going to perform when we run it in the real world, and what the failure modes are. So this is especially useful for debugging behavior. For example, did it reach for the right object? Did it draw the right drawing? Is it doing the steps in the right order? This makes evaluation a lot easier. We can just run hundreds of generations in parallel on a cluster without even needing to run anything on a real robot.
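Editor's sketch: the autoregressive rollout described above (generate a few frames, feed them back, continue) and the idea of scoring many generated rollouts offline can be illustrated as follows. This is a toy under stated assumptions: frames are integers, `toy_predict_chunk` is a hand-written stand-in for the video model, and `autoregressive_rollout` and `success_rate` are hypothetical helper names, not part of any Rhoda AI tooling.

```python
from typing import Callable, List, Sequence

Frame = int  # stand-in for a video frame or latent


def autoregressive_rollout(
    predict_chunk: Callable[[Sequence[Frame]], List[Frame]],
    context: Sequence[Frame],
    num_chunks: int,
    context_len: int,
) -> List[Frame]:
    """Generate a long clip a few frames at a time, feeding predictions back in."""
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    frames = list(context)
    for _ in range(num_chunks):
        window = frames[-context_len:]        # most recent frames only
        frames.extend(predict_chunk(window))  # predictions become new context
    return frames[len(context):]              # return only the generated part


def toy_predict_chunk(window: Sequence[Frame]) -> List[Frame]:
    """Pretend model: the next two frames simply count upward."""
    last = window[-1]
    return [last + 1, last + 2]


def success_rate(
    rollouts: Sequence[Sequence[Frame]],
    is_success: Callable[[Sequence[Frame]], bool],
) -> float:
    """Score a batch of generated rollouts without touching a robot."""
    results = [is_success(rollout) for rollout in rollouts]
    return sum(results) / len(results)
```

In this toy, three chunks generated from the single context frame `0` give the frames 1 through 6. In a real setting the `is_success` check would be a human reviewer or an automated judge answering the kinds of questions listed above, such as whether the steps happened in the right order.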
>> Sorry, I have a quick question. You mentioned you focus a lot on pre-training, basically taking advantage of this web-scale video data. I think when most people think about web scale, it's YouTube, right? And YouTube is, I would say, mostly exocentric, not egocentric data, whereas for robots you typically see wrist cameras, which are egocentric. I also saw on the website that you pre-trained on something like 300-plus years' equivalent of web-scale video data. So my question is: does the viewpoint of the video in your pre-training matter, or do you actually want a lot of views that are much closer to what a typical robot will see?
>> That's a good question. Here's my perspective. It depends on what type of information you're trying to learn from pre-training. In my opinion, the most important and most difficult information to learn is really decision-making: how are you going to handle something that is a bit out of distribution relative to what you've seen in your robot data? If you care about decision-making, and you care about modeling how the world works, where in this case you're maybe handling rigid-body objects and some foldable objects, it doesn't really matter whether this is captured in a video of a robot doing it, a human doing it, or even egocentric video. The physics of how cloth moves is going to be the same whether it's captured from a wrist camera or from a tripod. And for decision-making, whether it's a person flipping a box over or solving a puzzle, or a robot doing it, there's a lot of shared information. As long as there's shared information, it's something that can be transferred from pre-training. In terms of pure sample efficiency, there's certainly nothing better than having more task-specific robot data.
But if you care about cost efficiency, you can get a lot more internet-scale data, data that is egocentric or exocentric, and at scale that is very important, and the diversity is very important.
>> I have another point to add. I don't know whether you guys saw that meme on Twitter: at this year's GTC there were something like a billion companies selling egocentric data. I don't think it will be a bottleneck very soon. One point I also want to make is that I think there is one big difference between how we use video data and how some other companies use it. As far as I know, a lot of these companies collect video data specifically for some of the tasks, and they want to make sure the hands are always in view so they can track the hand and do VLA training and so on. But the goal of our video model is to learn general physics knowledge about the world, so any video is useful. The knowledge of how things move, that if I nudge this bottle it's going to fall, is shared across both humans and robots. So we don't have to specifically curate for egocentric videos, but we do also have a pretty decent representation of egocentric videos in the training data.
>> For pre-training I think it makes sense to have the model learn physics. But you are doing these industrial tasks, like here. Do you ever see objects that are far enough out of distribution that it doesn't quite work that way, and you would need a specific industrial task captured?
>> Well, to be honest, I've never seen one of these Contico containers in any of our pre-training videos, but you see enough general data, and...
>> Basically the argument is that with 300 years of data maybe you see everything, I guess.
>> I think there's where we are right now, or where we were a couple of months ago, and there's where we're going, and we certainly want to get much, much more data than that into our training pipeline. We're excited about this approach not only because of what we've shown already but because of the potential for expansion.
And I think one of the lessons in pre-training, post-training, and transfer is that if you want to do well at some type of task downstream, then you probably should have seen everything in some form in pre-training, whether that's similar objects, similar motions, similar environments, or similar physics. Certainly the models have the ability to compose some of these things. For image generation, the astronaut riding a horse on the moon was the classic example, because when DALL-E first came out, when the first image generation models came out, you could be pretty confident that no photo like that had ever been taken. But the model did have good knowledge of each of the individual components being composed. So here, ideally, we want to see all of the main components in some form, or at least the distribution, right? And I think that's what pretty much everybody is trying to do with pre-training in all of robotics.
>> I have a question about pre-training, especially since you mentioned that you pre-train from scratch. So you're training this, I guess, world action model from scratch, and I'm curious about that, because people train VLAs from a pre-trained VLM, and the VLM is pre-trained, so it captures a lot of semantic knowledge of the world, right? For example, if you give it an image of Donald Trump versus Taylor Swift, or Jensen versus someone else, Google always loves to do these kinds of demos, and the model knows all these things. I guess learning from video, you learn a lot about actions, about temporal consistency, about memory, and so on. So you learn all these things, but do you learn the things that are required to do reasoning?
I'm just curious: if you don't train on any of this multimodal data and only on videos of humans doing things, will you be able to do one of those demos where you give it a math problem and it solves it by choosing the right cube? There are some tasks like this these days. In that context, would this be important if, say, you want to scale beyond just performing a motion task, to really being able to think through things, or even, I don't know, read labels and act based on what the labels say?
>> Yeah. Maybe I can talk about what is possible, what is practical, and where we are now. Certainly one of the reasons we're excited about this causal video prediction objective is that to do very well at it, to be perfect at future video generation, really does require the model to have a very deep understanding not only of physics and 3D structure but also of reasoning and behavior. To predict both, say, a ball rolling down a hill and how it's going to continue moving, but also a person cooking, which objects they're going to reach for and what they're going to do next, certainly requires a very deep understanding of how humans behave. And of course you can extend that to reasoning: to predict a video of how a chess player is going to move the pieces requires a lot of intelligence. In many ways video generation could encapsulate a lot of the other modalities as well, including, for example, text generation.
Now, practically, I think that large language models and text are much further along in terms of reasoning capabilities, and we care about the goals here. While we bring attention to using video generation as our main objective, it doesn't mean that we're completely against using text.
Especially since most video models are text conditioned anyway, or are tightly integrated with language. So we absolutely expect to use language reasoning as a control signal for our models. There's no reason why we can't take advantage of any of the capabilities of today's powerful LLMs and VLMs.
But in addition to language reasoning, one of the exciting things that we have been exploring with our models is visual reasoning. By training on long videos during our pre-training phase, we naturally imbue our models with the ability to handle long context. So quite unlike most VLAs, which often have a context of only one or a small handful of frames, our models natively have several hundred frames of context. That allows us to do long-context tasks like this shell game here, where you need to track an object across many different swaps and then be able to identify which cup the object is under and what the object would look like. This is an example designed to show one of the benefits of using long context. It is, I would say, more of a simple demonstration; it's kind of a contrived example. But for a very real way that we use long context, maybe you can scroll down to the next example.
Sorry.
>> While he gets it back, how much robot data did that take? Oh, I was asking how much robot data that took, while he brings the website back.
>> Oh, for the shell game?
>> Yeah, for the for the cup game.
>> I think that every example shown in this blog used less than 20 hours of data.
>> So, makes sense.
>> I don't know exactly how much this one used, but at the point we published the blog, we had done no task with more than 20 hours of data. Here's a more practical example where you would use long context. This is a returns-processing task with many different steps, and some of the steps are visually ambiguous.
>> I think the easier way to explain this is that it involves unfolding the t-shirt, inspecting it, then refolding the t-shirt. When you unfold and refold the t-shirt, you enter visually very similar states, and if you don't have long-context memory, you will be confused about which step you are at right now. For example, we can jump to here. Once the robot has folded the t-shirt, the next step should be to put the t-shirt into the bag. But because this model has a very short context memory, it forgets where it is in the task and starts unfolding the t-shirt again. So the task fails. This is the basic idea, right?
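Editor's sketch: the t-shirt failure described above is a state-aliasing problem, and a toy example makes it concrete. This is purely illustrative and assumes a hand-written rule in place of the learned video model: observations are the strings "folded" and "unfolded", and `next_step` is a hypothetical function that only sees the last `context_len` observations.

```python
from typing import Sequence


def next_step(history: Sequence[str], context_len: int) -> str:
    """Choose the next step of the returns task from a limited observation window.

    Task order: unfold -> inspect and refold -> bag.
    A folded shirt looks the same before unfolding and after refolding,
    so only the history can tell the two situations apart.
    """
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    window = history[-context_len:]
    current = window[-1]
    if current == "unfolded":
        return "refold"
    # current == "folded": ambiguous unless the window shows it was already unfolded.
    return "bag" if "unfolded" in window else "unfold"


# The shirt has been unfolded, inspected, and refolded.
history = ["folded", "unfolded", "folded"]
short_memory_choice = next_step(history, context_len=1)  # sees only "folded"
long_memory_choice = next_step(history, context_len=3)   # sees the whole episode
```

With a one-frame window the policy picks "unfold" again and loops, which matches the failure in the demo. With the full history it picks "bag". The learned model is described as resolving this natively through its several-hundred-frame context rather than through an explicit rule like this one.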
>> Yeah. So this is... >> Yeah, this is a really cool example task.
>> Yeah, I think this is actually a very common case in many tasks, many long tasks: you need to remember where you are right now. You can definitely use some hierarchical structure to make this happen, but because our model natively supports long visual memory, we just don't need to do anything special to support this feature. Everything happens natively.
>> I guess we're deviating a little bit from the topic. D was asking about reasoning. Yeah, I think the VLM does have good semantic reasoning: it knows what object is what and can reason about the process of doing things, but it doesn't have spatial or dynamics knowledge. I think that's just fundamentally lacking in language-driven training, and by doing next-frame or next-video prediction, the model naturally has a better understanding of physics, space, and dynamics.
>> Yeah.
>> And also I think the two abilities are complementary. It doesn't mean we just do spatial reasoning. To me, the endgame is a multimodal model: you have the semantic reasoning from the language models, but you also have the spatial and dynamics reasoning from these video models.
>> I didn't have a question, but I have just a quick question. I noticed that the arms you guys are using are the same as Genesis, so they have pretty good force-torque sensing. Are you guys using the force-torque sensors in any way for some of these tasks?
>> Yes, all the models have the joint torque as input.
>> Okay.
>> So, that's just the robot data.
So is this for the inverse dynamics model, or do you include that in post-training for the video models as well?
>> That is input for the video models as well.
>> Yes. Both our video model and inverse model take robot proprioception as input, which includes the torque sensing.
>> Would you say that helps quite a bit for some of these tasks, the force-torque specifically? I mean, proprioception I get, but the force-torque part: do you think it helps quite a bit with some of these tasks?
Yeah. Sorry. I would assume that even if we did not have the force-torque inputs, the model would still be capable of doing the tasks. The evidence for that is that teleoperation does not have force-torque inputs.
>> We use VR, and the teleoperator can't feel how much force they're applying. So if the teleoperator doesn't need to feel how much force they're applying in order to do the task, then you should be able to do that with the model as well.
We do give those inputs, but I wouldn't say confidently that this is a critical piece in our current performance.
>> Got it. Yeah.
In the interest of time, one of the important messages we really want to get across in this blog is how data-efficient this model is. This is in our title and is also one of the important sections of the blog post. We showed two industry tasks: one is decanting, the other is container breakdown. For both tasks, we used only 10 to 20 hours of robot data in total, and this was the first time the video model was post-trained for this embodiment. With this amount of data, the model was able to pick up these very long-horizon, complicated tasks and run continuously for hours. This is huge, especially for industry deployments. We are talking about running in industrial environments, meeting a 99.9% success-rate KPI requirement, and really delivering real value to customers.
Yeah, I think maybe it's a little hard to conduct an apples-to-apples comparison to other models, but what I have heard is that many VLA models require more than 1,000 trajectories to train, while our model is trained with only hundreds of trajectories and it's pretty robust.
>> So this is 99% success. Is this the 99% success rate task?
>> It depends on how you define success. If the definition is that you just complete the task without getting stuck, I think our model is even above that success rate, but our model still makes some small mistakes.
Yeah. So it depends on the definition.
We were able to take... well, nothing unrecoverable, I guess.
Yeah. I guess for both tasks we actually took it to a customer site and ran a proof of concept. The customer has a defined KPI for what they can accept, and our model was able to meet these KPIs at a production standard, continuously for hours.
Yeah, I was actually going to say something. I know somewhere in this blog post you guys have this human demo, which is enabled because you have the long context. I think that was really impressive. >> I think it's... yeah, I want to make sure that you guys show this as well.
>> Could you talk us through this a little? Because this is not like collecting 10 to 20 hours of robot data. This is like just showing it one time and then the thing does it by itself. Could you talk us through this a little bit?
Well, I guess, yes, a typical way of learning a new task is always doing teleoperation: collect a lot of data and then train the model to do that task. One thing that's new with video models is that the video model can learn from a video demonstration in context, right?
If the model has the ability to learn from the video demonstration in context, then it can pick up a new task very quickly. And this is what we want to show here. It's kind of a proof of concept. We did collect some paired data of human demos and the following video. But all these test cases are new.
There are different objects, different configurations of bins, and different object positions and arrangements.
But we show that the model is able, just by looking at the video of how a human does the task, either picking up an object and putting it in a bin or, later, doing a drawing, to follow that and generate the actions correctly.
>> Yeah, I think a requirement of this kind of capability is that your model should be able to consume very long visual history. Otherwise you are just not able to do this, because not everything can be described in text.
Actually we have another example here.
So basically the human draws something and the robot draws the same thing on the whiteboard. This is something you just cannot describe in language. You have to achieve this with visual memory. Yeah.
>> So the input to the model in this demo is the video of the human's action, or what is the input? >> Human video, not human action. Yeah.
Yeah. It will stay in the context. Yeah.
>> I see. So you collected maybe some primitive motion demonstrations, but then you're using the video to steer the policy towards doing this, I guess, in that sense. Okay.
>> I love this example. So did you specifically train for this, or is this just an emergent property of training on your many hundreds of years of data?
>> No, we specifically trained for both of these. So it's >> not an emergent property, >> but we feel that >> it necessitates the pre-training in order to do well.
>> But all those test cases are new. They are not in the training data; they are just generated by the user on the fly.
>> I think this is really cool. People have been trying to steer pre-trained models, and steering with video is very interesting. Yeah.
>> Can I ask why it is not erasing the whiteboard?
>> Good question. I think initially for data collection we did try to erase it, but for some reason later it was just... maybe it was because it's too long and was out of context, so we didn't erase. Yeah.
>> Yeah, I think we pretty much covered all the important capabilities here: data efficiency, memory, interpretability. Yeah.
>> Yeah, that's one thing. I know you guys also cover what I think you call leapfrog inference, because practically, >> you know, video generation takes time. I think you have this very full diagram that shows it. Could you talk us through a little bit how you guys practically do it?
>> Yeah, sure. Okay. So the underlying reason for the entire design is basically that model inference takes time. If model inference happened instantly, we wouldn't need this whole design at all. But it just takes time.
And while your model is doing the decision making, your robot still needs to execute some action, right? It should not just sit there and wait for your inference. That's why we first need to overlap the model inference time and the action execution time. What happens here is... maybe let's look at this block. Here is a robot action chunk, and it will be executed between t=1 and t=2. At the same time, our video model is making a prediction for this vision chunk, and this vision chunk becomes the input to the inverse model, whose output is this action chunk. So you will see that while this action chunk is being executed, our entire system is making the prediction for the next action chunk. Once these actions are finished, the next action chunk is ready. This is how the system works; in other words, it is asynchronous inference. And we did something special here: we use the action chunk that is being executed at this moment as the input to our entire system. Basically, the video model knows which action is being executed right now, and the inverse model also knows this. By doing this, we avoid different inferences making very different predictions.
For example, say we have multiple objects on the table. The model can decide which object it wants to pick up. But in two consecutive inferences the video model may make different predictions: it may try to pick up different objects, and this results in oscillation during action execution. What we do here is use the current action as input to the video model, which tells the video model the decision it made last time, so this time it should try its best to make a consistent decision. That's how we resolve the oscillation problem. That's the basic idea, and we call it leapfrog inference. Does this make sense to you?
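The scheduling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speakers' implementation: `video_model`, `inverse_model`, `execute`, and `leapfrog` are hypothetical stand-ins, "frames" and "actions" are plain integers, and a single worker thread plays the role of the inference hardware. The only points it aims to show are that inference for chunk k+1 overlaps execution of chunk k, and that the chunk currently executing is passed to both models as conditioning.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Chunk = List[int]


def video_model(observation: int, executing: Chunk) -> List[int]:
    """Hypothetical stand-in: predicts future frames that continue from the
    end of the chunk being executed, which keeps consecutive plans consistent."""
    start = executing[-1] if executing else observation
    return [start + i for i in range(1, 4)]


def inverse_model(frames: List[int], executing: Chunk) -> Chunk:
    """Hypothetical stand-in for inverse dynamics: predicted frames -> actions."""
    return list(frames)


def infer_next_chunk(observation: int, executing: Chunk) -> Chunk:
    frames = video_model(observation, executing)
    return inverse_model(frames, executing)


def execute(chunk: Chunk, state: int) -> int:
    """Toy robot: the state simply follows each action in the chunk."""
    for action in chunk:
        state = action
    return state


def leapfrog(num_chunks: int, state: int = 0) -> Chunk:
    executed: Chunk = []
    current = infer_next_chunk(state, [])  # first chunk has nothing to overlap with
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(num_chunks):
            # Start inference for the next chunk, conditioned on the chunk
            # that is about to run, then execute that chunk in the meantime.
            future = pool.submit(infer_next_chunk, state, current)
            state = execute(current, state)
            executed.extend(current)
            current = future.result()  # ideally ready when execution ends
    return executed


print(leapfrog(2))
```

In this toy, conditioning on the executing chunk is what makes each new chunk pick up where the previous one ends; without it, each inference would plan from a stale observation and could contradict the plan already in flight.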
>> Yeah. I'm curious... I mean, right now in most of these tasks the environments are pretty static.
So I'm wondering, let's say you're operating in a multi-agent setting, where there's another robot that's also moving around, or humans that are moving around. >> How do you think this will behave? Because I imagine with video prediction we almost have to predict what the other guys are all doing. The craziest case is maybe a team sport, right, like a football match. >> Is it just a matter of squeezing the robot actions to be very, very short, so that you don't go super out of distribution? If you're playing against, I guess, Messi, the fellow was in front of you and, you know, half a millisecond later the guy is gone.
>> Yeah.
>> Yeah. So maybe I can comment on this.
>> Mhm.
>> In general there's a trade-off between how accurately you predict into the future and the latency at which you predict into the future. If you have a very poor future prediction, you might not get stability if you predict very far into the future, but you can compensate by running that prediction very quickly. For example, in classical control you can make very strong approximations in your dynamics, but as long as you run with a very tight closed loop, that can still give you stable behavior.
On the other end of the spectrum, if you have a very accurate future prediction, then you can tolerate a longer horizon, almost open loop. If you really could perfectly predict how the future is going to evolve, then you would only need to make one inference and then execute that forever. In practice we try to make as good a future prediction as possible, but there are still many things that are simply infeasible to predict. For example, the behavior of external actors coming in and moving objects around, or things falling, which is chaotic.
It's very difficult to predict exactly where the objects are going to land. Because of that, we want to make as good a prediction as possible but also do so with reasonable latency. How quickly you need to predict will depend on the task. If you're dealing with static objects, then you can probably tolerate a longer horizon. If you are dealing with dynamic objects, or you have a much higher precision requirement, then you need a shorter horizon. Rough ballpark numbers are that human reaction time is about 200 milliseconds, shared across all humans, from myself to a professional baseball player. If you're playing baseball, you actually need to make a prediction for how you're going to swing the bat almost just as the ball is leaving the pitcher's hand. So as long as you can stay within, say, a couple hundred to a few hundred milliseconds, you're somewhere in the ballpark of human reaction time, and if you can go faster than that you can be even better. And if you are much slower than that, then you may be limited in terms of tasks. That's one of the big motivations for our system, and why we had to design and architect our video model from scratch, with very different design decisions than are made in typical text-to-video models, because we needed it to be good enough but also fast enough to run multiple times per second.
And that influenced a lot of the decisions we made along the way. We think these models have to be designed in an almost fundamentally different way if you want to make them effective for robotics.
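The accuracy-versus-latency trade-off can be made concrete with a toy one-dimensional system. This is an illustrative sketch only: the dynamics, the constant `drift` term standing in for model error, and the function name `max_tracking_error` are all assumptions, not anything from the speakers' system. The planner believes there is no drift, issues one corrective action when it replans, then holds until the next replan.

```python
def max_tracking_error(replan_every: int, drift: float, steps: int = 100) -> float:
    """Worst distance from the target (x = 0) after the initial transient.

    True dynamics: x_next = x + u + drift.
    The planner's model ignores drift, so its error accumulates between replans.
    """
    x = 1.0
    worst = 0.0
    plan = []
    for t in range(steps):
        if t % replan_every == 0:
            # Replan from the observed state: one corrective action, then hold.
            plan = [-x] + [0.0] * (replan_every - 1)
        u = plan[t % replan_every]
        x = x + u + drift  # the real world includes the unmodeled drift
        if t >= replan_every:  # skip the first chunk (initial transient)
            worst = max(worst, abs(x))
    return worst


for horizon in (1, 5, 10):
    print(horizon, max_tracking_error(horizon, drift=0.01))
```

With an imperfect model (`drift=0.01`) the worst-case error grows roughly in proportion to the replanning interval, so a tight loop compensates for a poor prediction. With a perfect model (`drift=0.0`) the error stays at zero for any interval, which corresponds to the "one inference, then execute forever" limit mentioned above.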
>> Yeah, I think that's also a reason why we chose to pre-train the video model from scratch ourselves: we can have tighter control over all the components, and it becomes much easier for us to optimize everything. Yeah.
Could you... I don't know if this is too much detail, but I'm curious what kinds of architecture changes enable you to run this really fast. Or is that going to stay a trade secret for now?
>> I think we might reveal more later on.
>> Okay. Later on. Okay. We'll keep our eyes peeled for that.
>> Yeah. But I think a very important piece is that your video model must be causal, because if your video model uses bidirectional attention, then you are just not able to reuse the KV cache and you need to recompute everything from scratch every time you generate. I think this might be the most important design choice. But there are not so many causal video models out there, so we have to train our own.
>> I guess people try to distill a bidirectional video model into a causal model, but I think the resulting quality does not seem to be as good. We showed that for these autoregressively generated videos we can have a few-minute-long video generation without collapsing, but if you look at some of these distilled causal models, they can drift very quickly if they just keep generating. So I think that's another huge benefit of doing causal modeling during pre-training instead of doing bidirectional prediction and then distilling into a causal model in a post-training stage.
>> And what you're talking about here is people like DreamZero, for example, who are building off of Wan, right?
>> I would say that... I think the truth is, you know, Wan is a very impressive model, but it was developed first for text-to-video generation. They prioritize different things.
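The KV-cache point made at the start of this exchange can be checked with a small pure-Python sketch. It is a simplification under stated assumptions: a single attention layer, queries, keys, and values all equal to the raw frame vector (no learned projections), and hypothetical names `attend`, `causal`, `bidirectional`, and `CachedCausal`. It shows that with causal attention the outputs for earlier frames never change when a new frame arrives, so cached keys and values stay valid, whereas with bidirectional attention every earlier output changes and must be recomputed.

```python
import math
from typing import List

Vec = List[float]


def attend(query: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """Scaled dot-product attention for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def bidirectional(frames: List[Vec]) -> List[Vec]:
    """Every frame attends to every frame, including later ones."""
    return [attend(f, frames, frames) for f in frames]


def causal(frames: List[Vec]) -> List[Vec]:
    """Frame i attends only to frames 0..i (full recomputation)."""
    return [attend(f, frames[: i + 1], frames[: i + 1])
            for i, f in enumerate(frames)]


class CachedCausal:
    """Causal attention that appends to a key/value cache instead of recomputing."""

    def __init__(self) -> None:
        self.keys: List[Vec] = []
        self.values: List[Vec] = []

    def step(self, frame: Vec) -> Vec:
        self.keys.append(frame)
        self.values.append(frame)
        return attend(frame, self.keys, self.values)


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
model = CachedCausal()
incremental = [model.step(f) for f in frames]

print(incremental == causal(frames))                          # cache matches recompute
print(bidirectional(frames[:2])[0] == bidirectional(frames)[0])  # earlier output changed
```

In a real multi-layer model the saving comes from not recomputing the key and value projections, and the deeper-layer activations, for past frames; this toy only demonstrates why the cache stays valid in the causal case.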
>> Very pretty pixels, not accurate pixels.
>> I would say in no way are our models more aesthetic than Wan. We operate at lower resolution. Things are worse in terms of, say, pixel sharpness. But the benefits that we do get >> are in speed, and we care a lot more about things like structure and decision-making. That's what we have optimized for.
>> I really want to ask more about how you actually get it to care more about structure, about what your losses or architecture are, but I won't push on that. It's clearly not the right thing to ask right now. [laughter] I did want to ask one other pre-training question though, if we have time for another question. There are all these ways that pre-training is a big advantage, but you're also talking about things like force and torque and proprioception being introduced later, so those are obviously not part of pre-training. Do you see a world where at some point that's going to be important? When do you introduce the more robot-specific stuff, like which action you're going to do, and how general is that across tasks? Is that part like mid-training, or... I don't know.
>> Yeah. So I'd say that there's what you would like to use in an ideal world and there's what's practical. To be clear, we also want to collect more robot data. Getting more robot data is definitely a valuable source, both for having task data and for having other sensor data, interaction with the environment, and so on. So don't consider us to be trying to use no robot data at all. We're trying to use as much robot data as we can, and also a huge amount of internet-scale video to augment that. I think that most people, even folks who are using primarily teleoperation or primarily, say, collection approaches, would also tell you that web-scale components, whether that's the language models or image representations or other pieces of the network that have had the benefit of web-scale training, are still immensely valuable. So in a perfect world we would actually have internet-scale robot data with all the sensors and interactions.
We'll get as much of that as we can, but practically, for most of our data we only have the video streams. You could have, say, a million people going out and collecting robot data for you, but it's still difficult to compare that to 7 or 8 billion people on earth always collecting more video data. So no matter what, we will always have more video data than robot data or physical interaction data. And luckily, most behavior can actually be captured in the form of video. Even lots of force data can be captured in the form of video. For example, our grippers here deform as they apply more force, because they're somewhat squishy. That means the amount of force you're applying is visually apparent. Also, kind of a side effect of the specific arms we're using is that they have a torque sensor in every joint, and each of these blue rings illuminates according to how much force is being applied. That was not something that we intentionally put in, but it does mean that some of the force information is apparent in the video stream as well.
Also, I guess even if we only have a very limited amount of robot data and proprioception data, in practice we see that with conditional generation, conditioning on these signals to generate the video, as long as there is enough correlation between the conditioning signal and the output, the video model can pick that up very quickly. So even if there's not so much data, in practice there's not a learning issue in picking up these actual sensor modalities.
Yeah, to rephrase what John said, >> making effective decisions is a very difficult problem.
>> Correlating video and force, or video and tactile sensing, is something that's much more constrained, and it can be picked up from much less data.
>> I have a question. I think the website mentions pre-training on 300-plus years of web-scale video data. So is the pre-training video data actually just internet-scale data, meaning, I don't know, cats-playing-with-balls kind of data, or is it egocentric, human-doing-tasks kind of data? Because here it says 300 hours, so are we talking about YouTube videos, cats-and-dogs kind of videos, or are we...
>> The vast majority of the data is general internet video data of people, of cats, of any type of thing that you could imagine. So...
>> We're not just training from data or...
>> They do, like, video data, but in the end, you know, there are a lot of people wearing caps and putting cameras on their hands and collecting that data and calling it video data. Because I think video data, internet-scale data, should be, you know, cats-and-dogs kind of data. I don't know whether Chris and Michael agree, but that kind of data should...
>> Well, Michael collects cap data, so maybe you should watch your words there, Jeff.
>> Yeah, my data. [laughter] >> Yeah, to be clear, we definitely believe egocentric data is very good, but we still utilize general web videos.
>> Well, yeah. Maybe I should say something, since you guys don't know. I don't want to show anything that I'm doing, but, the way I look at things, as a layman, I think it will help to have some ego data, because ultimately, if you apply it on a robot, especially a robot that moves around, the only vision you have is a camera that you stick on the head of the robot. I say that because, you know, I watch a lot of football. I'm a huge Liverpool fan, right? And I realized that watching football on TV, where the camera is looking down at the field, versus being on the field is completely different. Okay, partially because my body is breaking down already, so I definitely cannot play, but I think where you put the camera intuitively, to me, makes a difference. But I could also imagine a case where you have so much data coming from all sorts of different views. Obviously, it doesn't have to be just ego. In fact, maybe only 1% is ego.
The rest, like 99%, is from all sorts of cameras. You will probably just generalize to every view anyway. So maybe my own personal opinion is only what I learned in my lifetime, so maybe it's too small a sample size. Clearly this is already 300 years, right? It should be thousands of years, maybe even millions of years. I don't know whether we have millions of years, but maybe we already do.
>> I'm sure millions of years.
>> Yeah. Yeah. Public.
>> One thing I'll say is that training on web-scale data, even all the YouTube data, doesn't mean you have to sample uniformly from the data distribution, right? You can still bias the data distribution, or do weight adjusting, towards the videos that you think are important. This is still very much an active research project that we're exploring here, but yeah.
>> Also, you have both... oh, I was going to say both ego- and exocentric data, right? Because you've got the end-effector cameras and you have the third-person view.
>> So on this topic, a super random question. Do you guys think that, say, if someone since birth started wearing an egocentric cap and the thing on their hands, maybe in just 20 years you would get enough relevant data to train human-level dexterity and manipulation? I mean, think about it: if it's just 300 hours of internet-scale data then maybe not. But if you have the whole experience of a person until the age of 20, just capturing videos, would that be enough to train a generalist model that can manipulate and interact with the world? Just a random question.
>> I think one important thing to keep in mind is that it's hard to quantify what is good enough. Are language models today good enough, or are you always trying to push further in terms of capability? I think that, especially now with a lot more egocentric data collection, the Meta Ray-Bans and other types of glasses with cameras in them, there's going to be a lot of passively collected egocentric data in the coming years. I think there's going to be a lot more robot collection and handheld gripper collection as well. But how much is good enough? In general you're always trying to push towards higher generalization, more complex tasks, and better decision-making in these models. What that means is you're always going to be trying to gather more data as well. So no matter how much data you have, you're probably also going to benefit from having additional data outside of that.
One important thing to keep in mind as well is that generalizing from only one single case typically does not work very well. Generalizing from only one camera view to a completely different camera view may not work very well. Generalizing from one single environment to a completely separate environment often does not work very well, or at least you should be surprised when you see positive effects. Having a lot of diversity in the data matters. No matter what type of task you're working on, robotics, computer vision, language, no matter what application, diversity is even more important than scale, and seeing lots and lots of things is a critical component if you want generalization.
So can I ask how you quantify diversity, or is there any way to quantify it in your mind? What makes good diverse data for your application, or for any application? Yeah.
So one thing I like to think about is how much model capacity you need in order to model a data set. I think people talk about data sets in a way that often makes assertions which are not completely true. It's not always true that a higher-dimensional data set is more complex to model, or that a larger data set is more complex to model. Let me give you an example of that.
An image data set that is very high-dimensional and very large is one where you took pictures of the faces of everybody on Earth and tried to generate images. But it turns out this is actually one of the relatively easier data sets to model, because it does not have that much diversity in it. So when we talk about how much capacity you need for a generative model, I like to think about it as: what is the complexity of the underlying data manifold? With that perspective, diversity is a really key part of the data set. If you have a million hours of doing a single task, that's very different from a million hours where every sample is something completely different. In terms of diversity, I'd say that we use a lot of the techniques that are shared among other types of generative models, from deduplication to trying to extract features and cluster data. I don't think I'll go too much into the details here, but I imagine it's also quite similar to how people are curating and analyzing data sets across any other domain.
So fundamentally, I guess, a good diverse data set in your mind for pre-training is just whatever is hard to model because of diversity, so you have diversity increasing your final training loss or something like that. So I guess there's no real metric? Am I understanding right, or trying to boil it down into...
>> Well, I think you can develop proxy metrics for how diverse your data is, but in general more diverse data is more useful and more difficult to model, or requires >> yeah >> you know, more model capacity.
Yeah, I guess maybe to wrap things up here: what's next? I know you guys briefly mentioned it in your blog post as well. So what's next?
>> Yeah, to be honest, we're just getting started on this. We have spent a lot of time building out model infrastructure and the training pipeline and getting models that are fast enough and strong enough to run on the robots. Everything that you see in the blog post was essentially our first attempt: the first model that we were confident was fast enough and strong enough to run on a real robot. Since then we've been putting a lot of effort into improving our pre-training, getting more powerful models and longer context, and adding more capabilities, from different ways to control the models, to improving reasoning and planning,
to working on more dexterous control as well. We also have a hardware effort that is a bit earlier but that we think is going to be an important component.
>> I think another thing that definitely deserves attention is evaluation. I don't think any robotics company or robotics lab has a good benchmark or standard for evaluation. This is something we are also putting a lot of effort into, and a very good evaluation definitely makes iteration much faster.
>> Are you talking about evaluation in sim? >> In general, including sim, world model, and real-robot evaluation and operations.
>> Yeah, I would say video models for robotics are just getting started. There's so much more to be explored: the different research projects that Eric mentioned, scaling, how to push the video model to the next level, and also multimodal models, for example combining the semantic reasoning ability of language models. So there are a lot of things to be explored. One more thing I want to mention is that we are also hiring. Whoever is listening, if you're interested, please apply on our website.
>> Yeah, of course. [laughter] Well, firstly, thank you so much for coming on the pod and for putting out this big blog post. When I first saw it, I was like, oh damn. The one that impressed me the most was the one-shot human demo thing. It really showed the long context.
>> Yeah, I thought this one was really cool. >> Yeah, maybe as a final question, we always ask this to wrap the whole episode up.
Are there any papers that you guys read recently that were really interesting and that you want to give a shout-out to? It doesn't have to be, you know, video generation. It could be anything related to robotics that kind of piqued your interest.
>> Yeah. Yeah.
>> I think that in generative models I actually quite liked the JIT paper.
It's a very simple idea. It's the type of idea where I think everybody feels like, oh, I should have just thought of that myself. But it's thoughtful and well analyzed, and I'm sure it's going to be very impactful going forward in terms of trying to analyze what these models are good and bad at.
>> Uh DJ, how about you?
>> Um maybe you go first.
>> Yeah. Well, I feel like uh I haven't read so many papers recently, but I feel like the JIT paper was the last one we did a deep dive on. I also really like this one. And also, you know, uh another work of ours that, you know, I wanted to show earlier, the latent uh forcing paper, was built on top of this paper. That was also uh a read. Um yeah, I think so far my favorite one in the past 12 months is the LBM paper from TRI. Maybe that's not very new, but I think I always get something new from uh reading it.
>> Yeah. Okay. Cool.
Uh yeah, I guess uh thank you so much again for making the time. Um thank you for all the sharing really impressive work and yeah for those listening in looking for a job maybe could you know could apply uh and and join the team. Um yeah I think looking forward to your next big release. Um thank you so much for making the time again. Yeah really really enjoy this uh this whole episode and all the sharing. Yep.
>> Yeah sure. Thank you so much for having us all your
>> Yeah. Thank you. Bye.
>> Yeah. Yeah. Of course. Yeah. Yeah, it's uh it's definitely our pleasure. Yep.