Ted Xiao provides a definitive roadmap of robot learning, masterfully distilling a decade of complex breakthroughs into a clear, three-act evolution. It is an essential synthesis for anyone trying to distinguish fundamental structural shifts from the noise of current AI scaling.
Ep#78: Three Eras of Robot Learning
Hey everyone, welcome to another episode of Robot Papers. Today we are super glad to have Ted Xiao joining us. Ted was a long-time member of Google DeepMind.
>> Yeah, finally, it's good to get Ted on. I don't want to put Ted on the spot, but honestly, Ted is the main guy who got me this deep into AI research. So thank you, Ted, if you still remember years ago when we met at CoRL. Thank you so much for making the time. I know you prepared some slides, but for those who don't know you yet, maybe you can share a little bit about yourself first and then go through your slides.
>> Yeah, of course.
Yeah, thanks for the kind intro yet again. I clearly remember New Zealand in 2022, when this guy Michael just shows up and we started chatting, and I think the world has changed so much since. I'm so honored to be on the show. Briefly, I was at Google DeepMind for a bit over eight years, where I started out working on end-to-end robot learning. Today's brief overview is a journey through how robot learning has changed. A lot of it is from my perspective of seeing the different trends and ideas that were rising and waning. It's been such a wild journey over the last decade, and even in the last few years or few months the world of robotics and AI has become unrecognizable.
I've worked on quite a few different topics, and throughout it all, given what inspired me and many others to join the field of physical AI, it's so beautiful to finally see some of what was once unattainable actually coming to life. Even the scrappiest demo video that we see today would have blown all of our minds just a year or two ago; we would not have believed that this would ever be possible. So I think we're very fortunate to be at this point in time, where we're pushing on such cool technology together as a community. But happy to jump right in.
>> Yeah, let's go.
>> So, going back to the beginning: in 2015 and 2016 there were these really cool papers, DQN and AlphaGo, which showed the power of applying end-to-end, data-driven methods using deep reinforcement learning to solve Atari video games like Pong or to master the game of Go. At that time this was so impressive, because the generality of these methods that can improve with their own experience was really phenomenal. And then there was the kind of crazy idea: when you look at robotics, could this also be an environment, a Markov decision process, that could be solved via the same data-driven methods?
If you looked at the state of the world at the time, there had already been very performant hardware systems. What we see here are videos from Stanford from decades ago. This is teleop and sped up, but these were bimanual mobile manipulation systems that humans could control to do a variety of household chores. So the question became: what if we married the kind of algorithms that were solving video games and Go to this real-world system, where the bottleneck at the time seemed to be entirely the human intelligence that had to remote-control these systems?
That was what got me into robotics in the first place. There were so many great collaborations and groups at Berkeley at the time, where I was a student, like Pieter's lab and Sergey's lab, where some of these ideas were still starting to form. And that's when I first joined Google Brain, on this new robotics team of fewer than 20 people that was just trying out: what if we just did the thing? What if we just tried to collect in one place an arm farm of 10 KUKA robots and have them pick stuff up 24/7?
And we just tried to do online real-world reinforcement learning. What might that look like? So I'll cover roughly three periods of time in which I think a lot of the energy and excitement in robotics went through different phase shifts. We start out with existence proofs. We'll talk about what foundation models mean in robotics. And I think we are now clearly in an era of scaling that goes up to today. I'll talk through some of the top hits from each of these eras, and we can see what has changed: some things will still feel familiar today and some things will look very archaic.
In the first era of existence proofs, I would say this started out with the arm farms that I mentioned, which brought me to Google Brain and I think brought many others who were interested in this crazy idea. This isn't traditional industrial manufacturing with modular TAMP (task and motion planning) systems. This was truly end-to-end learning, where we just have robot arms in a room 24/7 trying to pick stuff up by trial and error, collecting an offline data set that's continually growing from online experience, and seeing whether we could do policy learning from that experience. This scaled: what if we have multiple tasks that we want to learn? And later on we expanded beyond the bin, beyond the lab.
In the first era, there are some works I'd like to highlight. The most famous one from our group at the time was called QT-Opt. The difference between robotics and standard Go or video games is that it's a very high-dimensional, continuous action space system. Robotics is hard. It's messy, right? Image observations are coming through at relatively high frequencies, and the dimensionality of your state and action spaces is just way larger. How do you adjust the value-based RL methods that worked for Atari and for Go, and make them amenable to this kind of large-scale robot learning? QT-Opt was one way we did that, where we used the cross-entropy method as an approximation for maximizing the Q-value during your Bellman update. This of course did not happen as just an algorithmic innovation. There's also a whole system that you need to set up from scratch. If you're going to do this 24-hour arm-farm style of data collection, you need to set up a good evaluation system.
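The cross-entropy method step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the QT-Opt implementation: the function names (`cem_argmax`, `toy_q`, `bellman_target`), the population and elite sizes, and the toy Q-function are all assumptions made for the example. The idea is that with continuous actions you cannot enumerate the max over actions in the Bellman target, so you sample candidate actions, keep the best-scoring ones, refit a Gaussian to them, and repeat.

```python
import random


def cem_argmax(q_fn, state, action_dim, iters=3, pop=64, elite=6, seed=0):
    """Approximate argmax over actions of q_fn(state, action) with the cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * action_dim
    std = [1.0] * action_dim
    best_action, best_q = None, float("-inf")
    for _ in range(iters):
        # Sample candidate actions from the current diagonal Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Score every candidate with the Q-function and sort best-first.
        scored = sorted(((q_fn(state, a), a) for a in samples),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_q:
            best_q, best_action = scored[0]
        # Refit the Gaussian to the elite (top-scoring) candidates.
        elites = [a for _, a in scored[:elite]]
        mean = [sum(a[i] for a in elites) / elite for i in range(action_dim)]
        std = [max(1e-3, (sum((a[i] - mean[i]) ** 2 for a in elites) / elite) ** 0.5)
               for i in range(action_dim)]
    return best_action, best_q


def toy_q(state, action):
    """Hypothetical Q-function for illustration: highest when the action equals the state."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def bellman_target(reward, next_state, done, gamma=0.9):
    """Target r + gamma * max_a Q(s', a), with the max approximated by CEM."""
    _, max_q = cem_argmax(toy_q, next_state, action_dim=len(next_state))
    return reward + (0.0 if done else gamma * max_q)


if __name__ == "__main__":
    action, q = cem_argmax(toy_q, [0.5, -0.3], action_dim=2)
    print("approximate best action:", action, "Q:", q)
```

In a real system `q_fn` would be a learned neural network evaluated in batch; the sampling loop is the only part this sketch is meant to show.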
You need, of course, good scaling in the real world, and you need a good control stack. One part of this, for example, on the right here, was thinking about how we move away from this think-pause-observe loop, where you bake in network inference and the world pauses, then you need to recompute and rerun inference, then you execute again and pause again. We want a world where things are smooth and fast, so you actually have to think while acting. This is what we call concurrent RL.
There were also challenges in evaluation, of course. This was back in the day when we were maybe still in the early days of Bullet, before a lot of the latest and greatest GPU-accelerated physics sims, but we still wanted to leverage simulation for data generation, for co-training, or for evaluation. To do this we wanted to close the domain gap, so that policies, whether they were trained on sim or real data, could still run inference in a real-looking world, and if you dropped them into the real world there wouldn't be a huge domain shift, a distribution shift in the kinds of images they saw. So we trained a CycleGAN to turn simulated images, like the ones you see on the left here, into realistic images like the ones on the right. This was when GANs and style transfer in computer vision were the best image generation approach.
So that was a fun four or five years of really getting these systems working at scale. What that led to was: okay, this end-to-end robot learning kind of works, what's next? And what's next was going beyond one single task of arbitrary pick-and-place: can we actually do multitask learning in more complex scenarios? This was an era where the QT-Opt system, the scaffold, was solid, and you could try out a lot of interesting research. We could start exploring imitation learning with BC-Z, the first large-scale multitask language-conditioned imitation learning policy. There was a lot of interest in combining the benefits of imitation learning with reinforcement learning fine-tuning. You want the bootstrapping from expert data with behavior cloning, but you also want the properties of reinforcement learning, where you improve with online experience. There's a whole line of work on that.
There was the idea of whether we can actually put multiple skills into one single set of neural network weights, with MT-Opt, the multitask extension of QT-Opt. And of course there was the idea of what it would look like to really scale to a truly unbounded set of primitives and behaviors. This is work called Learning from Play, where we took the idea of hindsight experience relabeling, goal-conditioned relabeling, where you take rollouts, take the goal that ended up happening, and pretend that you meant to do that from the start. We took this to the limit with play data, where human experts just collect arbitrary useful data without any clear short-horizon goal in mind, and then we relabel that using latent plans. This was a very fun few years of exposure to a lot of new ideas.
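The hindsight relabeling idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: states and actions are plain Python values, the "goal" is just the last state a segment reached, and the function names are hypothetical. It does not model the latent-plan component of Learning from Play; it only shows how unlabeled play data turns into goal-conditioned training examples.

```python
from typing import Any, List, Sequence, Tuple

Example = Tuple[Any, Any, Any]  # (state, goal, action)


def relabel_with_achieved_goal(states: Sequence[Any], actions: Sequence[Any]) -> List[Example]:
    """Pretend the final state of a rollout was the intended goal all along."""
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    goal = states[-1]
    return [(states[t], goal, actions[t]) for t in range(len(actions))]


def relabel_play_windows(states: Sequence[Any], actions: Sequence[Any], window: int) -> List[Example]:
    """Slice a long play trajectory into windows and relabel each with the state it reached."""
    examples: List[Example] = []
    for start in range(0, len(actions) - window + 1):
        segment_states = states[start:start + window + 1]
        segment_actions = actions[start:start + window]
        examples.extend(relabel_with_achieved_goal(segment_states, segment_actions))
    return examples


if __name__ == "__main__":
    # Toy play data: integer states and letter actions, purely for illustration.
    print(relabel_play_windows([0, 1, 2, 3], ["a", "b", "c"], window=2))
```

Every window of play yields supervised (state, goal, action) examples, which is why undirected play can be useful even though nobody specified a task while collecting it.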
But I think things really honed in when we took a step back from all of these variants of bin picking and different algorithmic explorations, and we bet really big on the data itself. We took a year-and-a-half pause where we invested in a fleet of almost 10 robots and 10 expert operators collecting expert demonstrations across hundreds of different tasks, just building this offline data set, preparing for a bet to train policies that could truly solve thousands of tasks in this micro-kitchen setting. In this year and a half we didn't publish any papers on this data set. We were just collecting, slowing down to speed up. And this is the data that, as you'll see later on, resulted in so much interesting research.
So there's this period where it's like: okay, we've done all the bin picking, we've done the RL and IL (imitation learning) explorations. I think data is a bottleneck: really good data and stable objectives like supervised learning. How do we get there? A year-and-a-half pause where we're just focusing on scaling up this new teleop regime. This work then led to what we call the era of foundation models, or what I'll refer to as that.
But I'm happy to take a pause here if you have any questions or comments on this first era of existence proofs.
>> Yeah. I think the obvious one would be: obviously hindsight is 20/20, but looking back, do you think that year and a half of, so to speak, slowing down to speed up was something that could have been done earlier, or at a larger scale? What would you have done differently, or was it just the right time to do that?
>> Yeah, that's a really good question, and one I've also thought about a lot. There are a few things that make this a bit complex.
One is of course the idea that you need a large, diverse data set where data quality is high enough, so that you hit this critical mass of an offline data set, and then it's actually very productive to learn from it using large-scale imitation learning. I think this was potentially more obvious in domains like vision or language. In robotics, though, it wasn't clear yet. This was still an era dominated by reinforcement learning, where online, on-policy feedback was still viewed as king, with tabula rasa exploration. A lot of works were studying the exploration problem, and a lot of works were trying to use imitation learning not as the full solution but just as the bootstrapping solution. There was this phenomenon that we were seeing with BC-Z and other works: oh yeah, behavior cloning is interesting, but it only gets you to 70%. To go from 70 to 80 to 90 to 95 to 99%, you need reinforcement learning. This was the belief that we had at the time.
In fact, my first exposure to a lot of robot learning was from the first-ever robot learning course at Berkeley, taught by John Schulman, Sergey, and Chelsea. And there, in fact, behavior cloning is the very first lesson. The very first lecture is like: oh, this is a nice idea, let's take supervised learning, which works in language and vision, and what if you just try to do that for behavioral trajectories? This is nice, but you get compounding errors and all these problems, and this is why behavior cloning isn't a good idea. This was the lore, the belief in 2016. So BC was viewed as this simple thing that only works on toy problems, and for anything serious you need RL, potentially real-world RL. So that was why I think it took us some time to go from the RL world to the imitation learning world.
But I think the other part is that the operations, the hardware, setting up that teleop data scaling stack that's kind of standard today, was not obvious back then. The idea of putting, you know, million-dollar arms into a room and just collecting data, and then hiring people to manage them, and engineers and mechanics to maintain them: this was like you're just burning money with no end in sight, right? It's very non-obvious. And even the hardware was changing. Going from the KUKA to partnering with Everyday Robots, the form factor of this one-armed mobile manipulator would change every few months, with different generations. The hardware is changing, and the controls, the firmware, everything is changing with it. So it was very non-obvious that you could just lock in and focus on high-quality expert data, with the faith that it would be immensely useful for supervised learning. This is something that had to be proven.
>> Yeah. But why do you think that... sorry.
>> Oh, go ahead. Go ahead.
>> If you can recall, what gave the guys in charge the conviction that, hey, maybe this real-world RL exploration, this "Go in the real world" thing, may not work, so why don't we just double down on collecting real-world teleop data? Because that is a very big switch, right? What gave you guys the conviction back then to basically slow down everything, cut everything, and do this other thing that used to be chapter one, like you say, of a course?
>> Yeah, I think there were a few things, and it was a perfect storm. RL was working, but we were seeing diminishing returns. Let's look at this AW-Opt plot. The yellow lines are BC, imitation learning, baselines; the blue lines are RL, reinforcement learning. The general trend is that we kept seeing RL curves that would go up and to the right. That's great: more experience, better improvement, and that transfers to real-world evaluations. But it was so complex when you're managing these distributed systems that are part sim, part real. Policy checkpoints might get stale. How do you mix the data and track it? Maybe there's some bug in your controller that you change, and all of a sudden the data is not very useful. RL was painful. These were continuous living beasts that you had to train and manage over weeks or months.
And then imitation learning kept saturating at 60, 70, 80%. It just wouldn't get higher. So it's like: we like the fact that imitation learning is very turnkey and gets you to 60 or 70%, but reinforcement learning gets so painful, and increasingly painful as the teams and operations grow in complexity. So this is clearly untenable; neither of these worlds seems a good fit.
So we entered a period that we called a "code yellowish". With a code red, maybe you've heard that AI companies like OpenAI will have a code red when there's a new model, or Google will have a code red when something breaks. It's like: everyone drop everything, there's something we need to fix. Code yellow is like: okay, maybe it's not existential, but something's really wrong, and we should invest some time to pay down tech debt. This was more like: let's pay down some research debt. So it's code yellowish. Karol, my old manager on the team, coined that term. This BC world that caps at 80% isn't great. RL, which is excruciatingly painful and complex, isn't it either. Can we figure out a way to train on offline data sets, leverage high-quality expert demos, and reach the 80, 90, 95% performance that we need?
That's all we want. We want something that gets better with more robot data, that can consume more robot data and actually be performant. I think this was a co-design of investing in data scale-up as well as algorithmic exploration. The period where we exited all this branching complexity all consolidated after various bug fixes, refactors, and complete overhauls. One of our colleagues, Yao, an infra god, rewrote the entire trainer from scratch, and all of a sudden BC just started working. It was no longer plateauing at 60, 70, 80%; it was hitting 90, 95%.
At that point the data was clear. We exited our code yellowish, and it's like: okay, large-scale imitation learning, we're hitting 90, 95%, and we are improving with more robot data. Of course this is expert demonstration data, not autonomously collected robot data, but it is more real-world data, so it has scaling properties and it is performant. This is what we wanted. We have a recipe: imitation learning, let's go. That gave us the confidence to really scale up the investigations we were already doing with demonstrations: let's scale this up 10x, let's keep pushing for a year, a year plus, let's solve thousands of tasks in a micro kitchen.
>> Question.
>> Yeah. Do you feel like there's a deja vu? You've been through this era of seeing the existence proofs, and in today's context a lot of the ideas you see are pretty similar to something people tried before. The CycleGAN example you gave makes me think about SIMPLER, and the concurrent control makes me think of real-time chunking. So these kinds of ideas seem to get reinvented over and over again. I mean, not the same idea, but with technology getting better, people are able to do these things much better. Do you have that deja vu, going through both of these eras?
>> Absolutely. I think old is new. In research, history doesn't repeat, but it rhymes.
And I think maybe one secret is that the robotics community is very results-based. It's not that people are coming up with a new idea out of a vacuum, although often it feels like it, especially when we have a nice idea and then realize that some smart people have actually been trying this every single year for the past decade. Literally, they are trying this exact idea, maybe framed differently with different vocab, and it only becomes popular when it starts working. This is actually a meta comment: there has been a lot of desire in the machine learning community to share negative results, but I think it's really tough, because there are so many reasons an idea could have failed, especially in robotics, where your full stack goes all the way from hardware to algorithm. It could be a skill issue, it could be a bug out of your control, whatever. So the only real progression we make as a community is with successful ideas, because the only way something succeeded is if every single piece went right.
So I think we basically try the same 10 ideas every year, with some variation and sometimes some step changes. A lot of the very exciting ideas, things that we now call world models, or what we call learning on more general data distributions, using internet data with VLAs, using YouTube data, using human data: these are not completely crazy ideas that we are only thinking of today. These are ideas that have been around for decades. It just so happens that today, for some of these ideas at least, and maybe we can talk a bit more about this later, maybe the time's finally right.
As a hint of foreshadowing: we had this intern project on using egocentric human data, training on Ego4D, back in like 2023, and it was passed between four or five different interns who would each spend three to six months trying to get this to work, could not get it to work, switch to something else, and hand off to the next intern, who would take over. And these were very brilliant researchers too, like Karl Pertsch and many others, who paid their dues in the trenches at Google DeepMind trying to get human-plus-robot learning to work. I think maybe 2026 is now the right time, but this has been tried almost continuously at Google for the last three years, and a lot of it just didn't pan out: too early, not enough data. Ego4D actually isn't a great data set for robot learning; this is the open secret now. But maybe some of the new egocentric data, which is collected with robotics in mind, with better hardware and better sensorized human collection, will work. Anyway, that's foreshadowing. I'm sure we'll get there.
>> So when do you think this should stop? People are, I wouldn't say reinventing, but always doing some new version of it. If you look at UMI, for example, Shuran did a similar project during her PhD that's very close to what UMI looks like. I guess back then there was some missing piece, but it's pretty similar. In machine learning, people sort of stopped because the transformer was developed and it works so well. That's where we know we can stop, because nothing that comes after is going to be better than that. But in robotics we haven't seen that, right? Do you agree, or do you think that for some parts of robotics we have seen that transformer moment?
>> Yeah, great question. Just practically speaking, progress in machine learning and AI goes in spurts. We like to think of ourselves as researchers, but AI is really so interdisciplinary, and built off of the great work and contributions of the entire community, that we oftentimes thrash between explore and exploit. As soon as there is a recipe that works well, and maybe we can think of this as a funnel: when QT-Opt works well, that's now a system you can do a ton of interesting research on. When we decide, right after our code yellowish, that this kind of end-to-end imitation learning is a good recipe, we can go and take that, and later you'll see that the VLA is maybe one kind of tool that you can then exploit. And the same now with the transformer: it's something you can exploit.
To your point about the UMI grippers or other things like that, I would argue that it takes immense activation energy to prove a novel idea, but once that idea is proven, the robotics community especially is very quick to jump on something that works and quickly double, triple, or 10x the performance just through optimizations and using it in clever ways. But those big step changes happen more rarely, and it's not clear to me at least whether there's a clear pattern. There are often similarities, but I don't know. Things like diffusion policy, using diffusion as the technology to represent a multimodal action distribution. Action chunking; I guess I'm just showing Tony's great work now, but ALOHA is a good puppeteering system. Using video-action models is a new one, or egocentric human data.
It seems like when there is a proof of concept like this, an existence proof, it's like the story of the four-minute mile: as soon as the four-minute mile is run, then overnight a lot of other people will run the four-minute mile and find further optimizations, and then maybe after a few months or a year there'll be another breakthrough that people can quickly catch up to and then branch from there. I think robotics, and AI broadly, is just a series of these explorations and then consolidations, and then you explore again. Right now in robotics we are definitely in an explore world, where a lot of very interesting ideas are being explored. Potentially we'll see some consolidation in the coming years, and that will be exciting.
>> Yeah, thanks. I think we should continue.
>> Yeah, yeah. Let's get to it.
>> Yeah. And here I think we're approaching 2022, maybe 2023, when we start to see another branch of works. I think it's the perfect storm, where foundation models outside of robotics start working well, right? The ChatGPT moment. You start to see language models like PaLM or VLMs like PaLI start to really work well, from Google and of course other models outside. From here you have this kind of alien technology: language models trained on the internet that are showing true generality and emergent capabilities. And you see robotics now moving from an online world to an offline world, from online real-world experience collected by policies to a world where humans collect large, high-quality offline data sets and then you use supervised learning. So these worlds are merging closer together, and it was a perfect storm for thinking about how we can leverage foundation models and how we can build our own robotics foundation models.
Broadly, for RT-1 at least, this was a collision of: let's take what works from ML scaling outside of robotics; let's take the models that already exist, the pre-trained representations, the foundation models, which we can just take as black boxes; and then let's take the large-scale data sets that we're collecting. Notably, in the year and a half of slowing down to speed up, we collected something like 87,000 trajectories of robot manipulation. What happens when we combine these into a recipe?
The first thing there was this work we called RT-1, Robotics Transformer 1, where we use tokenization, discretizing robot actions and language into tokens. Everything's a token: token in, token out. Of course there are some tricks you have to do to get these models to run at 3 Hz: more efficient vision encoders, some token compression. But at the end of the day it's a transformer, 35 million parameters, pretty large at the time, especially compared to QT-Opt. And we see what happens when you have this tokenized input and tokenized output running at 3 Hz, trained on this 87,000-trajectory data set. It turns out that it was actually quite general, could do quite a lot of things, and blew all of our prior baselines, our previous behavior cloning ResNet-18 baselines, out of the water. And we were betting that we don't just want to create a new foundation model from scratch; we also want to marry it with the foundation models that existed at the time.
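The "discretizing robot actions into tokens" step described above can be illustrated with uniform binning. This is a minimal sketch, assuming each action dimension has known bounds and is split into 256 equal-width bins; the function names and the bounds in the example are hypothetical and not taken from the RT-1 codebase.

```python
from typing import List, Sequence


def tokenize_action(action: Sequence[float], low: Sequence[float],
                    high: Sequence[float], num_bins: int = 256) -> List[int]:
    """Map each continuous action dimension to an integer token in [0, num_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)            # keep the value inside its bounds
        fraction = (clipped - lo) / (hi - lo)        # position within the range, 0..1
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: Sequence[float],
                      high: Sequence[float], num_bins: int = 256) -> List[float]:
    """Map tokens back to continuous values at the center of each bin."""
    return [lo + (t + 0.5) / num_bins * (hi - lo) for t, lo, hi in zip(tokens, low, high)]


if __name__ == "__main__":
    low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
    tokens = tokenize_action([-1.0, 0.0, 1.0], low, high)
    print(tokens)  # [0, 128, 255]
    print(detokenize_action(tokens, low, high))
```

Once actions are integers from a fixed vocabulary, predicting them becomes a classification problem per dimension, which is what lets a transformer treat control as token in, token out. The cost is a quantization error of at most half a bin width per dimension.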
SayCan was, I think, one of the first really popular works from our team, where we took language models as planners and melded them together with robots. We took what they're good at, maybe the planning and common-sense reasoning, and made sure we could ground them in what robots could actually do. So we took the language model's predictions for, well, "How would you put an apple on the table? I would [blank]", completing a robot plan, and then weighted that against a value function that we learned, which represents affordances: what a robot could actually do in a given situation. By weighting what the language model thought was right, with its planning and reasoning, by what a robot could actually do via the value function, you get a plan that is both reasonable and achievable.
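The weighting described above can be sketched as a product of two scores per candidate skill. This is a toy illustration: the skill names and all numbers are made up, and `select_skill` is a hypothetical helper. In the real system the first score would come from a language model's likelihood of each skill description and the second from a learned value function evaluated on the robot's current observation.

```python
from typing import Dict, Tuple


def select_skill(llm_scores: Dict[str, float],
                 affordances: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Pick the skill that is both useful (per the language model) and feasible (per the value function)."""
    combined = {skill: llm_scores[skill] * affordances.get(skill, 0.0)
                for skill in llm_scores}
    best = max(combined, key=combined.get)
    return best, combined


if __name__ == "__main__":
    # Instruction: "put an apple on the table". All values below are invented for illustration.
    llm_scores = {"pick up the apple": 0.6, "go to the table": 0.3, "pick up the sponge": 0.1}
    # The robot is far from the apple, so picking it up is currently unlikely to succeed.
    affordances = {"pick up the apple": 0.1, "go to the table": 0.9, "pick up the sponge": 0.8}
    best, combined = select_skill(llm_scores, affordances)
    print(best, combined)
```

In this made-up scene the language model prefers picking up the apple, but the low affordance score pushes the choice toward a step the robot can actually execute right now. Repeating the selection after each executed skill yields a multi-step plan.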
This was pretty cool because it technically came out before RT-1. So it was the first time we had shared the results of our year and a half of investment into a large-scale imitation learning pipeline. It was the first contact between language models and robots. It was a shiny Google paper, and I think this is where we learned that Fei was actually a brilliant cinematographer, shooting video with very polished production value. He definitely did a lot to raise the bar for future robotics releases, where everything now needs a shiny, well-done video. But this was a hit: it's shiny language models, it's a very performant manipulation system, and it was the first time we released this micro-kitchen multitask environment. It was the right place, right time, right idea. It had something for everyone. It had the AI bleeding edge, it had the robotics, it had the transition to an imitation learning setting. So this was a really fun paper that I think really kickstarted a lot of cool work on our team at GDM.
>> Yeah, exactly.
>> Yeah. Yeah.
>> Okay, cool. I didn't...
>> Yeah. Fei was running around and filming everyone and scripting. And when this was on the front page of the New York Times, we all printed it out and had it around the office. It was a really cool moment of, wow, maybe this foundation model thing, maybe there's something here.
>> Wait, so this wasn't made by Google's marketing team, the video, the original SayCan?
>> Yeah, I think a lot of the B-roll was from Fei.
I think we had some support from the Google Brain marketing team to put it together. But a lot of the inception was, oh yeah, this would be a cool launch. Coming out of COVID, we had a year and a half of putting our heads down, so this would be a cool way to showcase where the robotics team wanted to go. We wanted to go from language to actions. That was the tagline of our team for a while.
We wanted to do thousands of tasks in human-centric environments. So this was like a perfect storm. The timing couldn't have been better, I think.
>> And this is roughly when ChatGPT came out, right? When you guys published, was it after or before?
>> Yeah, I think this was maybe a few months before ChatGPT.
But I think a lot of the lessons were around at that point, right? Maybe some initial scaling laws from GPT-2 and GPT-3. A lot of that was interesting. It was either a few months right after ChatGPT or a few months before, but it was right around that time, when the LLMs were really getting good enough to do interesting things. We were using PaLM as our language model there. But this was one way that we could use off-the-shelf language models for robotics, and there were many other ways. We could use them for data augmentation: this was a work called DIAL, where we took the idea of synthetic data generation with VLMs. We would take our 87,000-trajectory dataset and then use vision-language models to relabel it, kind of like hindsight labeling but in language space, to produce something like a million-scale trajectory dataset that we could then train on, and all of a sudden we get much better language generalization. The original 87,000-trajectory dataset for RT-1 and for SayCan covered only 500 tasks. Of course, in reality you want to generalize to many, many more tasks. This is one way you could use VLMs.
But taking this idea even further, you might ask: instead of just using the VLM as a planner or as a data annotator, could we just use it as the backbone itself? I think this is kind of obvious in hindsight now, but at the time it was like, oh, these models are too big, inference will be too slow. It's not clear that this internet training data is actually useful. We should really try to restrict what the LLMs are. They're just a black box; we'll pull them into our system but not use them as the policy itself. And so we came up with the idea of a vision-language-action model, a VLA model. I think the audience of this show will be very familiar with VLAs by now.
But the idea was: we'll co-train, we'll turn robot action prediction into a VQA task, and we'll see what happens. We trained on models from 5B all the way to 55B, which is of course way larger than the PaliGemma-based open-source VLAs we use now.
But it was an interesting proof of concept at the time, and we just saw tremendous amounts of emergent reasoning and generalization that was once again blowing RT-1 out of the water.
This was also the era of looking at collaborations. Once we've tasted scale, it's addicting. It's like, okay, we scaled the backbone, we scaled the components in our system, we scaled the robotics data we collected in-house; maybe we should also scale the embodiment. Can we train on data from arbitrary robot embodiments? And there's the now quite famous Open X-Embodiment project, where we collaborated with 34 different research institutions in just saying, "Hey, you've already collected data for your project. How about we just consolidate it, put it in a unified format, open source it, and then anyone can use it, right?
We just make it accessible." We ended up training both RT-2 and RT-1 backbones on this cross-embodiment dataset.
And it turns out that some of the skills and behaviors collected on one robot could actually transfer to another robot, especially for things related to language, or the descriptions of pushing near versus onto versus inside. Those are things that started to emerge when we did this cross-embodiment training.
We were also able to study research questions. Now that we had this amazing tool of an offline dataset, the 87,000-trajectory dataset, and backbones like RT-1 and RT-2, we could start tweaking and optimizing these design decisions.
One area that I personally was very interested in was action representations which could help you generalize to out-of-distribution scenarios. On the left here, the default dataset has this very simple, synthetic, structured text, like verb-noun or noun-verb-noun, something very simple. But is there a way that we could expand that? We looked at quite a few different ways to do this: with things like bounding boxes or segmentation masks; with chain of thought, where you break down long-horizon instructions into shorter-horizon instructions within the policy itself. You can look at affordances; you can look at trajectories that could even be specified by egocentric pose tracking.
A lot of very creative ideas, all accelerated by the fact that you did have this initial backbone of RT-1 or RT-2, and you could quickly try out new ideas and see the properties that would emerge. I think this really highlights the point you were getting at earlier: when does something work? I think things really start to work when you have that critical mass at the starting point. Once you have a very good baseline in RT-1 and RT-2, it's very easy to try a novel idea and make very fast progress.
Yeah, I guess I'll once again pause there for the era of foundation models.
Yeah. Well, I guess your era, the way you define it, is still relatively recent. I don't know whether it's too early to sort of look back now. I have a question.
>> So let's say you see whatever progress there is today and you go back to that era of foundation models. What would you change? What is something that you guys thought of at that point in time where, knowing whatever has happened today, you'd go back and say, no, let's not do that, or maybe do something different? What would be changed?
>> Wow, great idea. Great question.
I think VLA is an idea which could have happened at least a year earlier.
We were already putting a lot of time into creating RT-1 from scratch, right? All the design decisions: the FiLM conditioning instead of cross-attention, the EfficientNet as a vision encoder after trying out different ones.
Trying out TokenLearner for the compression. A lot of that we were building from scratch, taking all of these components from neighboring fields and trying to create this Frankenstein ourselves that was designed for robotics from the beginning. And of course across the field there were a lot of other trends, like let's use pre-trained visual representations, really nice ones like R3M, and make them easily accessible. But it was very modular: you take the minimum you need from other fields and no more. And I think with the VLA it's like, no, let's just take everything. Let's go to them. Let's take the training stacks, the infra, the pre-training datasets, the co-training; let's take everything from our peers in vision and language, and then just make the minimum epsilon change needed to make those methods work for robotics, as opposed to doing this all from scratch. I think in robotics we have a tendency to try to do everything from scratch, because we're on the hook for the robot doing the thing. If you have complete control, it's a lot more flexibility to make gains anywhere you want. But I think VLAs took a leap of faith. You just trust that the VLMs are smart, and once you take the VLM as a black box, you can no longer look inside it and change it. It's kind of scary making that leap, but I think we potentially could have made that leap a year earlier. But a lot of other ideas that came later, which maybe we'll talk about, like switching to bimanual or dexterity, or thinking about robot pre-training and post-training, I think it was still too early for those back then, right?
Like you didn't have these base foundation models. Even these 87,000 trajectories in a one-armed tabletop manipulation setting were crazy at the time. That was later followed up with the open-source Bridge dataset from Berkeley, but it was still not like the hyperscaling era we are in today, where that's quaint. If you turn 80,000 trajectories into number of hours, that's like hundreds of hours. That's not even thousands of hours. We are now at two, three, four orders of magnitude more data in robotics than back then. And take a lot of the ideas that are working now: for example, I really like the recent GEN-1 release from Generalist, where they are pre-training on 500,000 hours of interaction data. A lot of those properties just would not have been possible back then. And in order to gain the confidence and insight to collect the kind of high-quality sensorized data that we do have today, I don't think it would have been possible to jump-start back then.
You needed to go through this co-development of the hardware, the operations, the lessons, and also wait for other components in the ecosystem to get ready before you could actually make that leap. So I think at most you could have accelerated VLA by one year, but for the other components, I'm not sure it would have been dramatically different.
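The back-of-the-envelope conversion from trajectories to hours mentioned in this answer can be checked with a few lines of arithmetic. The average episode length is not stated in the talk, so the 20 seconds used below is purely an assumption for illustration; the 80,000 trajectories and 500,000 hours are the figures quoted above.

```python
import math


def trajectories_to_hours(num_trajectories, seconds_per_trajectory):
    """Convert a trajectory count into hours of robot data."""
    return num_trajectories * seconds_per_trajectory / 3600.0


# Assumption: an average tabletop episode lasts about 20 seconds.
ASSUMED_SECONDS_PER_TRAJECTORY = 20.0

hours_then = trajectories_to_hours(80_000, ASSUMED_SECONDS_PER_TRAJECTORY)
hours_now = 500_000  # pre-training hours quoted in the talk

# Gap between the two dataset sizes, in orders of magnitude.
orders_of_magnitude = math.log10(hours_now / hours_then)
```

Under that assumption the older dataset comes to a few hundred hours and the gap is roughly three orders of magnitude, consistent with the "hundreds of hours" and "two, three, four orders of magnitude" estimates in the answer. A longer or shorter assumed episode length shifts the result proportionally.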
Yeah. And I'd also say, when I first met you, Ted, during that time probably no one cared about robotics, and then obviously the VCs all woke up one day, let's say two-plus years ago, shortly after ChatGPT. So I think partially you need capital, right? Not everyone is going to be a Google, and you need someone in the bigger ecosystem; you need new startups with new ideas that are well funded to pursue these, Generalist being a good example. So I think capital is a big one, especially when for some of these you need a certain scale to even test. And I imagine the VCs also need some proof, right? And it would be hard to get that [laughter] unless some of these things are starting to work, at least in research.
>> Okay, cool. Shall we talk about your next era?
>> Yeah. Speaking of proof, the era of scaling: I'm sure we'll have a lot to talk about here. I think after VLA, it was clear that VLAs were a very powerful idea.
And if we're constrained by the backbone or by the data, let's scale both. This is a work, Gemini Robotics, which was first released in March 2025, more than a year ago, which was maybe about a year to a year and a half after RT-2 and the works we just showed for the first robot foundation models. So from RT-2 to Gemini Robotics, the world was hyperscaling, right? Pi was founded, Physical Intelligence; Generalist. There was the first wave of a lot of roboticists thinking, okay, now's the time, VLA is the paradigm, let's scale it. And I think there were other trends too. We saw the ALOHA platform being very good, ALOHA 1 of course being developed by Tony at Stanford and ALOHA 2 when he spent a year at Google DeepMind, where the insight was that really high-quality data could get you very high-frequency control that could really max out dexterity. This two-arm system that's very fast, very intuitive, could do things that you could never imagine on a one-armed 7-DoF mobile manipulator. So I think with Gemini Robotics, this was scaling up in terms of the embodiment complexity.
ALOHA and the Apptronik humanoid you see on the right there are much more complex. The capability envelope of what these platforms can do, if you master it, is immensely larger than that of the one-armed Everyday Robots robot.
The backbone, of course, Gemini 2.0 at the time, which we forked Gemini Robotics from, was a very capable model. From the start, Gemini was always a very good multimodal model. I think we can debate its performance on other domains like multi-turn dialogue or code, but for multimodal understanding, images and video, it's always been a tremendous model. And of course scaling the data: I don't think we ever released how much data we had on the ALOHA platform, but it was substantially larger than the 87,000 trajectories we had collected over that year and a half. And with Gemini 2, there's something interesting, which is that before, we didn't look inside the VLM. We just took PaLI as the backbone for RT-2, and it's okay, we'll take it as it is. For Gemini, what was interesting is that a lot of us, myself, Fei Xia, Jacky Liang, many others, were interested in seeing inside the sausage factory, seeing how we were training very powerful and general vision-language models like Gemini, and recognizing the complaints that many in robotics have: oh, VLMs hallucinate, they don't have physical common sense, they don't have a notion of spatial reasoning and temporal reasoning, we can't rely on anything they do. On the Gemini team, we actually had the opportunity and responsibility to improve that. What we did was this Gemini Robotics-ER model, with this embodied reasoning team, where we looked at how to improve the spatial reasoning, the kinds of common-sense physical intuition that were lacking in these VLMs. So we did that. Gemini Robotics-ER was a really cool VLM that could do things like 3D object detection, 2D pointing, and predicting grasp angles for objects.
And then how can we use that to train this VLA, this Gemini Robotics VLA, that could do a lot more dexterity and a lot more generalization that wasn't present in previous VLAs?
Then the next iteration of Gemini Robotics came out in September, right before CoRL, maybe six or seven months ago now: Gemini Robotics 1.5, which once again had two parts. It had the VLM, Gemini Robotics-ER 1.5, which was now a reasoning model. We're now in the era of reasoning, after Gemini 2.5, after DeepSeek, after o1 and the o-series, where the world updated that, wow, this post-training where you're using thinking and inference-time compute scaling was really powerful for VLMs.
Can we also do this to improve embodied reasoning? It turns out that indeed we could. And then from that model we developed the Gemini Robotics 1.5 VLA, which I think has two main contributions. One is that, just as thinking helped VLMs, thinking can help VLAs: thinking in text to plan what to do next, breaking down long-horizon actions into shorter-horizon language instructions.
As well as motion transfer. Motion transfer was how we can collect experience from different embodiments and, without post-training, directly transfer those motions to other embodiments. Notably, the same neural network was able to take in very different platforms, from humanoids, from the bi-arm Franka, from ALOHA, and then transfer motions between them. Contrast this with Open X-Embodiment a few years ago, where all of those were one-armed tabletop robots with comparable action spaces. These are now three embodiments which are entirely different, with very different kinematics, very different ways of manipulating the world, and now we're seeing transfer between them.
I think that brings us to wrapping up a lot of the work that was happening within Google DeepMind. Of course, before then, robot learning has been a vibrant community with a ton of great work from academia, from other labs at Nvidia and FAIR and Boston Dynamics and so many different places. But in the last three years especially, there's been an explosion of VC interest, of startups, of academic collaborations, of new industry and startup labs spinning up, and humanoid companies that are doing amazing things. I think there are quite a few things that are being scaled up, some of which are generality, performance, and data, and of course research entropy. I'll briefly touch upon some things I particularly like, and maybe we can dive deeper into some of them. For model performance, I think there are some key trends. We see with π0.6 and other methods this robotics post-training regime: how can you adapt a pretty good general policy to specific long-horizon, precision-oriented tasks, like the live demo Pi showed at NeurIPS of this very long-horizon espresso-making task? We've also seen, from Skild and other players, GPU assembly, or from Dyna, post-training on different laundry-folding tasks. We've seen scaling of generalization that's been very important for a lot of these robot foundation model companies, where generalizing to new situations is really remarkable, as we've seen in instruction, action, and visual generalization. We've also seen the idea of evaluations being very important. As we claim our models can do everything, we have to back that up, right?
And running all these trials in the real world in one lab is very hard. So there are ways now of trying to do that in simulation with these real-to-sim evals. There are ways of distributing these kinds of evaluations all across the world with something like RoboArena from Karl and others, or ways of using world models for eval, like 1X has shown, or work that we showed at Google as well.
There's also of course been scaling of the data itself. There's scaling of thinking about hardware, data collection, and model co-design. I've been very impressed with the work from Generalist and Sunday, thinking from the start about how you can really take hardware to the limit. And of course recent work with egocentric data: this has been a huge trend, it's the cool kid on the block. How can you scale human data? Amazing work from Nvidia, from Pi, from Georgia Tech with Danfei's work; really amazing stuff happening in egocentric human data scaling and methods that can consume it. And of course products: deploying products, getting that Tesla-style data flywheel where the cost of your data is subsidized by the fact that, when you collect your data, as a side effect you add value to someone's life. Whether it's teleop or autonomous, you get that real-world data distribution and you start mining the long tail of all the rare, in-the-wild cases that you will need to tackle eventually. And finally, this is my last slide: I think research entropy is being scaled too. Maybe this is what I'm most excited about for robot learning: we are not in a funnel phase, we are in an exploration phase. There has never been so much diversity of opinions, with very smart people with a lot of conviction sprinting in each of these directions. Whether it's world models and video action models, whether it's robot-free data with human data or sensorized human data, whether it's leveraging the kind of training from verifiable rewards that we've seen become the main paradigm in frontier labs outside of robotics. How can we get that in robotics?
And finally, all of the talk today was about manipulation, but locomotion has seen a similar Cambrian explosion in progress and maturity of the technology, with dancing, backflipping, kung fu robots, where it's almost commoditized now. But that entire line of work has been done with sim-to-real whole-body control, with very different methods. That is completely zero-shot sim-to-real; it's on-policy online reinforcement learning on very small networks. And I think it's very interesting that many compare the manipulation world to emulating what goes on in the human brain, with a very intelligence-first perspective of learning from demonstration, learning from expert data, supervised learning, whereas a lot of what happens in locomotion is closer to the cerebellum or the lower spine, where a lot of this is reactive or instinctual. So there's the question of how we're going to merge these two worlds, and then also how we merge the world of manipulation with the world of reasoning, like if you're playing chess, where maybe it's a bit of both. I think these are very interesting open questions, and I'm pretty sure whatever breakthroughs we see here or elsewhere are going to revolutionize the game once again. So I'm really looking forward to seeing what the community is going to do on all these amazing topics. But with that, I'm happy to pause there on the overview.
That's the era of scaling as I see it.
>> Good question.
>> Yeah, thanks for talking through the history, the past of robot learning, what you have gone through, and looking into the future. I have a question. I guess I think of it this way: how has the role of academic research in robot learning shifted? Is doing academic research in robot learning today still going to be relevant, or is industry going to be the one that takes the lead, or maybe startups that can iterate fast and push some of these ideas much faster? What's your take on this? What I mean is, if you look at the way you started this talk, you talked a lot about academic research, right? These are from Sergey, Chelsea; small papers that bring in a certain idea. And towards the end of the talk it's all about startups and hyperscaling. Has academic research diminished in terms of its impact on robot learning in today's context?
>> Yeah, I think it's a great question, and I think there are different forms of impact. It almost speaks to the maturity of the field that actually commercializing some of these ideas is now even possible. Fields evolve and grow over time, and I think we're maybe seeing the first wave of an attempt like that. I think there are definitely some things which are much easier to do in industry.
I would view this opinionated data scaling, with hardware-data-model co-design, as something which is only possible in startups, with conviction from the start. Both Generalist and Sunday took a year, a year and a half of heads-down grinding to get to the critical mass where they're seeing very remarkable performance. At the same time, what does academia do? Do you just wait for players to open source a dataset? For example, EgoVerse; I know Mecca is doing great work at maybe releasing or sharing some of their data. There's a consolidation of egocentric data that folks can share and iterate on. There are efforts at decentralizing evaluations or data collection, like DROID. But I think when academia is competing on capabilities, these kinds of cross-institutional collaborations are going to be very critical, because we're in a world where demos are nice, but results and capability matter even more. So will a model that's topping these kinds of leaderboards come from academia? I'm not really sure. If you look at everything here, DreamZero is a very amazing model, and the rest are Pi models or internal versions of those. The compute requirements, the data requirements, the operational requirements to iterate on model design these days: a lot of it is just the blood, sweat, and tears of being next to the robot, having access to it, having access to a fleet of robots. The entry fee, the cost to play in this space, is quite high. So I think there's a lot of interesting research and contributions that come from academia; if you just look at the past few years, there are fundamental algorithmic improvements as well as interesting analysis of models that are released.
If you look at last year's CoRL, and I'm sure this year's CoRL and RSS, you'll see a lot of models that are building off of π0.5 and DreamZero, that are building off of DROID, SIMPLER, a lot of these released datasets. But I do think the role of academia is changing. In terms of capabilities, performance, application, and commercialization, I think academia is farther from those than before. Before, you might commercialize ideas from academia directly into product. These days that role is taken on by industry. So industry is the middleman that scales things up, that tries out ideas at the bleeding edge of how good robots can get, and academia is providing insights and rigor, providing fundamental ideas. But to prove out those ideas, I think that needs to happen in industry, and then when industry proves them out, maybe they get deployed into products with vertical robotics startups. And I think that's also why, frankly speaking, a lot of very strong academics in robotics have part-time roles in industry or at startups, or are on leave doing their own startups, just because I think there's a graduation or maturity of the current wave of technologies. We'll see if it lasts, but that's what I'm seeing a lot.
>> Okay. So maybe if I can just quickly follow up on this question: if that's the case, would you advise someone who's interested in getting into robot learning to do a PhD in this era?
>> Yeah, I think it really depends on what motivates them. Academia of course always has the perennial benefits of the mentorship, the learning, the focus to really study a problem deeply. If you have high conviction in a specific area that isn't something industry or startups are exploring, if it's really off the beaten path, I think it's still a great place. If you want to pursue teaching in the future, it's a great path. That being said, if you care about capabilities, pushing the capability of the actual models you train, getting to the largest scales, I think there are a lot of phenomena today that only appear at these super-large-scale plays, and I don't think you get access to those phenomena in academia. And of course there are many other interests too: if you're entrepreneurial, if you want to realize not just theoretical, fundamental research improvements but also see an impact in terms of actual robots that are deployed. It was not really possible 10 years ago, or even 5 years ago, to imagine that your robot learning research could actually touch human lives in the near term. But that's very possible and happening all around the world now. So there are opportunities that weren't there before, and some of them are weighing very heavily. I think a lot of people who self-selected into robotics before are drawn to these. And so if you're a new student, I think there's a lot of soul-searching you should do. Of course, there are ways to try to get the best of both worlds: do a PhD, but then spend a lot of it interning at different places, collaborating. I've seen a lot of students succeed at that quite well.
But then I think there are also students, unfortunately, who are caught in the middle, who end up dropping out very early or something. For those students, maybe it would have been good to go directly to startups or industry.
>> Maybe as a proxy to that question, what do you think of the role of open source in robotics?
>> I mean, obviously academics usually open source by default. In the past, I would say the big industry players also open-sourced more. Do you have any views on this?
>> Yeah good question. I think it's an aside. I think it's kind of a shame that you know probably if all of the startups and industries right now had just open source everything they have we probably have I don't know one two >> way more data in the data we probably solve the whole thing right and everyone just was like ah okay like let's >> yeah industry scale open x embodiment yeah we we are going to open source something >> soon along those axis yeah so open source is not there yet >> um yeah it's very important, right? I I think it's how science gets done. I think we are now clearly in an era of research, right? Like like all these crazy ideas. I think if this was a few years ago, we would make very rapid progress. But now we are in an era where everyone's in their own silo, right? If you're betting a lot on video action models or you're betting a lot on egocentric data. Like maybe previously we could have answered some of these questions faster in theory if we had been sharing. But now that's not really happening. And then for the people that are sharing, maybe even add it up. All of the collaborative efforts, they add up to like one one pie, right? They add up to one skilled or less, right? I I don't know. Like like it's it's hard to kind of measure this, but like and and and it's it's of course very hard to ask for, hey Pi, hey skilled, hey generalist, like just release all of your IP to the world, right? Like that that's come on. Like that took them a year and a half of very hard work to make. So, it's kind of like this kind of weird structure where maybe we're not globally optimal for humanity, but we're locally optimal for every player. So, I don't think I have a good answer to this. I can empathize with all parties.
I just sometimes am nostalgic for the good old days where everything was published by default. Everyone was sharing their latest work at workshops and at conferences; they were publishing, they were open sourcing.
>> Yeah. I mean, if I can just make a passing comment: obviously we run this pod, unlike this particular episode, mostly just featuring particular papers, and a lot of them are increasingly [laughter] from China, actually. They're fantastic research, and they're still really publishing a lot, oftentimes big models with open weights as well. Anyway, it is what it is. And also, shout out to the Hugging Face folks, who are doing a lot of open source. I kind of feel that for all of us, even those who are working at a big industry player or at startups, obviously there needs to be some kind of value capture. I totally understand; I'm an entrepreneur myself. But I think all of us definitely benefited from the open sourcing, right? [laughter]
>> Yeah, it's all built on, you know, the shoulders of giants, right? So maybe one last question from me, Ted. As of today, are there any old ideas, going back to maybe the first era, for example online RL, that you think it's maybe time to pick back up? What is something that was tried and tested, where it was kind of like, okay, maybe that doesn't work, but you are a lot more hopeful that maybe it's worth trying at scale again?
>> I think maybe one thing, if I just look at the ideas that were happening in this era: of course, with the pre-training and post-training now, like, let's say, 5.6 six and other methods, maybe we start looking at the IO plus methods of real old oper. I think with a lot of this, new is old. A lot of this rhymes.
I think one thing that we might actually be at a good time for is this kind of learning-from-play setting, where you really take coverage of the state space: you're just doing useful tasks, but not with a goal in mind. You allow human curiosity to go and explore the entire search space, and then use hindsight goal relabeling to backtrack from that. At the time this was very hard because of the scale of data. I mean, this entire paper was in sim, but to do play data in the real world is very hard. Of course others, for example Oier Mees in Freiburg at the time, set up their own learning-from-play setting, but I think it hasn't really been seriously attempted since then. But now, in the era of 500,000 hours from Generalist, or, you know, we'll probably be hitting million-hour datasets already with some of these egocentric data startups, maybe this is actually enough: we are covering the global state space enough that this play-like exploration is actually tractable. Before, we were never covering it enough; we had to be very strategic in where we spent our real-world budget. But maybe now we can actually do it.
>> Yeah, very cool.
>> I have a last question. Everyone posts on X, like, this is the GPT moment of robotics, GPT moment this, GPT moment that. Obviously these are cool demos, right? But when do you think you will be convinced that this is actually the GPT moment of robotics? What do you have to see to say that, okay, robotics is, I use the word soft bite, you know, we are there?
>> Yeah, great point. For this I always decouple it into two parts, right? One is: what technical research breakthrough was the ChatGPT moment? Was it the transformer?
Was it GPT-2? Was it GPT-3? The scaling laws? Was it GPT-3.5 instruction tuning? There are many of these that together made up the technical breakthroughs. It was not ChatGPT itself. ChatGPT itself, in what, December '22, was I think a UX thing. It's like a weekend project where someone said, "Oh, let's turn this into a multi-turn chatbot and let's release it." And then it went viral. So I think part of it was the product form factor. We call it the ChatGPT moment because of the fact that people loved it, right? It was good enough that people from all around the world, with different experiences and technical backgrounds, could use it and find value in it, and it was a delightful experience. So part of that was just the productizing, which ties into the commercialization of it, and part of it was a research breakthrough. The product part is a lot easier for me to answer. The ChatGPT moment of robotics will be when there is some kind of general-purpose manipulation system that touches a huge swath of society, in a way that isn't just a tiny sliver.
It's when it becomes so accessible, when it's so delightful, when it's adding value to people's lives in a very democratized fashion, that people just love it and get it, right?
What would that look like? I'm very excited by the work of companies that are focused on deployments these days, especially in consumer-centric environments, whether it's startups or whether it's larger companies. Overall, I'm not sure when we'll be seeing this consumer moment for society. I don't know if we'll see it in the next year or two. I definitely hope so. I'm on all the waitlists. As soon as a consumer rollout drops, I put in my deposit. I'm on all of them. I really hope it happens. But it's not easy, you know. It could be that we are in the early stages of the kind of decade that took self-driving from the first self-driving explosion to when Waymos and Teslas and Zooxes are going around San Francisco. That being said, for the technical side: what was the technical breakthrough that made things possible?
I think there it's also a lot harder to attribute. There isn't a single viral moment where things just click, even for language models, right? Was it the transformer, was it GPT-2, was it yada yada. For robotics, I think it will similarly be a combination of pieces that all combine to the point where the minimum threshold for utility is reached, and that's when it can be commercialized. Whether we're there already, I'm not sure. I think a lot of companies certainly think so, and think that we're in execution mode. And I think a lot of very smart researchers also think that we'll need another breakthrough. And maybe some of the results from the last slide here, around the scaling, around the architecture, the video action model, maybe around a very advanced whole-body reactive system.
Maybe some of these will be the last puzzle piece. But I'm much more confident about my prediction for the consumer side, which is that it could happen in as soon as a year or two, but I personally think it will be a bit longer, and it's going to look like something that's very product focused, very UX focused. That's when the ChatGPT moment for robotics will be, in my opinion.
>> And then in terms of how we're going to get there: if you have to place 100 bucks today on one of these four ideas, which one would you place it on?
>> I think it's got to be one of the top two. I know it's a boring answer, but of course it's because they're the front runners. Everyone loves video action models. Everyone loves egocentric data today. I think these two are the technical breakthroughs that will win. And in terms of a product, if there is a product that succeeds in the next year or two, I think it's got to be Generalist or Sunday. Again, they're my friends and colleagues, so I'm biased of course, but I think they've done amazing work. And I think they've been very focused on what that end-to-end user experience looks like. I think, if done right, there's a path where, if they deliver on the technology, consumers could adopt that. So on the consumer side, the custom-built, for-the-home co-design is what I'm bullish on, and for the research ideas I'm bullish on video action models or human data.
>> Well, thank you so much, Ted. This was really a very, very fun talk. Thanks for making the time, man. And I'm glad we finally got you on the pod.
>> And yeah, all the best for your next gig, man. Hopefully at some point we can get you on the pod again. [snorts] Okay, cool. I think that's it then. Let's wrap up. We'll see you soon.
>> Thanks guys. Really appreciate you inviting me on.
>> All right see you next time.