RLDX-1 correctly asserts that true intelligence is embodied, moving beyond abstract reasoning to the high-fidelity synchronization of tactile and visual feedback. This architecture marks a critical pivot from "brain-heavy" AI to robots that can actually master the physical nuances of the real world.
Deep dive
RLDX-1 | Dexterity is Intelligence
All right. Hi, I'm JC from RLWRLD. We just launched RLDX-1, which is our first robotics foundation model for dexterity. Thank you all for being here, the people who actually built RLDX.
Before we get into details, David, tell me: why are we so obsessed with hands?
The tasks we care about require human-level hand capability. When you are pouring coffee, the cup gets lighter, so you need to adjust pressure with your full-finger grasp. If there is a moving object on a conveyor belt, you have to grab it in time. When you are replacing a light bulb, you need to rotate it with your fingers. In logistics, service, and manufacturing, a gripper can just pick and place.
But these tasks need a hand. Today's VLAs already can [unintelligible]. Many people guess that this comes down to enhancing the model's intelligence. Even with grippers, they expect the model will figure out everything with more data and more model parameters. But some tasks remain challenging, and this is not due to model intelligence. The main challenge is dexterity. So we recognized the necessity of five-finger dexterity, and we designed each part of the system around it.
I see. But it doesn't seem like the hand is the only problem.
Exactly. Beyond just having the hands, existing models still have so many missing pieces. They cannot feel, they cannot remember, and they cannot understand movement. That's why existing models fail on so many tasks.
Yeah. Here's one case. When we pour the coffee, the coffee pot gets lighter.
Visually, almost nothing changes, but RLDX reads torque in real time. It feels the weight shifting and knows exactly when to stop. This gives the model a sense of touch. So RLDX can see and feel.
What else? Motion understanding. Think about movement on a conveyor belt.
An object can move left to right, fast or slow, so you need temporal context. But existing models take only a single frame and decide what to do, so by the time the hand arrives, the object has already moved. RLDX captures motion context across multiple frames, so it knows where things are going, not just where they are. And then there's memory. Say you need to put exactly 10 apples into a bag you can't see inside. The model only sees the current frame, so it has no idea how many are already in there. So for our model architecture we chose MSAT, a multi-stream action transformer, to resolve these problems, and I will talk about it in more detail later.
Wow, thanks for the explanation. What about the data side?
That was another challenge. In the lab you can collect demonstrations much more easily, but factory environments are often busy, less organized, and full of extra things in the scene, so collecting data in those settings is very hard. The main reason is the environment itself, not the cost. And even after solving all these problems, it is still not a ready-to-deploy model. The model learns a lot from imitating demonstrations, but it doesn't know how to recover from mistakes or how to make optimal moves toward the goal. That doesn't just happen out of nowhere.
Oh, that's a lot of problems. So how did you design around all of this?
We see each challenge as a core constraint. So dexterity, the way we see it, is not just about the hand. It's the ability to overcome these constraints. We approached each one directly, because each carries a different kind of signal, and those signals are all completely different. So right away the question is: how do you build a single model that handles all of them together?
All right, so let's get into this. I want to hear how you actually built each piece. Let's start with the model architecture.
We work on the core model architecture. The question was: how do you process video, memory, tactile feedback, and even proprioception all together? In real-world manipulation we can't afford to miss any of them. Yet most approaches barely handle one or two modalities in a single stream.
When you try to integrate diverse modalities, all with different structures and temporal resolutions, that just breaks down. Here we chose MSAT, a multi-stream action transformer, where each modality gets its own dedicated stream and joint self-attention lets them communicate.
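Editor's sketch: the talk gives no implementation details for MSAT, so the following is only a minimal pure-Python illustration of the general idea of per-modality streams that communicate through joint self-attention. The class name `Stream`, the single attention head, the token counts, and the dimension are all assumptions made for illustration, not the RLDX-1 design.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax over each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Stream:
    """Private projection weights for one modality (hypothetical layout)."""

    def __init__(self, dim, rng):
        self.wq = random_matrix(dim, dim, rng)
        self.wk = random_matrix(dim, dim, rng)
        self.wv = random_matrix(dim, dim, rng)
        self.wo = random_matrix(dim, dim, rng)


def joint_self_attention(tokens, streams):
    """tokens: dict of modality name -> list of token vectors.
    Each modality is projected with its own weights, then all tokens
    attend to each other in one shared attention step."""
    names = list(tokens)
    q, k, v = [], [], []
    for name in names:
        s = streams[name]
        q += matmul(tokens[name], s.wq)
        k += matmul(tokens[name], s.wk)
        v += matmul(tokens[name], s.wv)
    dim = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = [[x / math.sqrt(dim) for x in row] for row in matmul(q, k_transposed)]
    mixed = matmul(softmax_rows(scores), v)
    out, start = {}, 0
    for name in names:
        end = start + len(tokens[name])
        update = matmul(mixed[start:end], streams[name].wo)
        # Residual connection: each stream keeps its own token count and shape.
        out[name] = [[t + u for t, u in zip(tr, ur)]
                     for tr, ur in zip(tokens[name], update)]
        start = end
    return out


rng = random.Random(0)
dim = 4


def random_tokens(n):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


tokens = {
    "video": random_tokens(3),
    "tactile": random_tokens(2),
    "proprioception": random_tokens(1),
    "memory": random_tokens(2),
}
streams = {name: Stream(dim, rng) for name in tokens}
fused = joint_self_attention(tokens, streams)
```

In this sketch every modality keeps its own weights and its own number of tokens, while the shared attention matrix is the only place where, for example, tactile tokens can influence video tokens.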
That's how we process all of them without one drowning out the rest. The VLM backbone is the model's eyes and brain. It handles spatial, temporal, and high-level reasoning for action generation. To support temporal reasoning, we train the vision encoder to extract motion-specific information, and we train a video-language model to effectively process video for action generation.
Furthermore, we utilize data for spatial and high-level action reasoning, with high-level actions grounded in real-world actions. As a result, our VLM not only understands the world but also understands how to act. We also had to bridge the VLM and the head. The VLM outputs a very long sequence of embeddings, so passing all of it to the head is really inefficient. Instead, we use a learnable set of queries that pull out only the information needed for action prediction. That gives the head a much more compact input to work with, so we get high performance while running inference much faster, with a 35% speedup. On top of that, we use a memory module to manage the tokens so that we can retain relevant context during rollout.
So that's the architecture. The next question was data. Robots are still rare. So how do you get enough high-quality manipulation data? That's where our data pipeline comes in.
We worked on building a data pipeline for RLDX. Real robot data from factories is very hard to collect because these environments are often busy and less organized than lab settings.
Also, getting access to these factories is not just a money problem. There are safety rules, access limits, and worker schedules. So for our model, we built a synthetic data pipeline based on video generation models. We first train the video model on a small set of real robot demonstrations.
Then we generate new videos that were not seen during training but can still happen in the real world, making the dataset about five times larger. These changes include different objects, work surfaces, lighting, and backgrounds. And I heard that your approach is a little bit different from mine. So how do you do it?
Oh, yeah. We are building RLDX for dexterous hands, and there's no better teacher for learning dexterity than the human hand.
We naturally adjust our grip pressure without thinking. We reposition our fingers mid-task. That's the kind of behavior the robots need to learn, and that data has to come from the real human hand. So we prepared this demo to show that capturing human hand data has large potential. The problem is that teleoperation can't capture that. A joystick or puppet arm is too slow, too imprecise, and never feels natural. So we built a data capture pipeline that lets skilled operators demonstrate with their own hands directly. And it's not just about capability, it's about scale. With our setup, we can collect more than 200 demonstrations in an hour.
So cool. Since our generated videos do not have action labels, we use inverse dynamics models to annotate the actions. This gives us video-action pairs. As a result, we can create a much broader range of situations than physical data collection alone can cover. However, using these pairs as they are can be risky. We introduce a curation method to filter incorrect pairs, because generative models may produce physically impossible motions or inappropriate action labels. We verify each pair by replaying the extracted actions in the simulator and discarding pairs where the replayed motion does not match the generated video.
And this is how our pipeline works. We track the human hand and the object together in 3D, reconstruct the workspace using 3D Gaussian splatting, and then retarget the entire trajectory onto a robot hand in simulation.
This is how a single human demonstration becomes useful training data for robots.
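Editor's sketch: the replay-based check on generated video-action pairs mentioned earlier in this segment can be illustrated as follows. The talk does not say how the match between replayed motion and video is measured, so the mean position error, the `max_error` threshold, the `VideoActionPair` structure, and the toy integrator standing in for a physics simulator are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class VideoActionPair:
    # End-effector (x, y) per frame as tracked in the generated video
    # (assumed to be available from some tracker).
    video_positions: list
    # Per-step (dx, dy) labels, e.g. from an inverse dynamics model.
    actions: list


def replay_in_toy_simulator(start, actions):
    """Stand-in for a real simulator: integrates displacement actions."""
    x, y = start
    trajectory = [(x, y)]
    for dx, dy in actions:
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory


def replay_error(pair):
    """Mean distance between the replayed motion and the video motion."""
    replayed = replay_in_toy_simulator(pair.video_positions[0], pair.actions)
    distances = [math.dist(p, q) for p, q in zip(replayed, pair.video_positions)]
    return sum(distances) / len(distances)


def filter_pairs(pairs, max_error=0.05):
    """Keep only pairs whose replayed motion matches the video."""
    return [p for p in pairs if replay_error(p) <= max_error]


video = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1)]
good = VideoActionPair(video, [(0.1, 0.0), (0.1, 0.1)])
bad = VideoActionPair(video, [(0.1, 0.0), (-0.3, 0.4)])  # label disagrees with video
kept = filter_pairs([good, bad])
```

Here `kept` contains only the pair whose action labels reproduce the motion seen in the video; the mislabeled pair is discarded.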
With the model and data in place, the final challenge was making it work reliably on real tasks.
As I mentioned before, post-training gives us a capable model, but capable and reliably deployable are different things. Simple imitation learning alone cannot get you there. Fine-grained manipulation is hard, and the accuracy is low.
With RL, instead of just copying demonstrations, the robot learns what actually helps with the task. It takes an action, sees the result, and judges whether there is progress. The question is how the robot sees the progress. We leverage a VLM's visual knowledge for it. With a bit of post-training on a small set of data, we could make the VLM capable of estimating progress for RL training. There is a strict requirement for training a real robot with RL, which is sample efficiency. Every physical trial costs effort and time, which is expensive.
So it was natural that we maximize the model's learning from its own past experiences. We fully leverage the benefit of batch off-policy data rather than constantly sampling from the model on the fly for every training step. Combined with progress-aware RL training, we achieve three times faster task completion than the imitation learning policy. The RL policy results in faster and smoother behavior.
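Editor's sketch: the talk does not give the reward definition or the training algorithm, so this is only one plausible reading of progress-aware training with reuse of past experience. It assumes a reward equal to the change in an estimated progress value between consecutive steps, and a simple replay buffer that lets the same stored transitions be sampled for many updates. `progress_rewards`, `ReplayBuffer`, and the example numbers are hypothetical.

```python
import random
from collections import deque


def progress_rewards(progress):
    """progress: per-step task-progress estimates in [0, 1], e.g. from a
    progress estimator. Reward is the change in estimated progress."""
    return [after - before for before, after in zip(progress, progress[1:])]


class ReplayBuffer:
    """Stores past transitions so they can be reused across updates."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add_episode(self, observations, actions, progress):
        rewards = progress_rewards(progress)
        for t, reward in enumerate(rewards):
            self.items.append(
                (observations[t], actions[t], reward, observations[t + 1])
            )

    def sample(self, batch_size):
        pool = list(self.items)
        return self.rng.sample(pool, min(batch_size, len(pool)))


# One recorded episode: progress stalls once and regresses once.
progress = [0.0, 0.2, 0.2, 0.5, 0.4, 1.0]
observations = [f"obs{t}" for t in range(len(progress))]
actions = [f"act{t}" for t in range(len(progress) - 1)]

rewards = progress_rewards(progress)
buffer = ReplayBuffer(capacity=100)
buffer.add_episode(observations, actions, progress)

# The same stored experience feeds several training batches,
# without running the robot again.
batches = [buffer.sample(4) for _ in range(3)]
```

With this reward, steps that move the task forward are rewarded, stalls earn nothing, and regressions are penalized, while the rewards over an episode sum to the total progress made.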
From just a small number of demonstrations, the model discovers its own efficient way to solve the task. Demonstrations show what to do. RL refines how to do it better.
Furthermore, on the data side, what the model needs is to focus on its failures and learn from them, just like using a wrong-answer notebook when you are studying. A data aggregation algorithm in imitation learning called DAgger makes this possible. Here's how it works. The post-trained model is deployed on the task. Meanwhile, any suboptimal or failure situations are checked. At the same time, an expert intervenes to correct those situations, and those corrections become a new additional dataset. These are aggregated into the existing dataset, and we retrain the model.
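Editor's sketch: the deploy, intervene, aggregate, retrain cycle just described can be illustrated with a toy intervention-based variant of DAgger (dataset aggregation). Everything here is an assumption for illustration: the 1-D task, the `expert_action` rule, and "training" as simple memorization of state-action labels.

```python
def expert_action(state, goal=5):
    """Hypothetical expert: step toward the goal on a 1-D line."""
    if state < goal:
        return 1
    if state > goal:
        return -1
    return 0


def train(dataset):
    """Toy 'training': memorize the most recent label for each state."""
    return dict(dataset)


def rollout_with_interventions(policy, start, steps=12):
    """Deploy the learner. When its action is missing or wrong, the expert
    overrides it and the correction is recorded as new data."""
    state, corrections = start, []
    for _ in range(steps):
        proposed = policy.get(state)
        correct = expert_action(state)
        if proposed != correct:
            corrections.append((state, correct))
            proposed = correct
        state += proposed
    return corrections


def dagger(initial_demos, starts, rounds=5):
    dataset = list(initial_demos)
    policy = train(dataset)
    interventions_per_round = []
    for _ in range(rounds):
        new_data = []
        for start in starts:
            new_data += rollout_with_interventions(policy, start)
        interventions_per_round.append(len(new_data))
        dataset += new_data      # aggregate corrections into the dataset
        policy = train(dataset)  # retrain on the aggregated dataset
    return policy, interventions_per_round


# Initial demonstrations only cover states near the goal.
demos = [(s, expert_action(s)) for s in (3, 4, 5)]
policy, interventions = dagger(demos, starts=[0, 9])
```

In this toy setting the expert has to step in during the first round, on states the demonstrations never covered, and no longer needs to once those corrections have been aggregated.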
This feedback loop is repeated. The performance of our model then increases gradually and finally converges. This is how we close the last mile between "it kind of works" and "it actually [unintelligible]."
Thanks for sharing all of that. So we started with a question: why hands? Because the tasks we focus on demand them: pouring the coffee, capturing the moving object on the line, assembling with precision.
Each one exposed a wall, each wall became a module, and together they became the single model that can see, feel, remember, and adapt. Dexterity is intelligence. That's RLDX.
And we are building the hands. Let robots work where people work. Let's get back to work.