Yann LeCun's $1B Bet Against LLMs [Part 2]
Added: 2026-06-04

8,022 views · 761 · 40:57 · WelchLabs · Original release: 2026-05-30

LeCun’s pivot toward JEPA highlights the fundamental flaw in relying on behavioral cloning for true agency, favoring causal world models over mere statistical mimicry. It is a sophisticated bet that structural depth will eventually outpace the superficial scaling of current VLA architectures.

[00:00:00]This video is sponsored by KiwiCo. More on them later. The startup Physical Intelligence builds some of the most impressive robot brains ever demonstrated. Here's their PI07 model.

[00:00:12]Peeling a zucchini, folding a pinwheel, and taking out the trash. PI07 is a vision-language-action, or VLA, model.

[00:00:20]What's your expectation here? Do you think JEPA-based approaches will eventually overtake VLA approaches?

[00:00:25]>> Oh, absolutely. Yeah, VLAs are doomed. I mean, they they basically don't work really well.

[00:00:30]>> Last time we followed Yann LeCun's path to JEPA, an alternative architecture for building AI models. Like VLA models, JEPA approaches can also control robots. But JEPA's demonstrated capabilities are significantly behind.

[00:00:45]Here's JEPA taking 60 seconds to move a cup off a platform.

[00:00:50]So what makes LeCun so confident here?

[00:00:53]Are these VLA approaches that look incredibly impressive right now actually doomed? VLA models are in many ways the pinnacle of the current mainstream generative, language-driven approach to AI.

[00:01:06]VLA models are built on top of VLMs, vision language models and VLMs are in turn built from vision encoders and large language models.

[00:01:17]At each level of the VLA stack, there exists an alternative JEPA based approach with various trade-offs and in some cases impressive advantages.

[00:01:26]In this video, we'll work our way up this alternative stack. We'll see how a video-based model called V-JEPA 2 compares to the language-supervised encoders that we find in many modern AI systems. From here, we'll tackle vision language models. These include AI assistants like ChatGPT and Claude.

[00:01:45]Interestingly, we can reframe how these models are trained using a JEPA approach and achieve some impressive results. Finally, we'll zoom out into a full robot control system. This is where Yann's philosophical differences are the most pronounced. I do not understand how you can even think of building an agentic system without an agentic system having the ability of predicting the consequences of its actions.

[00:02:12]>> Okay. And VLA doesn't doesn't do that.

[00:02:16]>> Sure. Right. Do not have world models.

[00:02:18]>> We'll explore exactly how JEPA learns a world model that can be used for robot planning and control and see what advantages this approach might have over VLA approaches.

[00:02:31]Modern AI systems have become remarkably good at bringing together vision and language. Chatbots can give highly detailed descriptions of images. And we now can even go the other way, mapping text descriptions to incredibly realistic images and video. Much of this progress can be traced back to a 2021 OpenAI paper and model called CLIP. In part one of this JEPA series, we saw how contrastive learning could be used to train joint embedding architectures by training our encoders to output similar vectors for corrupted and non-corrupted versions of the same image and to output dissimilar vectors for different underlying images.

[00:03:11]CLIP works in a similar way, but instead of using corrupted and non-corrupted views of the same image, CLIP uses image-caption pairs, where images are passed into a vision encoder and captions are passed into a separate text encoder model. From here, the CLIP algorithm maximizes the similarity of the embedding vectors produced by matching image-caption pairs while minimizing the similarity of the embedding vectors produced by non-matching image-caption pairs. For more on CLIP, see the video we did on diffusion models with 3Blue1Brown or chapter nine of the Welch Labs Illustrated Guide to AI.
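The matching and non-matching objective described above can be sketched as a symmetric contrastive loss. This is a minimal NumPy illustration, not the original training code: the function names, the random stand-in embeddings, and the temperature value of 0.07 are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N image-caption pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    The temperature is an assumed hyperparameter for this sketch.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(img))
    # Each image should pick out its own caption, and each caption its own image.
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

# Stand-in embeddings (random, purely illustrative).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
matched_images = text + 0.01 * rng.normal(size=(4, 8))   # aligned pairs
shuffled_images = matched_images[[1, 2, 3, 0]]           # deliberately misaligned

loss_matched = clip_style_loss(matched_images, text)
loss_shuffled = clip_style_loss(shuffled_images, text)
print(loss_matched, loss_shuffled)
```

When the pairs line up, the loss is low; when the images are shuffled against their captions, it rises, which is the signal that pulls matching pairs together during training.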

[00:03:48]After training, the CLIP vision and text encoders can be repurposed into a wide range of AI systems.

[00:03:56]One common application is making large language models multimodal.

[00:04:01]When you give an AI assistant an image, the image is typically passed into an image encoder model that was most likely trained using a CLIP-like approach. The encoder extracts meaningful information from the image that can then be used by the LLM. This combination of a vision encoder and an LLM is often referred to as a vision language model or VLM.

[00:04:25]Now let's consider a JEPA-based alternative to the popular CLIP algorithm. V-JEPA 2 was trained by a team at Meta in 2025 on 1 million hours of video and uses up to 1 billion parameters, making it one of the most ambitious JEPA models trained to date.

[00:04:43]As we saw last time, in the JEPA architecture, we pass our inputs X and our outputs Y into encoder models which each return embedding vectors or matrices. From here, a separate predictor model predicts the embedding of Y given the embedding of X. The V-JEPA 2 team used a self-supervised training approach where video clips are corrupted by removing patches. The corrupted and uncorrupted video clips are fed into encoder models and the predictor is trained to predict the embeddings of the missing patches.
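To make the masked-prediction setup concrete, here is a toy sketch of the loss computation. Everything in it is an assumption for illustration: the linear "encoders", the position codes, the mean-pooled context, and the L1 distance stand in for the much larger networks and exact objective a real system would use.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, emb_dim = 16, 12, 8

# Toy linear stand-ins for the networks (hypothetical, for illustration only).
W_context = rng.normal(scale=0.1, size=(patch_dim, emb_dim))  # encoder for the corrupted clip (X)
W_target = W_context.copy()                                   # encoder for the full clip (Y)
W_pred = rng.normal(scale=0.1, size=(emb_dim, emb_dim))       # predictor
pos = rng.normal(scale=0.1, size=(num_patches, emb_dim))      # tells the predictor which patch to fill in

def jepa_loss(patches, masked_idx):
    """Distance between predicted and actual embeddings of the removed patches."""
    visible_idx = np.setdiff1d(np.arange(len(patches)), masked_idx)
    # Encode only the visible patches and pool them into one context vector.
    context = (patches[visible_idx] @ W_context).mean(axis=0)
    # Encode the uncorrupted clip and keep the embeddings of the removed patches.
    targets = (patches @ W_target)[masked_idx]
    # Predict each removed patch's embedding from the context plus its position.
    preds = (context + pos[masked_idx]) @ W_pred
    # The loss lives in embedding space; no pixels are reconstructed.
    return np.abs(preds - targets).mean()

clip_patches = rng.normal(size=(num_patches, patch_dim))  # stand-in for video patches
masked_idx = np.array([3, 4, 7, 8])                       # patches removed from the clip
loss = jepa_loss(clip_patches, masked_idx)
print(loss)
```

In actual training this loss would be minimized by gradient descent on the encoder and predictor weights; the sketch only shows which quantities are compared.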

[00:05:17]And the big idea here is that by learning to fill in the missing pieces of videos, our JEPA model will learn how video, and by proxy how the world shown in these videos, works.

[00:05:30]Just like the CLIP image encoder, our V-JEPA 2 model takes in images or videos and returns embedding vectors. Note that natively CLIP only supports images, but is often used to process videos one frame at a time. Now, what would happen if we swapped in the V-JEPA 2 encoder for a CLIP vision encoder in a vision language model?

[00:05:52]Yann LeCun's new venture, AMI Labs, has a line on their landing page that really gets at the heart of LeCun's philosophy.

[00:06:00]Real intelligence does not start in language. It starts in the world.

[00:06:06]While CLIP and V-JEPA both produce trained vision encoders that take in images and video and return embedding vectors, their training objectives are remarkably different. V-JEPA is blissfully unaware of language, exclusively trained to predict the missing parts of video, while CLIP is trained to produce embeddings that match the embeddings of the language descriptions that we give to our images through captions.

[00:06:32]So V-JEPA is not aided by or constrained by the language that we've invented to describe the world. The model can learn how to represent concepts like cats however it wants as long as those learned representations help the model fill in the gaps in videos of cats.

[00:06:50]However, this flexibility raises an important question for applications like the vision language models we're exploring. Will V-JEPA 2 learn representations that our language model can actually use? Will a model trained exclusively on vision be able to interface with a model trained exclusively on language?

[00:07:09]The V-JEPA 2 authors go on to show that not only does this work, but that swapping in the V-JEPA 2 encoder achieves state-of-the-art results on a set of video understanding benchmarks.

[00:07:21]As the authors say: we show that a video encoder pre-trained without language supervision can be aligned with a language model and achieve state-of-the-art performance, contrary to conventional wisdom.

[00:07:34]These video understanding benchmarks include a range of skills.

[00:07:39]Here's one example from the TempCompass benchmark where the model is shown a video of a person picking up a pineapple and given multiple choice options about what's happening. Interestingly, in a variant of this question, the video is played in reverse, changing the correct answer. For reference, in our testing, ChatGPT 5.5 gets this question wrong for both forwards and backwards videos, and only some versions of Claude and Gemini get the correct answer. So, V-JEPA 2 shows that, remarkably, a JEPA-based approach can produce competitive and, for some benchmarks, state-of-the-art results when used to train the vision portion of vision language models. Now, this is still very much a hybrid approach, applying JEPA to the vision portion of our model while our full VLM still uses standard generative next token prediction objectives on language.

[00:08:31]But is it possible to apply the JEPA architecture to our full VLM? In the most widely used VLM architecture, our images or video are passed into our vision encoder and the resulting embedding vectors sometimes with modifications are passed into our LLM.

[00:08:48]Our prompt is tokenized and also passed into our LLM. From here, our LLM directly outputs text one token at a time. Now, let's see if we can map our VLM architecture to a JEPA architecture.

[00:09:02]Following the JEPA approach, instead of directly generating output text, we pass our target output text into an encoder model and train a predictor model to predict the embedding of our output text.

[00:09:16]Aside from this new prediction target, the rest of our standard VLM architecture actually maps pretty cleanly to the JEPA architecture. Both architectures already pass their inputs into encoders.

[00:09:28]In our standard VLM architecture, our vision embeddings and prompt are passed into our large language model. In our JEPA architecture, our predictor model takes in our embedded images or video.

[00:09:40]And as we saw last time, we can also pass additional information into our predictor model. This is known as conditioning.

[00:09:47]Here we can pass our prompt directly into our predictor, giving our predictor model access to both vision and text inputs.

[00:09:55]So architecturally, the language model in our VLM architecture and the predictor model in our JEPA architecture have very similar jobs and take the same inputs.

[00:10:06]The key difference here is that our JEPA predictor model's targets are the embeddings of our output text, not the output text itself.

[00:10:14]So, how does this JEPA version of a vision language model stack up? Last time we saw that a key advantage of the JEPA architecture was not having to reconstruct full outputs. In theory, the encoder model will extract the salient features of our output while ignoring extraneous details. Yann gave a nice example. If you train a generative model, you know, to predict what's going to happen in the dash cam video, uh, it will spend most of its resources predicting the random motion of the leaves on the trees that are bordering the road, and and those are things that are essentially not predictable, but they have a lot of pixels, you know, that move around.

[00:10:53]>> A similar argument can be made for the language outputs in VLMs. If we ask a VLM if it's safe to eat a mushroom shown in a picture, there's a variety of ways the model could phrase a correct answer.

[00:11:04]But our training data likely only includes one phrasing. So if the correct answer according to our training data is "do not eat this mushroom," but our model instead returns "this mushroom is not safe to eat," the model will be penalized during training for what is essentially a correct answer. Alternatively, with a JEPA architecture, these phrases are mapped to very similar embedding vectors, abstracting away irrelevant semantic differences in our prediction targets.

[00:11:33]In late 2025, a research team at Meta showed that this vision language JEPA architecture, which they called VL-JEPA, produced some impressive efficiency gains. In a controlled experiment where VLM and VL-JEPA architectures are given the exact same vision encoder and trained using the same data and training configuration, the VL-JEPA architecture learns significantly more quickly, reaching a video classification accuracy of 35% after 5 million training examples compared to an accuracy of just 20% for the traditional VLM architecture. So by learning to predict the embedding of our target text Y instead of Y itself, VL-JEPA is able to learn significantly more efficiently, arguably by abstracting away the irrelevant semantic details of the target training text. This efficiency increase can lead to impressive results, including outperforming significantly larger models on visual question answering benchmarks.

[00:12:34]The GQA compositional reasoning benchmark includes tricky visual reasoning questions like figuring out from this image if there is any fruit to the left of the tray the cup is on top of. Impressively, on this benchmark, VL-JEPA was able to outperform 7 billion parameter models while using just 1.6 billion parameters.

[00:12:58]Now there is an important wrinkle when using VL-JEPA: since the model is not generative, it does not by default spit out answers to questions.

[00:13:08]The team worked around this limitation in a couple of ways.

[00:13:12]One approach is to pass a given image and question into the model to produce a predicted embedding vector and then pass in all possible answers for a given benchmark into the Y encoder and choose the answer that produces the most similar embedding vector to the predicted embedding vector. This is like giving VL-JEPA multiple choice options to the benchmark questions.
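The answer-selection step described above amounts to a nearest-neighbor lookup in embedding space. Here is a small sketch, assuming cosine similarity as the similarity measure; the three-dimensional vectors are hand-made placeholders rather than real encoder outputs, and `pick_answer` is a hypothetical helper name.

```python
import numpy as np

def pick_answer(predicted_emb, candidate_embs):
    """Return the index of the candidate closest to the predicted embedding.

    Cosine similarity is an assumption of this sketch.
    """
    pred = predicted_emb / np.linalg.norm(predicted_emb)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = cands @ pred                     # one similarity score per candidate
    return int(np.argmax(sims)), sims

candidates = [
    "picking up a pineapple",
    "putting down a pineapple",
    "cutting a pineapple",
]
# Placeholder outputs of the Y encoder, one row per candidate answer.
candidate_embs = np.array([
    [1.0, 0.1, 0.0],
    [-1.0, 0.1, 0.0],
    [0.0, 0.2, 1.0],
])
# Placeholder output of the predictor for a given video and question.
predicted = np.array([0.9, 0.0, 0.1])

best, sims = pick_answer(predicted, candidate_embs)
print(candidates[best])
```

Because every candidate answer is embedded once and compared with a single dot product, this kind of evaluation needs no text generation at all.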

[00:13:34]Finally, the team also experimented with training text decoders to map VL-JEPA's predicted embeddings to text, allowing VL-JEPA to act like a generative model at inference time.

[00:13:46]So the JEPA framework has some really interesting overlap with the vision language models behind AI chat assistants, providing a path to potentially stronger vision encoders like V-JEPA 2 and, through architectures like VL-JEPA, an embedding space training objective that allows models to learn more efficiently.

[00:14:06]But what about the vision language action models we saw at the beginning of the video? These models effectively turn LLMs into robot brains, taking pre-trained vision language models and training them to output robot control signals, given instruction prompts and feeds from the robot's cameras and sensors.

[00:14:26]Early VLA models had the large language model directly output robot control signals, while more recent implementations, including the PI07 model we saw earlier, use a separate model called an action expert to interface with the language model and output final control signals. Check out the Welch Labs video on VLA to see exactly how these fascinating models work.

[00:14:50]Interestingly, VLA models are where we find the strongest contrast with LeCun's JEPA philosophy. What's your expectation here? Do you think JEPA-based approaches will eventually overtake VLA approaches?

[00:15:01]>> Oh, absolutely. Yeah, VLAs are doomed. I mean, they they basically don't work really well.

[00:15:06]>> So, what exactly does Yann see as the big issue with VLA and how does JEPA address it?

[00:15:13]How do JEPA and LLMs compare to human learning? LeCun has an interesting take here, showing with some back-of-the-envelope math that the average four-year-old has actually taken in more bytes of information through their visual cortex than even the largest LLM will see in all of its training text. If you find yourself thinking about how the children in your life are learning, check out this video's sponsor, KiwiCo.

[00:15:36]KiwiCo makes hands-on project kits that make learning genuinely fun for kids of all ages. My son is dinosaur obsessed right now. So, this dinosaur dig crate was absolutely perfect. His language is really progressing and it's wild to hear him pronounce these complex dinosaur names.

[00:15:57]>> Tiff.

[00:15:59]>> And assembling these intricate puzzles is great for developing his spatial reasoning. I had to borrow the crate to take these overhead shots and he literally has not stopped asking for it back. My daughter gets a little anxious at the doctor sometimes, and this doctor kit is great for getting her used to all the parts of her checkups. She loves following along with this checklist.

[00:16:21]As usual, the thoughtfulness and attention to detail are what really set KiwiCo crates apart from many of the toys that we have, gently pulling my kids' playtime in the learning direction.

[00:16:32]The KiwiCo team really invests in and pays attention to learning outcomes.

[00:16:36]They recently teamed up with Johns Hopkins on a study of the impacts of using KiwiCo crates in the classroom and found that teachers consistently reported improved student motivation, engagement, and confidence when using KiwiCo crates.

[00:16:50]KiwiCo crates make amazing gifts for the kids and families in your life. And they make awesome learning experiences for kids of all ages. Use my code Welch Labs to receive 50% off your first monthly crate for kids three and older and 20% off your first Panda crate for kids under three. Big thanks to KiwiCo for sponsoring this video. Now, back to JEPA.

[00:17:14]LeCun's critique of VLA boils down to two main points: the difficulty of scaling behavioral cloning and the lack of explicit planning.

[00:17:24]Let's hear Yann's take on behavioral cloning first.

[00:17:27]>> Oh, absolutely. Yeah, VLAs are doomed. I mean, they they basically don't work really well. Okay. I mean, the only way to get them to work is to essentially collect tons and tons and tons of, uh, examples, you know, up or or or something else or or if it's in the digital world, it's just, you know, people playing with, uh, user interface and whatever. Uh, and then just do behavior cloning. And that's only practical for a very small number of, uh, applications and for applications where the degree of variability is not too high, because those systems, basically, when they face a new, a slightly new situation, are completely helpless. So so they're they're brittle, right?

[00:18:11]>> Human demonstrations are a critical training data source for many VLA implementations, including the Physical Intelligence PI models.

[00:18:20]Training data sets are often captured using sophisticated controllers where the robot mimics the positions of the operator's hands. And Yann's point here is that this approach is simply not scalable. It's impossible to collect human demonstration data for every single variation of every single task we want the robot to perform.

[00:18:40]Now, it's important to point out here that VLA models have been shown to generalize to new tasks outside of their training demonstrations.

[00:18:48]In fact, the breakthrough moment for VLA models back in 2023, where Google's RT-2 VLA moved a Coke can to a picture of Taylor Swift, was a breakthrough because the human demonstration data did not have anything to do with Taylor Swift.

[00:19:03]So, to complete the task, RT-2 had to connect the concept for Taylor Swift that its internal vision language model had learned during pre-training to the actions for moving objects it had learned later from human demonstrations.

[00:19:17]Since this breakthrough in 2023, VLA models have advanced rapidly. The Physical Intelligence team has demonstrated their robots performing a range of tasks not present in their human demonstration data, including taking Tupperware in and out of the microwave, replacing paper towel rolls, and loading and unloading air fryers.

[00:19:37]Now, of course, the ability to generalize is on a sliding scale. While these exact tasks were not in the human demonstration data, similar tasks were.

[00:19:48]And if we ask a Physical Intelligence-powered robot to do something too different from its demonstration data, it will likely fail.

[00:19:56]The big question here, the question that Physical Intelligence and many others are working to address, is whether or not VLA models will be able to generalize well enough beyond their demonstration data to make reliable and useful robots.

[00:20:11]Yann's second big criticism of VLA models is the lack of explicit planning. VLA models are trained and deployed end to end. At each time step, a new set of camera images and robot joint positions come in and the model is trained to directly output the next set of joint positions.

[00:20:29]The robot then moves to these new positions. New images are taken and the process is repeated. This is wild when you think about what VLA models can do.

[00:20:40]In this demonstration from Physical Intelligence, the robot has to do this intricate dance of handing the key back and forth between grippers to get it in just the right position to open the lock. The internal LLM is somehow reasoning about how the key needs to be held and is able to break this outcome down into this repeated shuffling maneuver between grippers to get it just right. The challenge here is that we have limited control of and visibility into this planning process. We're more or less left with a black box that takes in text instructions and camera images and spits out actions.

[00:21:18]I do not understand how you can even think of building an agentic system without an agentic system having the ability of predicting the consequences of its actions.

[00:21:31]>> Okay. And VLA doesn't doesn't do that.

[00:21:34]>> Sure.

[00:21:35]>> Right. LLMs do not have world models.

[00:21:37]They cannot predict the consequences of their actions beforehand. They just take the action and then, le déluge, as, uh, you know, as some famous French king said. So, uh, if you really want to build reliable agentic systems, they absolutely have to be able to predict the consequences of their actions so that they can plan a sequence of actions to do something: first of all, to, uh, fulfill the task that they are being asked to fulfill, but also, uh, perhaps to, you know, guarantee some safety guardrails. Sure.

[00:22:12]>> Right.

[00:22:13]>> And the inference process now becomes a search as opposed to just an autoregressive prediction.

[00:22:18]>> Right.

[00:22:18]>> Uh, so that's a world model, that's the whole idea of a world model. Unlike VLA, LeCun's approach to world models using JEPA does not learn end to end and does not learn to imitate humans through behavioral cloning. Instead, the JEPA architecture is used to learn an action-conditioned world model that can then be used to explicitly plan actions.

[00:22:42]This is a task called Push-T, where a robot is tasked with moving this T-shaped object to a final position marked on the table. The task is a bit trickier than it looks because it's difficult to predict how the T will translate and rotate based on exactly how it's pushed by the robot's end effector.

[00:23:00]The robot's actions are limited to effectively 2D joystick controls. We can move the end effector up, down, left, or right. Let's see how LeCun's world model approach works on a simulated version of Push-T.

[00:23:15]Here the brown T is the target position, the blue T is the object that we push around, and our control inputs move the yellow end effector. First we learn a world model using JEPA by taking images and actions recorded from Push-T. At each step we train our predictor to predict the embedding of the next image of the environment given the embedding of the current image and some action taken, shown here using arrow keys. Here we're learning from trajectories recorded from humans performing the Push-T task.

[00:23:47]This is a similar setup to the behavioral cloning we see with VLA. But the big difference is that the model is not learning to mimic human actions, but instead to predict what will happen next in the world given some action.

[00:24:01]Now things get really interesting. Given some initial configuration, we can pass this image into our encoder and get an embedding vector for our starting position.

[00:24:12]From here we can pass in any action we want into our predictor model and the predictor will return its estimated next state of the world based on our action.

[00:24:21]Now this prediction is still an embedding vector. So it's hard for us to understand what exactly the model is really predicting here. But for simple environments like Push-T, it turns out that we can train a separate decoder model that will map these predicted embedding vectors back to images of the environment.

[00:24:40]And remarkably, when we do this, the results make a ton of sense. If we pass in this starting position and a movement upward, the end effector in our decoded images moves upward.

[00:24:52]Here's a movement to the left, to the right, and down. From here, we can chain actions together, at each step passing the predicted new state of the world back into our predictor and passing in our latest action. So our JEPA-trained world model is essentially a learned video game, a learned simulated version of the world that we can use to plan actions and observe their consequences.
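Chaining predictions like this is a simple loop in embedding space. The sketch below uses a made-up linear predictor with random weights purely to show the data flow; a trained world model would replace `predictor`, and the names `A`, `B`, `ACTIONS`, and `rollout` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, action_dim = 6, 2

# Hypothetical stand-in for a trained predictor: next_z = z @ A + action @ B.
A = np.eye(emb_dim) + 0.01 * rng.normal(size=(emb_dim, emb_dim))
B = 0.1 * rng.normal(size=(action_dim, emb_dim))

def predictor(z, action):
    # Predict the embedding of the next frame from the current embedding and an action.
    return z @ A + action @ B

def rollout(z0, actions):
    """Feed each prediction back into the predictor together with the next action."""
    z = z0
    trajectory = [z0]
    for action in actions:
        z = predictor(z, action)
        trajectory.append(z)
    return np.stack(trajectory)

# 2D joystick-style actions, as in the Push-T setup.
ACTIONS = {
    "up": np.array([0.0, 1.0]),
    "down": np.array([0.0, -1.0]),
    "left": np.array([-1.0, 0.0]),
    "right": np.array([1.0, 0.0]),
}

z0 = rng.normal(size=emb_dim)  # stand-in for the encoder's embedding of the start image
plan = [ACTIONS[name] for name in ("up", "up", "left", "down")]
latents = rollout(z0, plan)
print(latents.shape)  # one embedding for the start state plus one per action
```

Note that no images are produced inside the loop. A decoder is only needed if we want to look at what the model is imagining.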

[00:25:19]Using our prediction loop and decoder, we can compare what happens in our learned world model to the real thing.

[00:25:26]Here's 18 steps of actions taken in our learned world model and in our real Push-T environment.

[00:25:33]These match remarkably well. We do see some inconsistencies and drift, but overall our JEPA model has learned the dynamics of our Push-T environment remarkably well. Here's four more comparisons between our learned world model and the real Push-T environment.

[00:25:51]The top frames show the world-model-generated rollout, passing the output of the predictor back into its input after each step, and the bottom frames show the real environment following the same actions. We generally see good agreement, but our learned world model does go off the rails sometimes. In practice, this instability limits how far we can reasonably look into the future when planning using these world models. The Push-T model implementation we've been experimenting with is from a JEPA implementation called LeWorldModel. LeWorldModel is trained from scratch on Push-T. As we've seen, our model inputs are raw pixels and actions. And remarkably, from this data alone, our world model learns the physics of the environment, including the fact that our blue T is rigid and movable, and the complex interaction between our end effector and the T. Looking inside our JEPA model's learned world like this is fascinating. It's like a learned cartoon sketch of the dynamics of the Push-T world.

[00:26:55]From here we can use our world model to explicitly plan a set of actions instead of learning to directly imitate human actions as we would with VLA approaches.

[00:27:05]And then if you have this, you can, uh, predict the outcome of a sequence of actions, and you can, by optimization, you can figure out an optimal sequence of actions to arrive at a particular outcome. Right? This is classical optimal control. To plan a course of actions, the LeWorldModel team used a very general planning method called the cross-entropy method, or CEM. Given a starting image and a goal image, CEM starts with a completely random set of actions. Here's 500 randomly chosen trajectories for our end effector.

[00:27:38]From here, we use our world model to select the most promising trajectories.

[00:27:43]This trajectory bounces around a bit and then bumps into our T. Using our world model, we can predict what would happen if we were to follow this path. Note that the LeWorldModel team groups steps of actions together into groups of five and passes these actions into the predictor all at once. So our first batch of actions moves our end effector down and to the right, and our world model simulation matches this behavior. From here, we can continue our rollout five steps at a time, with each batch passing our embedding space prediction from our previous batch into our predictor along with our latest five actions. After our randomly chosen 25 steps, our world model predicts that our end effector will rotate our T, not really moving it any closer to its goal state. To measure how much closer or farther a given trajectory takes us from our goal, we compute the embedding of our goal image and then measure the Euclidean distance between our final predicted embedding vector and the goal embedding vector.

[00:28:47]From here, we perform the same rollout process for each randomly chosen path and compute the same distance metric for each path. Let's color each path according to its distance in embedding space to our goal image. Here's our best performing path. It looks a bit random, but if we visualize our decoded world model predictions, we see that this path actually bumps into our T twice, pushing it towards our goal.

[00:29:16]From here, our top performing 30 trajectories are grouped into an elite set, and the mean and standard deviation of this elite set are used to sample a new set of trajectories.

[00:29:26]This process is repeated again and again until we're left with a tight set of candidate trajectories and ultimately a final planned path.
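The sample, score, and refit loop described above can be sketched in a few lines. The numbers that come from the video are the 500 sampled trajectories, the 30 elites, the groups of five 2D actions (10 values per predictor call), and the Euclidean distance to the goal embedding. Everything else is an assumption for illustration: the linear `toy_predictor`, the number of iterations, and the unit-variance starting distribution.

```python
import numpy as np

def cem_plan(z0, z_goal, predictor, horizon=5, action_dim=10,
             num_samples=500, num_elites=30, iterations=10, seed=0):
    """Cross-entropy method planning carried out entirely in embedding space."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))   # start from an uninformed distribution
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences around the current distribution.
        noise = rng.normal(size=(num_samples, horizon, action_dim))
        candidates = mean + std * noise
        # Roll every candidate forward through the world model.
        z = np.repeat(z0[None, :], num_samples, axis=0)
        for t in range(horizon):
            z = predictor(z, candidates[:, t])
        # Score each candidate by Euclidean distance to the goal embedding.
        costs = np.linalg.norm(z - z_goal, axis=1)
        # Keep the best candidates and refit the sampling distribution to them.
        elites = candidates[np.argsort(costs)[:num_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

# Hypothetical linear world model standing in for a trained predictor.
emb_dim = 4
B = 0.1 * np.random.default_rng(1).normal(size=(10, emb_dim))

def toy_predictor(z, action_chunk):
    return z + action_chunk @ B

def final_distance(z0, z_goal, actions):
    # Roll out a single plan and measure how far it ends from the goal.
    z = z0
    for chunk in actions:
        z = toy_predictor(z, chunk)
    return np.linalg.norm(z - z_goal)

z0 = np.zeros(emb_dim)                        # stand-in for the start image embedding
z_goal = np.array([1.0, -0.5, 0.25, 0.0])     # stand-in for the goal image embedding
plan = cem_plan(z0, z_goal, toy_predictor)
print(final_distance(z0, z_goal, plan))
```

The only feedback the planner receives is the distance between embeddings, so the quality of the plan depends directly on how well distances in the learned embedding space track progress toward the goal.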

[00:29:35]And what's really remarkable here is that our planning happens completely in the model's learned embedding space.

[00:29:42]The score we give each possible path guides the entire planning process and is computed as the distance between the final predicted embedding vector for each path and the goal embedding vector.

[00:29:55]We can now follow our planned path and see that our end effector nicely pushes our T towards its goal.

[00:30:03]Our resulting system cleanly addresses LeCun's critiques of VLA. It does not learn by imitating humans. So the system does not need to see how a human would solve the task, but can instead find solutions on its own using its world model and an explicit planning process.

[00:30:21]However, while the architecture of LeWorldModel is elegant and free from these concerns, the performance these models have shown to date is dramatically behind VLA. On the Push-T task, LeWorldModel can only reliably plan about five prediction loops in advance, limiting the model to relatively simple manipulations.

[00:30:41]When I'm trying to imagine a JEPA-powered robot kind of doing a long-horizon task like cleaning a kitchen for 10 minutes, for example, right? Um, in my head, right, it's hard for me to imagine, even in embedding space, uh, the predictor being able to see 10 minutes ahead moving around a kitchen. That seems like, uh, longer than I would expect, right? Um, is that where hierarchical starts to matter?

[00:31:02]What what are your thoughts on long-horizon tasks with JEPA?

[00:31:04]>> Yeah, you have the answer in your question. The answer to this is, uh, hierarchical models. Okay, so what's a hierarchical model? It's one where, uh, at a low level you make detailed predictions.

[00:31:18]>> Mhm.

[00:31:18]>> But you don't but you don't predict long term because the more detail you preserve about the prediction, >> the more your prediction is likely to diverge from reality very quickly. Yeah.

[00:31:29]>> Right.

[00:31:30]>> And so you you train low levels in the predictor to make short-term prediction with a lot of details, >> which sometimes you need because you know you need to know exactly what's going to happen when you grab an object, right? you need to grab exactly the right way and things like this.

[00:31:44]>> So you need a lot of information, but then if you want to make longer-term predictions, then you can only do them with fewer details about what you predict >> right >> uh, and so that, you know, the your your your prediction does not diverge from reality.

[00:32:00]>> What what would what would the interface be like between the layers of the hierarchy?

[00:32:06]Well, the the same kind of interface that exists between various layers of a deep neural net. That's exactly what >> Sure. Yeah. So, it's in some embedding space, the interface between layers. It doesn't have to be semantic or uh certainly not language, right?

[00:32:19]>> No language. I mean, your cat your cat can do hierarchical planning. So, you know, they don't have language, right?

[00:32:25]>> Right. Yeah.

In LeCun's proposed solution, hierarchical world models, we can tackle longer-horizon planning by planning simultaneously at different levels of abstraction. Yann and his collaborators recently applied a hierarchical world model approach to Push-T and other tasks and, using two levels of hierarchy, were able to extend the planning horizon in Push-T from five time steps to 15. Interestingly, the predictions from the higher-level world model serve as subgoals for the lower-level world model and planner.

>> And you can't plan a long action in terms of millisecond-by-millisecond muscle control.

[00:33:06]>> Sure.

[00:33:07]>> Mostly because you don't have the information most of the time. The example I use very often is: if I'm sitting in my office at NYU and I want to be in Paris tomorrow,

[00:33:16]>> Sure.

>> I cannot plan my entire trip in terms of millisecond-by-millisecond muscle control. I don't have the information, in addition to the fact that it would be impossible to do the planning. So you go to a higher level of abstraction. A high-level abstraction would be: well, I need to go to the airport and catch a plane. That's a high-level plan, right? And I have a subgoal, which is going to the airport. I'm in New York City, so I go down to the street and hail a taxi. And then I have a sub-subgoal, going down to the street, etc. And at some point in the hierarchy, you have all the information you need, and it's a task you are used to doing, like standing up from your chair or walking to the elevator.

>> Right. And do you think that if we have the right architecture for the hierarchy, then the hierarchy will be learned, just as in CNNs, where we kind of magically learn this hierarchy of features? Do you expect that if we have the right hierarchical JEPA architecture, that will just be emergent, basically?

>> That's kind of the hope, yeah, totally: that the system will discover the appropriate hierarchical representation by being trained to make short-term predictions at a low level and longer-term predictions at a higher level. And so the hope is that, through this type of prediction-based self-supervised learning, the system will learn a good hierarchy of representations. But it probably requires training on kind of semi-expert trajectories. You can't learn high-level things if you train on completely random observations.

>> Totally.
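The two-level idea described above, where the higher-level model's predictions become subgoals for the lower-level planner, can be sketched in a toy grid world. This is only an illustration under stated assumptions, not the method from LeCun's group: both "world models" here are hand-written rules rather than learned predictors, states are plain grid coordinates rather than embeddings, and every name (`low_model`, `high_model`, `plan`, `STRIDE`, and so on) is hypothetical.

```python
from itertools import product

# Hypothetical toy setup: states are (x, y) grid positions.
MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
LOW_HORIZON = 5   # steps the low-level planner can look ahead
HIGH_HORIZON = 3  # subgoals the high-level planner can look ahead
STRIDE = 5        # low-level steps summarized by one high-level step


def low_model(state, move):
    """Detailed short-term prediction: one unit step."""
    return (state[0] + move[0], state[1] + move[1])


def high_model(state, move):
    """Coarse longer-term prediction: one jump of STRIDE units."""
    return (state[0] + STRIDE * move[0], state[1] + STRIDE * move[1])


def distance(a, b):
    """Manhattan distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan(model, state, goal, horizon):
    """Exhaustive search for the action sequence whose predicted
    end state is closest to the goal (first best sequence wins)."""
    best_seq, best_cost = None, None
    for seq in product(MOVES, repeat=horizon):
        s = state
        for move in seq:
            s = model(s, move)
        cost = distance(s, goal)
        if best_cost is None or cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq


def hierarchical_plan(start, goal):
    # High level: roll out the coarse model; its predicted states are subgoals.
    subgoals, s = [], start
    for move in plan(high_model, start, goal, HIGH_HORIZON):
        s = high_model(s, move)
        subgoals.append(s)
    # Low level: plan detailed steps toward each subgoal in turn.
    state, trajectory = start, [start]
    for subgoal in subgoals:
        for move in plan(low_model, state, subgoal, LOW_HORIZON):
            state = low_model(state, move)
            trajectory.append(state)
    return subgoals, trajectory


start, goal = (0, 0), (10, 5)

# Flat planning with only the low-level model falls short of the goal.
flat_end = start
for move in plan(low_model, start, goal, LOW_HORIZON):
    flat_end = low_model(flat_end, move)

subgoals, trajectory = hierarchical_plan(start, goal)
print("flat planner ends", distance(flat_end, goal), "units from the goal")
print("subgoals:", subgoals)
print("hierarchical planner ends at", trajectory[-1])
```

In this sketch the flat planner can only look 5 steps ahead and stops well short of a goal 15 steps away, while three coarse subgoals let the same short-horizon planner cover all 15 steps.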

[00:34:54]>> Yeah. Interesting.

LeCun's vision for JEPA world models in the future of AI is well considered and compelling, but it's still early for JEPA. V-JEPA 2 and VL-JEPA give us some powerful glimpses into what the framework can do and show that the JEPA approach is not incompatible with the current mainstream language-driven approach to AI.

[00:35:18]But when we zoom out to agentic and robotics problems, JEPA-driven world model approaches are still quite limited, and there are many unanswered research questions.

[00:35:29]30 years ago, as Yann worked on early deep learning systems to recognize handwritten digits, these systems probably felt pretty limited, just as the Push-T demonstrations feel limited today. The fact that these core deep learning ideas could be scaled up to the powerful AI systems we have today is remarkable. Could JEPA follow a similar trajectory? Is Yann's billion-dollar bet on JEPA completely right, part of a larger solution, or just a dead end?

How will we know over the next two, three, five years if this world model JEPA approach is working? What would be a good next two, three, five years at AMI Labs?

[00:36:08]>> So within a year or two, we'll try to apply the world model, planning, etc. to a number of industrial applications.

[00:36:20]>> Cool.

[00:36:21]>> Okay. And this is not necessarily a business model or to generate revenue.

[00:36:24]It's more to gain experience with pushing this type of methodology into practical applications. And the ideal set of applications would be essentially controlling complex systems whose behavior cannot be reduced to a small number of equations.

[00:36:44]>> Okay? Because for something like a simple robot arm, or even a humanoid robot, you can just write down the dynamical equations.

[00:36:51]You need to identify a few coefficients, but you can just write down the equations. Or, you know, you're NASA and you're shooting a rocket to go to the moon: you have a complete dynamical model of the rocket and you can plan the entire trajectory. But what about an entire jet engine, or an entire airplane for that matter, or a chemical plant, or a power plant, or a patient with a disease like, say, diabetes, right?

[00:37:24]What course of treatment should you follow to control the blood sugar of the patient? If you have a good predictive model of the state of the patient, you might be able to design a course of treatment. Or how would you tell a stem cell to turn itself into a beta cell for the pancreas to produce insulin, right? There are a lot of complex systems like this that you simply cannot reduce to a small number of equations, but you might be able to produce a phenomenological model of them from data and then use that to control them. And it's true again of complex systems in industry or chemistry or whatever, right? There's a lot of really promising work in materials science and chemistry where this kind of idea is there: you train a phenomenological model of a complex collective phenomenon, and then you use it to design new materials, new catalysts for chemical reactions, new batteries, etc. Very promising. So that would be the first applications. And then eventually, a few years from now, three, five years from now, the hope is that we might become the main supplier of intelligent systems, whatever the application is.
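The pattern described above, fitting a phenomenological model from data and then using it for control, can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: the "system" is a made-up scalar linear process (not a model of a patient, an engine, or a plant), the data is noise-free, the fit is ordinary least squares rather than a learned world model, and all names (`true_system`, `a_hat`, `b_hat`) are hypothetical.

```python
import random

random.seed(0)


def true_system(x, u):
    """Hidden dynamics. Used only to generate data and to simulate
    the real system; the controller never reads these coefficients."""
    return 0.9 * x + 0.5 * u


# 1. Collect data by driving the system with random inputs.
xs, us, ys = [], [], []
x = 1.0
for _ in range(50):
    u = random.uniform(-1.0, 1.0)
    y = true_system(x, u)
    xs.append(x)
    us.append(u)
    ys.append(y)
    x = y

# 2. Fit the model y ~ a*x + b*u by least squares (2x2 normal equations).
sxx = sum(xi * xi for xi in xs)
sxu = sum(xi * ui for xi, ui in zip(xs, us))
suu = sum(ui * ui for ui in us)
sxy = sum(xi * yi for xi, yi in zip(xs, ys))
suy = sum(ui * yi for ui, yi in zip(us, ys))
det = sxx * suu - sxu * sxu
a_hat = (sxy * suu - sxu * suy) / det
b_hat = (sxx * suy - sxu * sxy) / det

# 3. Control: at each step, pick the input that the fitted model
#    predicts will bring the state to the target.
target = 2.0
state = 0.0
for _ in range(5):
    u = (target - a_hat * state) / b_hat
    state = true_system(state, u)

print("fitted a, b:", round(a_hat, 4), round(b_hat, 4))
print("final state:", round(state, 4))
```

The systems LeCun lists are far too complex for a two-coefficient fit like this one; the sketch only shows the loop of observing, fitting a predictive model, and planning actions against that model.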

[00:38:50]>> Amazing. Maybe we can talk again in a few years and we'll see all the progress. I'm excited.

[00:38:54]>> Right. Exactly.

[00:38:57]>> If you enjoyed this video, check out the companion poster.

[00:39:02]We've been calling this graphic the Web of AI. It follows the path to the current mainstream approach to AI and LeCun's alternative path to JEPA, and really nicely shows how discriminative, generative, and joint embedding approaches fit together. The bottom of the poster includes visual summaries of the models we covered in this video:

[00:39:23]V-JEPA, VL-JEPA, and LeWorldModel. Our designer Sam used this really great texture on the Web of AI animations and we really wanted to retain this feel for the poster. We found this premium fine art rough paper from Canon that has this really great matte textured finish. It looks awesome. You can get the JEPA poster on this textured paper or a more traditional smooth finish. You can pick up the JEPA poster and the Welch Labs Illustrated Guide to AI at welchlabs.com.

[00:39:56]This two-part JEPA series clocked in at well over an hour and required hundreds of hours of research, writing, animation, and editing. To help us make more in-depth videos like this, please consider supporting Welch Labs on Patreon. We're finally planning some Welch Labs merch for later this year.

[00:40:16]All patrons will be able to vote on designs, and we're adding a new tier that includes early access to merch drops. At the $5 per month or higher level, we'll ship you a real paper cutout from a video. We typically ship what we've just finished shooting. So, if you sign up today, you'll likely receive a cutout from the JEPA video.

[00:40:37]Huge thank you to Yann LeCun and everyone else who helped make this series. I really hope we're able to interview Yann again in a few years and see how JEPA progresses.
