Predictive world models, which learn conceptual representations of the world similar to how babies learn through observation (like object permanence), are essential for enabling practical XR and robotics applications on low-power devices, whereas generative world models that predict pixels are insufficient for real-time on-device understanding despite their current popularity.
Harvard XR: AI World Models
Okay. All right. Now, it's my great pleasure to introduce our second speaker, Matt Miesnieks. Matt is a senior entrepreneur with more than 15 years of experience working at the intersection of computer vision, spatial computing, and applied AI. He previously led business at Layar, one of the earliest augmented reality platforms, and later co-founded 6D.ai, a company that pioneered the concept of the AR cloud, enabling mobile devices to reconstruct and reason about physical spaces in real time. After its acquisition by Niantic in 2020, 6D.ai's technology became part of the infrastructure behind large-scale AR mapping. Today, Matt leads a flagship product for computer vision developers, with a focus on predictive world models and next-generation scene understanding tools. He is also a frequent speaker and writer on spatial computing, physical AI, and the future of machine perception. We're very honored to feature Matt today. Please welcome him.
>> Thank you.
Yeah, I've spent like 15, 16, 17 years or something now working in this space. And I loved how the theme of this event was from pixels to voxels. My last company, 6D.ai, was all about turning the world into voxels, and Niantic is doing an amazing job of doing that. However, the theme of pretty much everything I've worked on over the years has been: what are the things holding XR back, holding AR back, from the sci-fi experience that we all kind of imagine?
And voxels aren't enough. You can capture the whole world, but ultimately you've got to really start to understand what the world means. You've got to be able to understand the world the way we do. And the question now, and it's a very exciting topic in AI in general, is this idea of world models. I want to use this talk to answer the question: will world models make XR happen? And the answer, to give you a jump ahead, is no and yes. I'll explain why. XR, AR, it's a pretty simple thing. We just need to have these devices that can sense and perceive the world, take all the sensory information in. Then we just need to figure out what it all means, and understand ourselves, and then it improves our lives. It sounds pretty simple. But although sensing the world and seeing the world is, I don't know if it's completely solved, but we've got amazing devices, amazing sensors, that's on the way to being solved. When it comes to understanding, we're kind of at this point. I don't know if people have seen this scene, but if you've tried any of these AI glasses or XR glasses and said, "Tell me what I'm looking at," we're barely beyond the point of being able to tell whether something is a hot dog or not a hot dog. They can't really understand what they see.
The best models, the best systems out there, can recognize more than hot dogs, but it's pretty much just a long list of objects that they can recognize.
They don't really understand. And beyond that, they really only understand things where there's text associated with them in the training data.
The good news and the bad news around AI is that it has changed everything. It's obviously fundamentally changed how we work. It's sparked this insane boom. But what AI means today is large language models, and not just chatbots. Everything today is a language model to some extent. It's based on a generative AI architecture.
And they start by predicting the next token. When that comes to a chatbot, it predicts the next word. If you saw Arvy's talk earlier, it predicts the most probable next word that's going to come. So, when you've got text in English, it's 150,000 or so English words. You need a lot of data to predict that. Then we started to get image models, and you're starting to predict every pixel on a screen, where each pixel is a token. The models needed more data, more training, more compute.
They're more expensive, but they work pretty well. You then get into video models, like MiniMax was just showing us. Again, you now have the pixels, but you also have a number of frames per second, and you've got to understand, frame to frame, what those pixels mean. Again, more data, more training, more compute. And then we're getting to these world models. When people talk about world models in this context, 99% of what they talk about is this idea of a video model that generates immersive 3D video that is physically correct over time. So if you walk around that space, it doesn't jolt around. It doesn't change from frame to frame. If you hit something, it behaves like a game engine would behave. And they take even more data. They take everything that video models need plus all of the physics data about the world. And then when you want to get them into understanding the real world itself, you go from 150,000 or so English words to train on to everything that might possibly happen in the entire physical world at any point in time, to try and predict what's next. And you get this problem that is an infinitely large vocabulary of things to try and train your language model on.
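The growth in what has to be predicted, from words to images to video, can be made concrete with some rough arithmetic. This is only a sketch: the 150,000-word figure is the one quoted in the talk, while the image resolution and frame rate are assumptions chosen for illustration. Note that the first number is a count of choices for a single word, while the others are counts of values to fill in, so they show the trend rather than a like-for-like comparison.

```python
# Rough count of how much a model has to predict at each level.
# The image size and frame rate are illustrative assumptions, not figures from the talk.

ENGLISH_VOCAB = 150_000                  # approximate word count quoted in the talk
IMAGE_WIDTH, IMAGE_HEIGHT = 1024, 1024   # assumed resolution
FRAMES_PER_SECOND = 24                   # assumed frame rate


def values_per_image(width: int, height: int) -> int:
    """Number of pixels a pixel-predicting model must fill in for one image."""
    return width * height


def values_per_video_second(width: int, height: int, fps: int) -> int:
    """Number of pixels to predict for one second of video."""
    return values_per_image(width, height) * fps


image_values = values_per_image(IMAGE_WIDTH, IMAGE_HEIGHT)
video_values = values_per_video_second(IMAGE_WIDTH, IMAGE_HEIGHT, FRAMES_PER_SECOND)

print(f"choices for one next word: {ENGLISH_VOCAB:,}")
print(f"pixels in one image:       {image_values:,}")
print(f"pixels in 1 s of video:    {video_values:,}")
```

Under these assumptions a single second of video already involves tens of millions of pixel values, before any physics or 3D consistency is considered.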
And these models get so big and so expensive, and they still really don't capture everything. You then get new industries rising up to try and solve corner cases through reinforcement learning or simulated data, or by trying to think of all the things that could happen. That's how successful products today, like Waymo cars, are kind of trained, with this approach. But it's not cheap, and it's not something that's going to run in my glasses anytime soon.
The bet that everyone's making is this idea of the bitter lesson, which is that we all think there's a smarter way to do something, but it turns out that if you throw more data and compute at it, it kind of solves the problem. And that's held all the way through this approach so far. So that's kind of the state of the world today.
The real question, though, is almost philosophical, and it gets back to the term language. That word and that approach is the question that we really need to be thinking about, and it's the question that is really at the forefront of AI research right now.
Can we understand the world without language? Is language enough to understand the world in and of itself? Language models are trying to say, yeah, if we just have language, if we just have these tokens that represent things, we can figure out how they all connect to each other, and language is enough. But this has been a philosophical question for a while, and a philosopher called Martin Heidegger pushed this idea that language is actually a reflection of reality. It's not the understanding of reality in and of itself. And it's the idea of, I think Arvy mentioned a hammer before, can you actually understand what a hammer is just by describing it and the words around it? Or, in order to understand what a hammer is, do you actually need to pick it up and hammer something with it? The bet, the belief that I have, is that the second approach, that you need a direct conceptual understanding of something to actually understand it, is what we need for models to be able to understand the world, not just a linguistic description of it. So it's kind of fun as an entrepreneur that philosophy is actually starting to become our road map at the moment. These aren't theoretical, abstract conversations anymore. They're actually: what are we going to build, what's the algorithm, and how does this stuff work? But it's gotten really fun, especially if you've come from an XR background, in that we do need a totally new type of AI company, a whole new approach to building AI.
It needs to be different in how it learns. It needs to be different in what it outputs. But let's also call it a world model. Let's call it the same thing as these other models. But this is now a predictive world model.
So, I mean, for everyone who's been around XR for a while, we all love this idea of is it XR, MR, VR, AR, spatial computing, mixed reality, like all these words kind of mean the same thing. When you hear the word world model, just think of it as XR for AI in that it's a vague term that people, you know, apply it to be whatever they want it to mean.
But what I'm talking about is a very specific thing. It's this idea of a predictive world model. To explain what that means, let me first talk about generative world models. What they do is predict the pixels of what comes next. They're like a video model, in that the output of these models is more video, video that's consistent with how the world works. That's pretty much all they output, by design. And they are good for a few things. They may turn out to be good for entertainment, for gaming, film, all that sort of stuff. The jury's still out on whether that works, but it seems to be getting there. The place where they are gaining traction is in creating simulations. You can tell your world model to create a real-world scenario, then you put your virtual robot into this virtual scenario and you train it on the simulated data. And that's why I showed this idea of a snake eating its tail. They're building these models to create more data to train virtual devices, which feed back in to create more data, trying to solve that problem of simulating and training on the entire world. I don't believe that's tenable, and more and more people in the AI research community are swinging around to that point of view.
The type of world model I'm talking about is this idea of a predictive world model. The approach underneath it has been largely invented, and definitely championed the most, by Yann LeCun, who was recently head of AI at Facebook. He's just started a new startup called AMI with about a billion dollars in seed funding. The technology is called JEPA, if you want to look it up: J-E-P-A. It's a new way of training, and the metaphor is that instead of learning pixels about the world, how these pixels are represented, and matching and predicting the next pixel, it learns concepts. It learns directly in the latent space of the model. And that is very, very similar to how babies learn. I assume everyone's at least seen a newborn baby. They just kind of sit there with no language, no ability to communicate. They just sit there and watch the world. They're effectively watching videos of the world. And over time, they see things enough times. Maybe the family dog comes into the room enough times and they sort of get the idea that that's a thing that I recognize. And I get the idea of how it behaves in the world. I get what it can do and what it can't do.
And I recognize it even though I don't even know what words are. I just have this representation of a dog and the physical attributes, the conceptual attributes of it. And the baby builds this model not just of what things are, but of how they behave and also of how the physical world behaves. So, if you've ever played peekaboo with a baby, at some point in its development, when you cover your eyes, you're gone. The baby thinks you've disappeared, and then you show your face and it's like, "Oh, I'm back." And it's funny. It's exciting. You're gone. You're back. You're gone. You're back. And then one day it realizes that, hang on a minute, you're still there the whole time. You've never left. It gets this idea of object permanence: things are there even if I can't see them. And that type of learning is a very different way of learning from the way language models learn, whether it's a video model, a world model, whatever. The difference shows up in something like a self-driving car. It might be going down the street, there's a truck parked on the side of the road, and a kid comes down a driveway on a bicycle. The car will see the kid. It'll recognize there's a kid coming on a bike. Then the kid goes behind the truck. They're out of sight. For that period of time, in the language model's mind, that kid does not exist anymore. They've gone. They've played peekaboo. They've disappeared.
And then they pop out again in front of the car, and the car has only that fraction of a second from when they reappear to react. A JEPA-style model, which intuitively understands this idea of object permanence, will say: they were moving, they were really there, and just because I can't see them doesn't mean they're gone. I can predict that they're about to come out from behind the truck. And so the device, a car in this case, is able to predict what's about to happen next in the world through this intuitive understanding of what things are, how they behave, and how the physical world works. That's kind of the fundamental difference in learning.
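The behavior described in the truck example, continuing to believe in an object while it is hidden and predicting when it will reappear, can be sketched in a few lines. To be clear, this is not JEPA and not how any production driving stack works: it is a hand-written constant-velocity tracker standing in for the learned behavior, and the `Track`, `step`, and `seconds_until` names and the scenario numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Track:
    """Last believed position (m) and velocity (m/s) of one object along the street."""
    position: float
    velocity: float


def step(track: Track, observed: Optional[float], dt: float) -> Track:
    """Advance the track by dt seconds.

    If the object is visible, update position and velocity from the observation.
    If it is occluded (observed is None), keep believing it exists and
    extrapolate with the last known velocity instead of dropping the track.
    """
    if observed is None:
        return Track(track.position + track.velocity * dt, track.velocity)
    velocity = (observed - track.position) / dt
    return Track(observed, velocity)


def seconds_until(track: Track, edge: float) -> Optional[float]:
    """Predicted time until the object reaches `edge`, or None if it is not approaching it."""
    if track.velocity <= 0 or track.position >= edge:
        return None
    return (edge - track.position) / track.velocity


# Hypothetical scenario: a cyclist moving at 2 m/s; the truck hides positions 4 m to 8 m.
observations = [0.0, 1.0, 2.0, 3.0, None, None, None]  # None = hidden behind the truck
dt = 0.5
track = Track(position=observations[0], velocity=0.0)
for obs in observations[1:]:
    track = step(track, obs, dt)

print(track)
print(seconds_until(track, edge=8.0))
```

A system without this kind of persistence would drop the track at the first `None` and only react once the cyclist reappears; here the estimate keeps advancing, so the time of reappearance is available in advance.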
Additionally, they learn at a higher level of abstraction about the world. If I asked any of you to think back about, say, your childhood home where you grew up, everyone will be able to remember that. But in our minds, we don't have a pixel-perfect 3D image of that home. We just have a bunch of concepts: maybe what our bedroom was like, what it smelled like, cooking in the kitchen. There are all these conceptual ideas that don't take up a lot of data, whereas if you asked a language model, "Hey, show me the house I grew up in," it would generate every single pixel. That takes both a lot more inference compute to generate and more data to train. The JEPA models are only remembering and storing and training on these concepts, which are like a compression of reality into a conceptual form. So the models are not only cheaper to train and need less data to train, they're also much more efficient in their inference. So they can run on much lower-power hardware.
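The contrast between predicting pixels and predicting concepts comes down to where the training error is measured. The toy sketch below compares the two objectives on the same pair of frames. It assumes a fixed random linear projection as the encoder and uses tiny made-up sizes (`PIXELS`, `LATENT`); a real JEPA-style system learns its encoders and predictor, so this shows only the shape of the idea, not an implementation of it.

```python
import random

random.seed(0)

PIXELS = 64   # size of a flattened toy "frame" (assumed)
LATENT = 4    # size of the concept embedding (assumed)

# Hypothetical fixed encoder: a random linear projection from pixels to a small embedding.
ENCODER = [
    [random.gauss(0, 1) / PIXELS ** 0.5 for _ in range(PIXELS)]
    for _ in range(LATENT)
]


def encode(frame):
    """Map a frame (list of PIXELS floats) to a LATENT-dimensional embedding."""
    return [sum(w * x for w, x in zip(row, frame)) for row in ENCODER]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pixel_loss(predicted_frame, target_frame):
    """Generative-style objective: compare every predicted pixel with the target pixel."""
    return mse(predicted_frame, target_frame)


def latent_loss(predicted_embedding, target_frame):
    """JEPA-style objective: compare the predicted embedding with the target's embedding."""
    return mse(predicted_embedding, encode(target_frame))


context = [random.random() for _ in range(PIXELS)]
target = [random.random() for _ in range(PIXELS)]

# Trivial placeholder predictors: both simply guess "nothing changes".
print("values compared per step (pixels):", len(target))
print("values compared per step (latent):", len(encode(target)))
print("pixel loss: ", pixel_loss(context, target))
print("latent loss:", latent_loss(encode(context), target))
```

The pixel objective has to account for all 64 values of the frame, while the latent objective only has to match a 4-value summary, which is the sense in which the talk describes concepts as a compression of reality.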
So that's all really good. But then there's one piece missing that you know like in this image you have all this conceptual understanding but you still need to be able to communicate with it.
And that's where language models come in. You train a language model to sit in front of a JEPA model, and that lets you have input and output to the brain, to the world model. It's the combination of the two where we think things get really, really interesting, because you could take in an image from a camera, you could take in words, you could take in sounds through the language model, then map them to the concepts and have some sort of understanding of what's going on, and then communicate that back, either to the robot system or to a person.
So what does it mean in terms of XR? And I put an asterisk next to XR. These world models are really going to change how every device that is in the world works. I mentioned self-driving cars before. It could be a robot: a humanoid robot, a dog robot, a vacuum cleaner in your house. It could be the drones that are flying around doing all sorts of stuff. As well as glasses that need to understand what they see in order to then communicate back to you through some content on your display or through some audio feedback or something like that. The really cool thing, and this is a problem that, god, we wrestled with for years and years at 6D, is that idea of being more robust. It's just two little words, but anyone who's worked with computer vision in any way knows how easy it is to break those systems. You can get them working in a narrow environment where they recognize hot dogs, and then you show it a hamburger and it breaks. That could be for robots: you take it out of the factory that it's trained in, it breaks. You put it in at nighttime, it breaks. It rains, the model breaks. Getting a general understanding of the world that works anywhere is a really, really hard problem to solve. It's why a really difficult problem for robots today is crossing the street. It's something where, with a 5-year-old kid, you show them maybe 10 times and they know how to cross the street. A robot with god knows how many billions of dollars and hours of training still can't do that. Or maybe it can cross the street in one place, just at the crosswalk, but if you drop it into a different country with different traffic laws and dirt roads instead of tarmac roads, it's lost. The difference is the child has a conceptual understanding of what a road is.
And it doesn't matter whether it's a particular sidewalk here in Cambridge or a dirt road in Africa or some beach track near where I grew up in Sydney. That conceptual understanding means that the systems can understand things irrespective of the background, irrespective of the environment. It recognizes the concept.
It's something that Avi also touched on before with LLMs. He showed you the machine with the balls, where it always picks the most likely thing. JEPA learns the other way around. It learns what is normal and then pays attention to what's unusual.
It's exactly the way the baby I talked about before learns, sitting in its crib. It's looking at the room. A dog walks into the room for the first time and it'll startle and go, "What's that?" and all of its attention goes onto that. It's exactly how we learn. If I'm on a video call and there's just my background behind me and you're chatting away, and then a clown walks behind me, all of your attention is going to go onto this weird thing, this outlier thing that's happened. JEPA learns the same way. And again, it's more about how the real world works. We want our devices to naturally pay attention to outlier situations.
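The idea of learning what is normal and then attending to what is unusual can be illustrated with a very small running-statistics detector. This is a stand-in, not the mechanism JEPA uses: in a learned model the "surprise" would more plausibly be a prediction error in the model's latent space, whereas here it is a single scalar signal compared against its running mean and standard deviation (Welford's algorithm). The class name, threshold, and sample signal are all hypothetical.

```python
class SurpriseDetector:
    """Tracks a running mean and variance of a scalar signal (Welford's algorithm)
    and flags values that fall far outside what has been seen so far."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5):
        self.threshold = threshold  # how many standard deviations count as "unusual"
        self.warmup = warmup        # observations needed before flagging anything
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean

    def observe(self, value: float) -> bool:
        """Return True if `value` is surprising, then fold it into the running statistics."""
        surprising = False
        if self.count >= self.warmup:
            std = (self.m2 / (self.count - 1)) ** 0.5
            surprising = abs(value - self.mean) > self.threshold * max(std, 1e-9)
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return surprising


# Hypothetical signal: a quiet background, then one large change (the "clown" moment).
signal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 9.0, 1.0]
detector = SurpriseDetector()
flags = [detector.observe(v) for v in signal]
print(flags)
```

Only the value that departs sharply from the established background is flagged; everything that matches the learned "normal" is ignored, which is where the savings in attention and compute would come from.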
I mentioned how they're smaller.
The big thing, the end result of all of this, is that these models will run on our devices today, on our hardware, and they have this general understanding of the world. That ability to run on your device is everything. No one wants your Tesla to be calling back to the cloud to check if it's safe to change lanes. These things have to happen in real time. Our view is that the AGI future that we're all going to hopefully get to one day will split roughly into type one and type two thinking. The type one quick decisions, how do I move around the world, how do I do simple things, how do I achieve tasks, will all be done with on-device AI that's small, real-time, conceptual. All of the pondering, all of the planning, all of the big decisions, all of the thinking through and researching things, will be passed off to the cloud and handled by these giant frontier models that run in the cloud.
So where are we today in all this? These generative world models are what people mean today when they say world models. They're getting billions of dollars in funding. They are generating cool models. There are companies like World Labs that are doing some great work. Luma, Google, OpenAI, they're all building these interactive 3D worlds that can be prompted. There are some products coming on the market. It's kind of happening now. The question is, will it actually accomplish the big goals? Predictive world models, on the other hand, are just getting attention right now. We can recognize objects. We can recognize actions. We're starting to recognize intuitive physics. But the next exciting thing is how do we start learning the relationships between things? A dog can walk, a dog cannot drive a car.
People can be friends. All these types of things are still the interesting part of building these models of how the world works, and where things get really exciting and philosophical quickly.
So my bet, if you ask me what to do, what am I doing, where am I putting my time and money and energy right now: the bitter lesson, even if it's right, is still wrong. Even if it does achieve all this understanding of the world, the way of achieving it, with more and more data and more and more compute, disqualifies it, because this has to run on device. Very low-power devices: low compute, low wattage.
That approach will never, ever solve the problem of this on-device, type one thinking about the world. I think that predictive models are going to be what makes XR happen. It's going to be what makes robotics happen, what makes a lot of these smart devices actually become useful in a step-function way, in the same way that ChatGPT made text useful and interactive.
And to back that, I've started a new stealth company. It's called Primate Intelligence, and we're building these JEPA-based predictive world models to run on devices. The reason we called it Primate Intelligence is summed up by this great quote by the head of robotics at NVIDIA, and this quote came out after we named the company, so I was quite happy to see it. It's about that question of whether we can understand and interact in the world without language.
It's already been proven that this exists, in that primates, apes, have a very sophisticated ability to act in the world with no real ability, or only a tiny ability, with language.
So, we want to solve that problem of enabling building this type of AI that can understand the world in the same way that primates can and let other people solve the language approach.
That's it. Thank you.
Happy to take questions if we have time.
>> Thank you. I worked for LeCun in 2017 and I was wondering what he was up to. But do you think that enough is understood about human consciousness and human learning to build a model to train an AI?
No. I think that AI is its own thing. We're not trying to build human consciousness in silicon. It's going to be something that behaves similarly to how conscious humans behave. But whether we can say it's actually conscious, I mean, the example is that for all of us, half or the majority of our decision-making comes from our gut, from our gut biome, and that drives us in the things we do. It isn't something that happens in our brain. We barely comprehend how we actually understand and act as biological creatures, and AI, silicon machines, will never, ever be that. So they're always going to be something different.
I think about it as: can they be useful, and can they do things that help us? We've evolved to act and interact with the world in ways that suit us. The built environment is all set up for our biology. If we can come up with machines that can function in that environment, then hopefully, with that power applied responsibly, we can come up with devices that help us in lots of ways. But the question of whether AI can replicate human consciousness, I just think it's apples and oranges. I don't think it can ever do so. But I'm guessing. Everyone's guessing. I don't know.
Just a quick question. What's your take on potential applications for multisensory world models? Like if you were to really think about human applications of touch, olfactory, other?
>> We are starting with just the sense of vision, like computer vision. We're training it based on what it sees. Obviously the idea of multimodal input and sensing is huge. If you want to, say, control a humanoid robot, you want it to understand not just audio commands but audio context, and you want it to be able to understand force and gripping and motion and all of these types of inputs. And not even just human inputs: take multispectral imaging, for example, and seeing into infrared and all this stuff. For the models, it's not that big a deal to take them as inputs and train the model in a multimodal way. The science project that no one's really understood yet is what I alluded to before. We understand sensing, we understand objects, we're starting to understand actions in the world, and some relationships between things: a dog can't drive a car, a dog can run, a dog cannot fly. That's all straightforward. But when you get to what's right and wrong, like when I walk into a room, what should my model pay attention to versus what should it not pay attention to? A fireman will pay attention to different things than a chef would when walking into a room. So those types of questions, around taking all that input and then making good judgments about what the model should pay attention to, those are the fun ideas to think about, where nobody, not even in research, has really got clear ideas on how that stuff's going to play out yet. That's next.
All right. Well, thank you. Thanks everyone. All right. I'm around.