This is a brilliant but impractical experiment that mistakes visual imitation for actual computation. Replacing reliable code with probabilistic guessing makes for a great demo but a fundamentally useless operating system.
Deep Dive
AI Simulated OS Is Absurd
Never ask a woman her age, a man his salary, or a developer their operating system. Because in the year 2050, the argument might not be as simple as "which Linux distro do you use?" It might just be "what training curriculum did you give your operating system?" Being hacked may no longer come from some sort of cool software edge case. It'll probably become something lame like, "Oh, I gained administrator access by brainwashing the system into thinking that I am its father." And that might not even be a joke, because you will need to raise your own system yourself.
On the other hand, do you guys remember AI-generated Doom? It's like an AI video generator that generates game footage of Doom, and it takes mouse and keyboard inputs so you can interact with that footage. So the game is not running deterministically on ones and zeros; it actually lives in the neural network's parameters. I had a video about it that you can check out. Or if you have seen the AI-generated Minecraft, it's basically the same idea.
But in its latest form, Genie 3, published by Google, the quality has improved so much that the previous two examples just look like toy experiments, and it can generate outside the domain of a specific video game or world. So if a system like a game can be simulated entirely by AI within its parameters, doesn't that mean a computer system can also be simulated? Well, just because you can doesn't mean you should. But what would that look like? In the latest AI research news, a new paper called "neural computer" has shaped the earliest form of what I described earlier, which is an AI-generated computer system.
And before we have a good laugh at how absurd this idea is, we already know how big the AI industry will be in the coming years, as you can see from how the stock market is now slowly pricing in the mass adoption of AI. And if you don't want to get left behind, it is still not too late to start. Whether you are a student, a software dev, or just want to pivot into AI, my latest project, Intuitive AI Academy, is the perfect place for you to dive into the basics of LLMs all the way to the frontier techniques, with everything explained intuitively, ranging from LLM architectures, MoE, and LoRA to our latest chapters on reinforcement learning, where we cover how RL works and how it interacts with LLMs, accompanied by our latest interactive visualizations to help you better understand its logic. So for those who want to get into AI or LLMs, this should be the perfect place to dive into the technical parts without being intimidated by crazy-looking maths. And right now we are offering a summer discount, so use the code summer for 25% off a yearly plan.
Anyways, as bold as its goal sounds, the idea of what this paper defines as a neural computer is relatively straightforward.
It's basically a system that unifies computation, memory, and input/output in a learned runtime state. So if you look at AI Doom or Minecraft, they are basically miniature versions of systems that require these three things, now compressed into the parameters of an AI, more specifically an architecture that is commonly used to generate AI videos. But why AI videos? Well, the general idea of using AI as a simulator definitely started a long time ago.
According to the renowned researcher Jürgen Schmidhuber, who co-invented LSTM and is also the supervisor of this neural computer paper, he was the first to pioneer the idea of the world model, back in the 1990s. Well, whether that's true or not is up to the reader's interpretation, but honestly, he is a citation machine using Twitter like an academic journal. Anyways, it only became a big deal, at least for me, when OpenAI first introduced Sora and people realized that a consequence of AI video generation being so good is that these models are essentially simulating physics in both 2D and 3D. Not only that, they can basically simulate anything, and by anything I mean anything that can be visually observed. Like in the next few years, an AI video generator can probably generate a video of a person presenting PowerPoint slides end to end, including generating the slides' content, because it can simulate the world.
But what even is the benefit of having these world models? Well, a very creative application is in robotics and autonomous driving, like Waymo, where they are used to simulate diverse environments cheaply and train the autonomous systems on those environments. But in the case of simulating games or even an operating system, it's really hard to imagine the benefits. Before we judge too hard, though, let's first take a closer look at how this neural computer works and what it can do so far.
So similar to an AI video generator, its underlying architecture is mainly a diffusion transformer, which basically means it also uses transformers like an LLM, but instead of the key objective being next-token prediction, it is doing diffusion. And if you don't know what diffusion is, the idea is to repeatedly denoise a noisy image into the target, where the direction of the denoising process is learned during training. So you can slowly see a cat appear when the model is conditioned on the concept of a cat. But instead of images, here it's being done to create videos.
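To make the denoising idea concrete, here is a minimal toy sketch in Python. It is not the paper's model: `toy_denoiser` is a hypothetical stand-in that cheats by knowing the target, whereas a real diffusion model learns to predict the noise from training data. The 4-value `cat_image`, the step count, and the step size are all made-up assumptions for illustration.

```python
import random


def mean_abs_error(a, b):
    """Average absolute difference between two equally sized lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def toy_denoiser(noisy, target):
    """Hypothetical stand-in for a trained network.

    A real diffusion model learns to predict the noise in `noisy`;
    here we compute it directly from a known target so the loop runs
    without any training.
    """
    return [n - t for n, t in zip(noisy, target)]


def denoise(target, steps=20, step_size=0.2, seed=0):
    """Start from Gaussian noise and remove a fraction of the predicted noise per step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    errors = [mean_abs_error(x, target)]
    for _ in range(steps):
        predicted_noise = toy_denoiser(x, target)
        x = [xi - step_size * ni for xi, ni in zip(x, predicted_noise)]
        errors.append(mean_abs_error(x, target))
    return x, errors


# A 4-"pixel" image standing in for whatever the model is conditioned on.
cat_image = [0.9, 0.1, 0.8, 0.2]
result, errors = denoise(cat_image)
```

Each pass shrinks the remaining noise by a fixed fraction, so `errors` decreases step by step, which is the "slowly see a cat" effect in miniature. A video model does the same kind of thing over a stack of frames instead of a single image.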
And specifically for this neural computer setup, the video is not simply visual output. It is actually the interface itself, as the model is trained exclusively on operating system recordings. So the model generates the screen of a terminal or a desktop frame by frame, as if it were a running computer session. And instead of just calling them frames, they talk about a latent runtime state, because at every step the model takes in not only the current frame, which is what the screen looks like right now, but also the user's input, like typing, mouse movement, and clicks. This updates the internal state, which then generates the next frame. So it's basically a continuous loop that updates the hidden state and renders the next screen.
Then if you want to pull up the terminal and execute commands through the OS, the model has to learn to simulate the entire interface dynamics inside its own weights and latent state. In the paper, they train two different models: one is the easier case, where the model only needs to learn to simulate a CLI, while the other is the entire OS. For the CLI, it basically needs to understand that the terminal is a place to display the user's input. And when the user executes the line, it needs to understand the input command and roll forward to generate future terminal frames. All of this already sounds extremely hard to achieve.
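The loop described above, where state is updated from the user's action and then the next screen is rendered, can be sketched like this. Everything here is a hand-written assumption for illustration: `RuntimeState`, `update_state`, and `render_frame` are hypothetical names, and the rules inside them are ordinary code, whereas in the paper both steps live inside a learned model and the state is a latent vector rather than readable text.

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeState:
    """Hypothetical, human-readable stand-in for the latent runtime state."""
    lines: list = field(default_factory=list)  # finished terminal lines
    current: str = ""                          # the line being typed


def update_state(state, action):
    """Fold one user action into the state (hand-written here, learned in the paper)."""
    if action == "ENTER":
        return RuntimeState(lines=state.lines + ["$ " + state.current], current="")
    return RuntimeState(lines=list(state.lines), current=state.current + action)


def render_frame(state):
    """Produce the next 'screen' from the state alone."""
    return "\n".join(state.lines + ["$ " + state.current + "_"])


def run_session(actions):
    """Continuous loop: update the state, then render the next frame."""
    state = RuntimeState()
    frames = []
    for action in actions:
        state = update_state(state, action)
        frames.append(render_frame(state))
    return frames


frames = run_session(["l", "s", "ENTER", "p", "w", "d"])
print(frames[-1])
```

The point of the sketch is only the shape of the loop. In this toy version the state is exact, so the screen can never drift; in the neural version the state is approximate, which is where the wrong letters and numbers come from.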
And in their demo, even though we can't see how good the input latency is, you can still see that it is able to write out the SQL commands pretty well, and it executes some other basic commands. But if you look a lot closer, some of these numbers or letters are pretty wrong. Like the loading bar versus the percentage is just very off. Same for some letters. But if you look at it from pretty far away, it does look really believable.
As for running the simulation on the full operating system, it becomes a much harder problem because now the model is no longer just dealing with text laid out in lines. It has to understand an entire visual environment where every pixel can change depending on user actions. So instead of just predicting the next line in the terminal, it now has to predict how a desktop behaves. That includes cursor movement, clicking buttons, opening menus, dragging windows, typing into fields, and all the tiny visual feedback that comes with it. So if you move your cursor, it has to move the cursor in the next frame. If you click something, it has to trigger the right UI change. If you open a menu, it has to render the dropdown correctly.
But here's where it gets significantly harder than the CLI. In a terminal, most of the structure is symbolic and constrained. Text follows rules, commands have predictable outputs, the layout is relatively simple, and the CLI's window is fixed too. But to simulate an operating system with a GUI, everything is much more continuous and spatial. The terminal should still work if its window is moved somewhere else, and small mistakes in positioning, timing, or rendering can completely break the illusion. So not only does this make it harder for the model to learn the causal relationships between actions and screen changes, but tracking the cursor adds a whole other level of difficulty.
In one of their experiments, once they explicitly render the cursor as a visual object and supervise it at the pixel level, accuracy jumps massively to around 98.7%, up from only 8.7% when the cursor was simply observed. This is achieved through cross-attention. And the intuition here is that instead of forcing the model to blend the action signal into the same stream as the pixels, you give it a separate channel for actions and then let every part of the model directly look at that channel when it needs to. So inside each transformer layer you now have two things: the visual tokens, which represent the screen, and the action tokens, which represent things like cursor movement or clicks. Each visual token produces a query and attends over the action tokens as keys and values. This means that every pixel region in the image can selectively pull the relevant action signal. So if the cursor moved to the top right, the model would know that only the regions near that area need care, while everything else can mostly be ignored.
This is much better than just concatenating actions at the input, where the model would have to carry that information through many layers and hope it gets used correctly. Cross-attention directly injects the signal into the computation at every layer, so the model can repeatedly align what happened with what should change on the screen. But even then it is still far from perfect.
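The query/key/value wiring described above can be sketched in a few lines of plain Python. This is a bare-bones illustration under stated assumptions, not the paper's layer: a real transformer applies learned projection matrices to get queries, keys, and values and uses multiple heads, while this sketch uses the tokens directly. The 2-dimensional `visual_tokens` and `action_tokens` are made-up numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(visual_tokens, action_tokens):
    """Each visual token (query) attends over the action tokens (keys and values).

    Assumption: identity projections, single head. Real layers learn
    separate query, key, and value projections.
    """
    dim = len(action_tokens[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for q in visual_tokens:
        weights = softmax([dot(q, k) / scale for k in action_tokens])
        mixed = [sum(w * v[i] for w, v in zip(weights, action_tokens))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two screen regions and two action signals (all values are made up).
visual_tokens = [[4.0, 0.0], [0.0, 4.0]]
action_tokens = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = cross_attention(visual_tokens, action_tokens)
```

Here the first screen region puts most of its attention weight on the first action token and the second region on the second, which is the "every region pulls the action signal it needs" behavior in miniature. Repeating this at every layer is what separates it from concatenating the actions once at the input.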
In the desktop demos, clicking "English" on Wikipedia does create a reaction, but it displays words that are definitely not English. Clicking the color gradients in the image editing tool makes the color sliders just go haywire because the model doesn't understand how they should act. Or in general, it just doesn't have the correct reaction when an action is done. There are just so many more things and interactions for the model to learn.
But conceptually, what the paper calls the complete neural computer should be able to replace the operating system's program execution and rendering pipeline with a learned state update plus a rendering loop, where the latent state handles the memory and the execution context, the transformer is the compute step, and the video frames are the input and output interface. So the model itself is the computer. But from what I've shown you so far, this obviously does not look that great. It's only very good at looking like a computer, somewhat good at reacting like a computer within something like a 5-second time frame, and weak at actually computing things symbolically or reliably, let alone having any state persistence. I highly doubt the context window is even that long, too, as the video demos they have only last about 5 seconds.
So now, addressing the elephant in the room: why would anyone need an operating system that, when you open a picture, might just give you something else? For a system like this, the most important thing is probably state persistence. And to be honest, I don't know how we'll overcome that, because it's a built-in limitation of AI models. So unless transformers in general overcome this hurdle, achieving what they call a complete neural computer would be pretty much impossible. And even if it is achievable, the way we use this system will definitely not be the same as what we expect today. For example, installing a program will no longer mean downloading binaries and executing them on an operating system. It becomes something closer to teaching the neural computer a behavior.
Like if you want Photoshop, you are conditioning the model until it knows how Photoshop behaves and can reproduce that interaction loop on demand. And once it's learned, if you provide any screenshot of Photoshop, it can boot up that scene without any extra memory usage or loading, which sounds really cool, right? But the bigger idea here is not just faster loading. It's that the whole notion of software starts to dissolve. Because in this setup, you're not launching Photoshop as a separate process. You are just shifting the model into a different behavior. The app is not something external anymore. It's something that the model has internalized. So instead of opening a program, you're essentially conditioning the next few frames into that program's dynamics.
And that's why the concept of a program could change: it's no longer blocks of code sitting on a disk. It becomes something more like a reusable capability that can be learned, invoked, combined, and updated over time. So you could imagine a future where you don't install Minecraft. The model just learns how Minecraft works. And when you want to play it, you just load a screenshot of it. You don't open a file; you reconstruct it from a learned representation. All of this can be done under fixed memory and compute. And there are also questions like: how does it access the internet?
Would that be another conditional input, just like the cursor? And is it truly free of memory? The model still needs a computer system to run on, and achieving all these deterministic behaviors within a non-deterministic setup actually just sounds like we're doing things for the sake of doing things. And of course, all of this only works if the system can reliably hold on to state. If it learns Photoshop but behaves slightly differently every time, or if reopening a project gives you something inconsistent, then it completely breaks the illusion of being in a real system, because it'll just be hallucinations. So right now, we can probably build extremely good short-term simulators, but the long-term goal is more about how to build a system that can store, reuse, and guarantee behavior over time.
As to why anyone would do that? Well, I guess when the time comes, we will know. But this research is definitely laying down the basic definition and foundation for that, because if you don't define something, how are you even going to achieve it? And if that gap gets solved, then yeah, the way we use computers would fundamentally change. We'd get not just faster apps or better interfaces, but a shift from executing programs to invoking learned capabilities, which sounds pretty fun.
And yeah, that's it for this video. So, if you like today's research paper review, definitely check out my newsletter, where I cover the latest and juiciest papers weekly. On there, you will be completely up to date every week on the cool new AI research, so you don't have to wait for my videos, because I am always slow at making them.
And thank you guys for watching. A big shout out to Spam Match, Chris Leoo, Degan, Robert Zaviasa, Marcelo, Ferraria, Proof and Enu, DX Research Group, Alex Midwest Maker, and many others that support me through Patreon or YouTube. Follow me on Twitter if you haven't and I'll see you in the next