Tokenize Time: Inside the Interaction Model Architecture.
Added: 2026-05-14

362 views · 32 · 14:20 · engineerprompt · Original Release: 2026-05-13

Thinking Machines has developed an interaction model architecture that tokenizes time into continuous 200-millisecond micro-turns, enabling real-time simultaneous tracking of text, audio, and video while maintaining conversation state and generating responses. This encoder-free early fusion approach differs from traditional multi-turn conversational systems by processing inputs in continuous time-aligned chunks rather than discrete turns, allowing the model to manage dialogue, perform multimodal interjections, and measure elapsed time without external tools. The system uses a 276-billion-parameter mixture-of-experts model with 12 billion active parameters, in a two-tier architecture where a fast interaction model handles real-time responses while offloading complex reasoning tasks to a more capable background model running asynchronously.

[00:00:00]Okay, so we need to talk about interaction models from Thinking Machines because they are genuinely impressive, and I think they have the possibility of really changing the way we interact with computers. Unfortunately, I haven't seen much coverage on YouTube.

[00:00:15]So I thought I'd create this video. My focus is going to be more on the technical details because that is what is interesting to me, but there are some pretty interesting demos that I'm going to show you in this video. But first, let's address the elephant in the room.

[00:00:32]Some people are comparing this with GPT-4o, which was released by OpenAI a couple of years ago. Now we have a second version of that. But I think the technology that Thinking Machines is using is very different. The way they have trained a unified model that can tokenize time is a very different approach from what things like GPT-4o's advanced voice mode or even Gemini use. And that's what I want to focus on in this video. By the way, the person that you see here in the GPT-4o demo is the same person who is showing the interaction model demo at Thinking Machines. Most of the team is coming from either DeepMind or OpenAI, and I think there are a couple of people from Anthropic, but these are the people who have worked on similar systems in the past, and they are building the next version of these. Okay. So I find this demo pretty interesting, and it also shows you the capabilities of this model. So, let's watch it.

[00:01:34]>> You're going to translate what I say into HR speech. Okay.

[00:01:38]>> Exactly. I'll be reframing everything you say into uplifting professional language instantly.

[00:01:44]>> Hi, Alex. I cannot stand your lateness.

[00:01:47]>> Hi, Alex. We'd love to explore opportunities to enhance your timeliness and morning engagement.

[00:01:52]>> You cannot expect the rest of us to rearrange our workflow. >> Let's collaborate on a schedule that ensures everyone stays aligned and included in those key.

[00:02:00]>> Okay, so a couple of things to notice here. First of all, the model is interacting with the user in real time, and as the user is talking, the model is able to not only produce responses but also keep track of what the user is saying, right? So it's not a multi-turn conversation where the user asks something, the model responds, then the user asks a follow-up and the model responds again. This is very different, and let me explain how it works. But before that, let's look at what a usual full-duplex voice agentic system looks like. Usually, people combine multiple different components to create these interactive systems that feel like real-time responses from a single model, but in fact there are multiple different components. In a traditional system, you would have voice activity detection, which detects whenever a user is speaking or starts speaking, right? Then you have a speech-to-text model that converts your speech into text, which is sent to the LLM, which is basically the brain of the system, and then you have a text-to-speech component. On top of that is a session management or orchestrator layer, and then there are going to be external services, which basically keep track of when the user starts talking. So normally there are a lot of different moving parts, and that's what enables voice multi-turn conversational systems.
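The cascaded pipeline described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation: the class name `CascadedVoiceAgent`, its field names, and the stub components are all hypothetical, and each stage is a plain callable standing in for a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CascadedVoiceAgent:
    """Toy cascade: VAD -> speech-to-text -> LLM -> text-to-speech (all names hypothetical)."""
    detect_speech: Callable[[bytes], bool]   # voice activity detection
    transcribe: Callable[[bytes], str]       # speech-to-text
    generate: Callable[[List[str]], str]     # the LLM, fed the running history
    synthesize: Callable[[str], bytes]       # text-to-speech
    history: List[str] = field(default_factory=list)  # session state held by the orchestrator

    def handle_turn(self, audio: bytes) -> Optional[bytes]:
        # The orchestrator only wakes the rest of the stack when VAD fires.
        if not self.detect_speech(audio):
            return None
        text = self.transcribe(audio)
        self.history.append("user: " + text)
        reply = self.generate(self.history)
        self.history.append("assistant: " + reply)
        return self.synthesize(reply)


# Stub components so the sketch runs without any real models.
agent = CascadedVoiceAgent(
    detect_speech=lambda audio: len(audio) > 0,
    transcribe=lambda audio: audio.decode("utf-8"),
    generate=lambda history: "echo: " + history[-1].split(": ", 1)[1],
    synthesize=lambda text: text.encode("utf-8"),
)
```

Each call to `handle_turn` is one discrete turn: nothing is produced until a whole utterance has passed through every stage, which is the behavior the micro-turn design moves away from.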

[00:03:34]So a simple view would be: the human has an input, the model produces an output, there's a second input, the model produces another output, right? And if you think about it, in this voice system you're talking about an input to the whole system and the whole system generating an output. Thinking Machines took this to the next level: they tokenized time itself, so they have time-aligned micro-turns. Let me explain this by playing this. They're keeping track of text, audio, and video simultaneously. Everything is tokenized into 200-millisecond chunks, and those chunks go into the system. They're calling these micro-turns. So every 200 milliseconds the system checks whether the user has said something that it needs to respond to, and it's able to keep track of the conversation history and simultaneously generate outputs as well. So it's not only tracking inputs, it's also doing internal state management and response generation.
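To make the micro-turn idea concrete, here is a small sketch that buckets timestamped events from several modalities into fixed 200-millisecond windows. Only the 200 ms figure comes from the video; the `MicroTurn` type, the `to_micro_turns` function, and the event format are assumptions made for illustration, and a real system would carry token embeddings rather than strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

MICRO_TURN_MS = 200  # chunk length stated in the video


@dataclass
class MicroTurn:
    index: int
    # modality name -> payloads that arrived inside this window (hypothetical format)
    events: Dict[str, List[str]] = field(default_factory=dict)


def to_micro_turns(events: List[Tuple[int, str, str]], duration_ms: int) -> List[MicroTurn]:
    """Bucket (timestamp_ms, modality, payload) events into fixed 200 ms windows."""
    count = -(-duration_ms // MICRO_TURN_MS)  # ceiling division
    turns = [MicroTurn(index=i) for i in range(count)]
    for timestamp_ms, modality, payload in events:
        if 0 <= timestamp_ms < duration_ms:
            turn = turns[timestamp_ms // MICRO_TURN_MS]
            turn.events.setdefault(modality, []).append(payload)
    return turns


events = [
    (50, "audio", "a0"),
    (120, "video", "frame0"),
    (250, "audio", "a1"),
    (610, "text", "hi"),
]
turns = to_micro_turns(events, duration_ms=800)
for turn in turns:
    print(turn.index, turn.events)
```

Note that the third window is empty but still exists: a window in which nothing happened is itself a turn, which is what lets a model decide to stay silent or to notice that time is passing.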

[00:04:46]Now, this 200-millisecond window makes it almost real time. So what are the different capabilities that this type of time tokenization enables? One of the capabilities is seamless dialogue management. Here's one quick example.

[00:05:03]>> I'm going to tell you a story and whenever you hear an animal word, please count the number immediately.

[00:05:09]>> Gotcha. I'll count them out as you go.

[00:05:12]Let's hear it.

[00:05:13]>> Okay. Last weekend I drove down to South Bay to visit a farm.

[00:05:23]Okay, so now this is the most important part, right? Since it's doing these micro-turns, it's able to keep track of the user's intent. So in this case, it's expecting that the user is going to continue the conversation, and that's why it's not responding.

[00:05:38]On the way there I saw a deer >> and when we got to the farm we watched a demonstration of sheep shearing too.

[00:05:47]>> Then on the way back we >> Right. So it has this internal capability to manage the state of the conversation.

[00:05:55]Now, this is multimodal, and it can do verbal and visual interjections. This one is a really funny demo.

[00:06:03]>> Okay, I'm doing some work. Let me know if I start to slouch.

[00:06:08]>> I've got you. Sit up straight and you'll be golden.

[00:06:13]>> All right. So, >> you're starting to slouch forward. Try pulling your shoulders back. Much better. You're upright again. Like that can strain your neck.

[00:06:21]>> Try keeping a >> Now, one thing they highlight is time awareness, right? If you ask advanced voice mode in ChatGPT to keep track of time, it's not able to do it. It has to use an external tool. But since, as we said, they are tokenizing time itself, the model can measure how many tokens have passed and then actually tell you how much time has passed.
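The arithmetic behind that time awareness is simple if each micro-turn covers a fixed 200 ms, as the video states. The sketch below is only the counting idea; how the model actually represents elapsed time internally is not described in the source, and the function names are hypothetical.

```python
MICRO_TURN_MS = 200  # chunk length stated in the video


def elapsed_seconds(micro_turns_seen: int) -> float:
    """Wall-clock time implied by a count of fixed-length micro-turns."""
    return micro_turns_seen * MICRO_TURN_MS / 1000


def micro_turns_for(seconds: float) -> int:
    """How many micro-turns must pass before `seconds` have elapsed."""
    return round(seconds * 1000 / MICRO_TURN_MS)


print(elapsed_seconds(150))  # 150 chunks of 200 ms
print(micro_turns_for(60))   # chunks in one minute
```

Under this assumption, "tell me when a minute has passed" reduces to counting 300 micro-turns, with no external timer tool involved.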

[00:06:44]Okay. So this is a block diagram of the overall system. There are two different components. One is the interaction model. This is the real-time model that interacts with the user, and it's extremely fast, but it's not the smartest model. It's actually only a 276-billion-parameter model, which is relatively small. However, they created this whole architecture where it can offload tasks to a background model, which runs asynchronously and is much more capable. So if the model is interacting with the user in real time and it needs to do some more reasoning-intensive or knowledge-intensive task, it can just offload it to this background model, which has access to a lot more tools, performs all the operations asynchronously, and then feeds the response back into the interaction model so it can generate responses. You can think about this as basically a hand-off to a more capable model, or routing, and you can use the same concepts in building agentic systems.
Now, in order to support this, they had to really rethink what type of infrastructure they would need, and I think that's one of the most impressive things about this release: they had to completely redesign the whole infrastructure around the model, not just for training but for how they would do inference. This is encoder-free early fusion, right? So rather than processing audio and video through large standalone encoders, they opt for a system with minimal pre-processing. One thing I really like about this blog post, which is essentially a technical research paper, is that they have highlighted previous work in the domain, and in particular that this is built on top of some of the open-source work. They especially highlight models like Moshi from Kyutai. So it's actually a genuinely pleasant paper to read.
In terms of architecture, they're taking the text, tokenizing it, and generating embeddings. Then for videos and images, they're taking 40x40 patches and passing them on to an MLP layer, and for audio, they are computing mel-spectrum features and then creating a bag of embeddings. Everything is passed on to a transformer layer, and this is tokenized into those 200-millisecond token windows. So that's basically the time tokenization.
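The hand-off to a background model described in this segment can be sketched with `asyncio`. This is a toy under stated assumptions, not the system's actual design: `background_model` and `interaction_loop` are hypothetical names, the sleeps are scaled-down stand-ins for real latencies, and the "interaction model" is just a loop that keeps answering while a slower task runs.

```python
import asyncio
from typing import List, Optional


async def background_model(task: str) -> str:
    """Stand-in for the slower, more capable model (hypothetical)."""
    await asyncio.sleep(0.03)  # pretend this is expensive reasoning or tool use
    return f"result for {task!r}"


async def interaction_loop(num_micro_turns: int, offload_at: int) -> List[str]:
    """Keep responding every micro-turn while a background task runs."""
    log: List[str] = []
    pending: Optional[asyncio.Task] = None
    for turn in range(num_micro_turns):
        if turn == offload_at:
            # Hand the hard request off without blocking the real-time loop.
            pending = asyncio.create_task(background_model("hard question"))
            log.append(f"turn {turn}: offloaded")
        elif pending is not None and pending.done():
            # Fold the finished background answer back into the conversation.
            log.append(f"turn {turn}: {pending.result()}")
            pending = None
        else:
            log.append(f"turn {turn}: quick reply")
        await asyncio.sleep(0.02)  # one micro-turn, scaled down from 200 ms
    return log


log = asyncio.run(interaction_loop(num_micro_turns=8, offload_at=1))
for entry in log:
    print(entry)
```

The point of the sketch is only that the loop never waits on the background task: it keeps producing quick replies and picks up the result on whichever later micro-turn finds it ready.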

[00:09:15]Now, for inference, they say that at inference time, 200-millisecond chunks require frequent prefills and decodes of small sizes, each having to meet strict latency requirements. They also say that existing LLM inference libraries are not optimized for frequent small prefills.

[00:09:31]They often have a significant amount of overhead per turn. To address this, they implemented streaming sessions. So this is a new innovation that they had to come up with in order to optimize inference; otherwise it would be way too slow. Now, they haven't released the model that they are showcasing. It's not available through an API yet. They are calling it TML interaction small. And as I said, this is a 276-billion-parameter mixture-of-experts model. This is very different from other frontier labs, because they usually don't share information about the size of the model.
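A toy cost model shows why per-turn overhead matters when turns arrive every 200 ms. The source does not describe how streaming sessions work internally, so everything here is an assumption: the sketch simply contrasts paying a fixed setup cost on every request with paying it once for a long-lived session, and the microsecond figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    setup_us: int = 5_000    # hypothetical fixed overhead per request (scheduling, state lookup)
    per_token_us: int = 100  # hypothetical compute cost per prefilled token


def per_request_cost_us(num_turns: int, tokens_per_turn: int, model: CostModel) -> int:
    """Every micro-turn is a fresh request and pays the setup cost again."""
    return num_turns * (model.setup_us + tokens_per_turn * model.per_token_us)


def streaming_session_cost_us(num_turns: int, tokens_per_turn: int, model: CostModel) -> int:
    """One long-lived session pays the setup cost once, then only per-chunk compute."""
    return model.setup_us + num_turns * tokens_per_turn * model.per_token_us


model = CostModel()
turns, tokens = 50, 10  # 10 seconds of 200 ms micro-turns, 10 tokens each (hypothetical)
print(per_request_cost_us(turns, tokens, model))
print(streaming_session_cost_us(turns, tokens, model))
```

With small chunks, the fixed overhead dominates the per-request total, which is the situation the video says existing inference libraries are not optimized for.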

[00:10:07]But here it's a 276-billion-parameter mixture of experts with 12 billion active parameters.
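As a quick check on what those two numbers imply, the share of parameters that is active can be computed directly. Only the 276 billion and 12 billion figures come from the video; the function name is hypothetical.

```python
TOTAL_PARAMS_B = 276   # total parameters, in billions (from the video)
ACTIVE_PARAMS_B = 12   # active parameters, in billions (from the video)


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of a mixture-of-experts model's parameters that is active."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{fraction:.1%} of parameters active")
```

So only a few percent of the weights take part in any given step, which helps explain how a model of this total size can be used in a real-time loop.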

[00:10:14]Now, compared to some of the other models like DeepSeek-V3, which was, I think, a 600-billion-parameter model, it's relatively small. However, if you look at some of the real-time interaction or omni models in the open-source space, like Moshi from Kyutai or even Qwen Omni, this is a relatively big model. And it's kind of funny that a 276-billion-parameter model is a small model according to Thinking Machines, which shows you that the frontier models we are seeing are orders of magnitude bigger, and that kind of gives a clue, because these people have been working at all the frontier labs. Okay, I want to show you one more example before we look at some of the benchmarks. This is an interesting example where she's going to show her hands and the model is supposed to count. So, let's have a listen.

[00:11:13]>> Hey, watch my hand and give a running number of my fingers. Give me an update whenever I make a change.

[00:11:20]>> Got it. I'll keep track.

[00:11:23]Five fingers up. Two fingers up. 10 fingers up. Nine fingers up.

[00:11:31]>> Okay. Again, since it's tokenizing time, it's able to basically look at this live stream in almost real time. Now, these other models like Gemini or even real-time GPT basically take images at a certain interval. So, if you look at something like GPT realtime high, let's have a listen.

[00:11:54]>> All right, I'll keep an eye on your hand and count out loud whenever your fingers change. I can't see your fingers right now, so the visible count is zero. Bring your hand into the frame and I'll call out the number each time it changes. >> All right. So, it seems like it completely missed it, probably because the interval it's using to take samples somehow misses the hands, or it's not able to recognize them. Right. Let's have a listen to Gemini as well.
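The speaker's guess about sampling intervals can be illustrated with a small sketch. The source does not say how these other systems actually sample video, so the one-second interval below is purely hypothetical; the 200 ms interval matches the micro-turn length from the video, and the function names are made up for the example.

```python
from typing import List


def sample_times_ms(interval_ms: int, duration_ms: int) -> List[int]:
    """Timestamps at which a fixed-interval sampler grabs a frame."""
    return list(range(0, duration_ms, interval_ms))


def is_observed(event_start_ms: int, event_end_ms: int, interval_ms: int, duration_ms: int) -> bool:
    """True if at least one sampled frame lands while the event is visible."""
    return any(
        event_start_ms <= t < event_end_ms
        for t in sample_times_ms(interval_ms, duration_ms)
    )


# A hand gesture visible from 1.2 s to 1.6 s in a 3-second clip.
gesture = (1200, 1600)
print(is_observed(*gesture, interval_ms=1000, duration_ms=3000))  # hypothetical sparse sampler
print(is_observed(*gesture, interval_ms=200, duration_ms=3000))   # one frame per micro-turn
```

A short gesture can fall entirely between two sparse samples, while a sampler that looks every 200 ms catches it at least once.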

[00:12:22]>> I can't see your hand yet. Let me know when you're ready to start.

[00:12:25]>> All right. Same issue. And if you look at the other demos, you're going to see similar issues. Okay. So, that brings us to the benchmarks. Now, there's only one benchmark that actually measures true interaction. The other benchmarks don't really measure true interactivity between the models. But overall, I think it's a pretty impressive model. When the model is released, we're going to see how good it is in real-world scenarios. Here's a plot which measures intelligence versus interaction quality.

[00:12:58]So this TML small scores really high both on the interaction axis and on the intelligence axis. And then here's another one, which is intelligence versus responsiveness. Again, it seems to be pretty high on both axes. Now, again, these are self-reported results. There are some benchmarks that they have introduced themselves. We will need to wait for external validation. But overall, I'm actually pretty happy with what I'm seeing here, because not only did they share a lot more technical details than any frontier lab, but they also openly acknowledge some of the work they have used as a foundation to build on top of.

[00:13:35]I definitely think that it's a very interesting release, especially being the first product out of Thinking Machines, and based on the people who are still there, I think it's a very strong team. I will be looking out for their releases, because I think they are going to play a very important role, especially since they're not releasing an LLM; they are trying to completely change the paradigm of how we interact with these computer systems. Anyway, do let me know your thoughts.

[00:14:07]What do you think about this release?

[00:14:08]Have a look at the demos and tell me if you are impressed or not. I hope you found this video useful. Thanks for watching and as always, see you in the next one.

#prompt engineering #Prompt Engineer #LLMs #AI #artificial Intelligence
