Video-language models suffer from 'time blindness' because they process videos by sampling individual frames and extracting spatial features, rather than understanding continuous temporal patterns; this architectural limitation causes them to fail completely (0% accuracy) on tasks requiring purely temporal reasoning, even when humans can achieve 98% accuracy on the same tasks.
Deep Dive
CVPR26: Time Blindness: Why Video-Language Models Can't See What Humans Can?
Welcome, everyone. I am Ujjwal, and today I'm going to share our findings about the current state of video-language models and present a benchmark for a failure mode that we call time blindness. We evaluated top-tier video-language models, including closed-source models like GPT-4 and Gemini and a lot of open-source models, and we discovered that they are completely blind to purely temporal patterns. While human evaluators can easily recognize these patterns with 98% accuracy, every single state-of-the-art model that we tested scored exactly 0%.
So let's explore why this massive gap exists.
To understand this failure, we have to look at how today's video-language models are built. Currently, VLMs do not truly watch a video as a continuous stream. Instead, they sample individual frames, extract spatial features from those static images, and then rely on sequential language models to basically guess what happened in the time between them. They are attempting to understand time purely through a spatial lens, relying on obvious static visual cues rather than genuine temporal reasoning.
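To make the frame-sampling step concrete, here is a minimal sketch of uniform frame sampling. It is an illustration only: the function name `sample_frame_indices`, the clip length, and the choice of 8 frames are assumptions, not details given in the talk, and real models differ in how they pick frames.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    Hypothetical helper: it mimics uniform sampling, one common way
    of reducing a video to a handful of still images.
    """
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame in the middle of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


# A 10-second clip at 60 fps, reduced to 8 still images (illustrative numbers).
indices = sample_frame_indices(num_frames=600, num_samples=8)
gaps = [b - a for a, b in zip(indices, indices[1:])]

print(indices)  # [37, 112, 187, 262, 337, 412, 487, 562]
print(gaps)     # [75, 75, 75, 75, 75, 75, 75]
```

With these numbers, 75 frames pass between consecutive samples, so any signal that lives only in frame-to-frame motion is gone before the vision encoder sees anything.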
To prove this, we created a benchmark, a novel dataset called SpookyBench.
The premise is simple. We built a benchmark consisting entirely of videos made of structured, dynamic noise.
Hidden within this noise are words, shapes, objects, and moving videos. The crucial catch is that there is absolutely no spatial information available in any single frame; the content reveals itself only through the temporal movement of the noise. If we observe any single frame in isolation, we see only noise. Let me show you an example.
This is an image of a deer, which you can see in the video, and when I pause it, it completely disappears. The same is the case with this word video, where you can see the word "noise" at the center, but when we pause it, everything disappears.
There are no edges, no colors, no spatial clues for traditional vision transformers to latch on to. So how does the hidden content emerge? We used a principle of opposing noise movement. We apply a content mask, like the shape of an object or a word, and force the foreground noise to move in one direction while the background noise moves in the other direction. The human brain naturally uses the Gestalt principle of common fate, grouping the pixels that move together, which makes the invisible object suddenly pop out. So to human eyes, as long as the video is playing, the content is visible because of the opposing movements, but to video-language models that's not the case.

So we tested humans and machines on this data. When we showed humans these videos, they easily identified the hidden content, such as a cat, a word, or a video, with 98% accuracy. They maintained excellent accuracy at higher frame rates like 60 fps or 30 fps, and even at lower frame rates, down to about 10 fps, the performance was acceptable; at least for words, they were able to get it easily. But as we dropped the frame rate even lower, to, let's say, 1 fps, the performance was 0% for humans as well.

We then tested 17 state-of-the-art VLMs, and even when we provided chain-of-thought prompts that described exactly how the motion was going to appear in the video, they failed miserably. They usually respond with hallucinations, resulting in a flat 0% accuracy. They give out random words, like "there is a cat in the video", and even if we have some word moving in the video, let's say "noise", the model still says something random, and most of the time that word remains the same across videos.
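The opposing-motion construction described above can be sketched in a few lines of NumPy. This is a toy reconstruction under stated assumptions, not the SpookyBench generation code: binary noise, a rectangular mask, a shift of exactly one pixel per frame, and the names `make_noise_video` and `recover_mask` are all hypothetical choices made for illustration.

```python
import numpy as np


def make_noise_video(mask: np.ndarray, num_frames: int, seed: int = 0) -> np.ndarray:
    """Build a (T, H, W) video of binary noise with opposing motion.

    Noise inside the mask scrolls right by one pixel per frame;
    noise outside the mask scrolls left by one pixel per frame.
    """
    rng = np.random.default_rng(seed)
    fg = rng.integers(0, 2, size=mask.shape, dtype=np.uint8)
    bg = rng.integers(0, 2, size=mask.shape, dtype=np.uint8)
    frames = [
        np.where(mask, np.roll(fg, t, axis=1), np.roll(bg, -t, axis=1))
        for t in range(num_frames)
    ]
    return np.stack(frames)


def recover_mask(video: np.ndarray) -> np.ndarray:
    """Recover the mask from motion alone.

    For each pixel, check whether the next frame agrees better with the
    previous frame shifted right (foreground) or shifted left (background).
    """
    prev, nxt = video[:-1], video[1:]
    agree_right = (nxt == np.roll(prev, 1, axis=2)).mean(axis=0)
    agree_left = (nxt == np.roll(prev, -1, axis=2)).mean(axis=0)
    return agree_right > agree_left


# Hidden content: a rectangle standing in for a word or object shape.
mask = np.zeros((32, 64), dtype=bool)
mask[8:24, 20:44] = True
video = make_noise_video(mask, num_frames=30)

# A single frame: pixel statistics inside and outside the mask look alike.
single_frame_gap = abs(video[0][mask].mean() - video[0][~mask].mean())

# The full sequence: the mask is recoverable from motion direction.
recovered = recover_mask(video)
iou = (recovered & mask).sum() / (recovered | mask).sum()

print(f"single-frame brightness gap: {single_frame_gap:.3f}")
print(f"mask IoU recovered from motion: {iou:.3f}")
```

In this toy setup every individual frame is independent coin-flip noise regardless of the mask, so no per-frame feature can reveal the shape, while a simple comparison across consecutive frames recovers it almost entirely. The only uncertain pixels are the columns where foreground and background meet.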
You might ask: is this just an out-of-distribution problem? The models were not trained on these videos, so maybe that's the reason. The answer is no. We extensively fine-tuned two different state-of-the-art models, InternVL2.5-8B and Qwen2.5-VL, for up to 30 epochs on the SpookyBench dataset. We tested them with 60-frames-per-second video without dropping a single frame, and even after training directly on this task, the accuracy was 0%. This proves that we are dealing with a fundamental architectural inability, not a lack of data or an out-of-distribution problem.
Returning to the VLM architecture, we are pinpointing the bottleneck. During frame sampling there is a massive loss of temporal information, and that is one of the reasons for failure on purely temporal datasets. The visual encoder then applies a strong spatial bias, focusing heavily on objects and scene layouts while almost completely ignoring motion patterns and event causality.
This creates a severe coherence gap, where temporal features are simply lost before they ever reach the language model.
The failure goes even deeper than identification. We trained self-supervised models like V-JEPA 2 and DINOv3 on a simple binary classification task.
The task was just to tell whether there is any foreground noise in the video. The models guess at random: as the graph shows, their validation accuracy hovers around 50% and the cross-entropy loss flatlines at about 0.7, which is what random guessing gives. The fact that they don't even overfit to the training data definitely proves that purely temporal signals are structurally inaccessible to architectures that process frames individually.
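As a quick check on what a chance-level loss plateau looks like: a binary classifier that outputs probability 0.5 for every clip has a cross-entropy of ln 2, roughly 0.693. The sketch below assumes natural-log cross-entropy; the helper name is hypothetical.

```python
import math


def binary_cross_entropy(p_true_class: float) -> float:
    """Cross-entropy (natural log) for one example, given the
    probability the model assigns to the correct class."""
    return -math.log(p_true_class)


# A model that always answers "50/50" pays the same loss on every example.
chance_loss = binary_cross_entropy(0.5)
print(round(chance_loss, 3))  # 0.693
```

A loss curve stuck near this value, together with roughly 50% accuracy on a two-class task, is the signature of a model that has extracted no usable signal.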
However, we found a way to unlock their performance. We precomputed the motion boundaries using classical optical flow and visually overlaid those boundaries in red and green directly onto the noisy frames. Suddenly, for models like Qwen2-VL and others, we saw the performance jump to 51% accuracy. This confirms our hypothesis that the models can understand the content perfectly well, but only if we manually translate the temporal motion information into explicit spatial features for them.
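The idea of turning motion into a spatial cue can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the pipeline from the talk: instead of a classical optical-flow algorithm it uses a shift-agreement test that only works for synthetic one-pixel-per-frame horizontal motion, and it tints whole regions by direction rather than drawing only the boundaries. All function names are hypothetical.

```python
import numpy as np


def make_noise_video(mask, num_frames, seed=0):
    """Toy opposing-motion video: foreground scrolls right, background left."""
    rng = np.random.default_rng(seed)
    fg = rng.integers(0, 2, size=mask.shape, dtype=np.uint8)
    bg = rng.integers(0, 2, size=mask.shape, dtype=np.uint8)
    return np.stack([
        np.where(mask, np.roll(fg, t, axis=1), np.roll(bg, -t, axis=1))
        for t in range(num_frames)
    ])


def motion_direction(video):
    """Per-pixel direction: +1 moving right, -1 moving left, 0 undecided."""
    prev, nxt = video[:-1], video[1:]
    agree_right = (nxt == np.roll(prev, 1, axis=2)).mean(axis=0)
    agree_left = (nxt == np.roll(prev, -1, axis=2)).mean(axis=0)
    return np.sign(agree_right - agree_left).astype(int)


def overlay_motion(frame, direction, alpha=0.6):
    """Blend red over right-moving pixels and green over left-moving ones."""
    gray = frame.astype(float)
    rgb = np.stack([gray, gray, gray], axis=-1)
    red = np.array([1.0, 0.0, 0.0])
    green = np.array([0.0, 1.0, 0.0])
    rgb[direction > 0] = (1 - alpha) * rgb[direction > 0] + alpha * red
    rgb[direction < 0] = (1 - alpha) * rgb[direction < 0] + alpha * green
    return rgb


mask = np.zeros((32, 64), dtype=bool)
mask[8:24, 20:44] = True
video = make_noise_video(mask, num_frames=30)

direction = motion_direction(video)
overlay = overlay_motion(video[0], direction)

# After the overlay, one still frame carries the shape as plain color.
red_inside = overlay[mask][:, 0].mean()
green_inside = overlay[mask][:, 1].mean()
red_outside = overlay[~mask][:, 0].mean()
green_outside = overlay[~mask][:, 1].mean()
print(red_inside > green_inside, green_outside > red_outside)
```

Once the direction map is painted onto a frame, the hidden shape becomes an ordinary colored region that a frame-based encoder can pick up, which is the sense in which temporal information has been translated into a spatial feature.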
So why is time blindness a critical vulnerability?
In the real world, many vital signals exist purely over time. In medical diagnostics, pre-ictal EEG patterns for predicting seizures, or microcalcifications in imaging, might only be detectable through a temporal sequence, like taking multiple scans over time, and that can be an issue for the current state of VLMs. Security systems could miss covert behaviors if they are expressed only in subtle temporal dynamics. Ultimately, SpookyBench proves that modern AI is highly vulnerable to temporal adversarial attacks and cannot truly perceive motion-based meaning. We are going to present our results at CVPR in Denver on the 7th of June. So say hi if you are there and want to talk about this more. Thanks.