METR's benchmark, which measures AI capabilities by timing how long human experts take to complete tasks that AI can complete with 50% reliability, has reached its measurement limit at 16 hours with Claude Mythos, demonstrating that even well-designed benchmarks can become unreliable when AI capabilities exceed the measurement instruments' capacity.
Deep Dive
Day 2: METR again
Yesterday, METR ran out of ways to measure the capabilities of the newest models.
Hi, I'm Ionut. I'm a freshman at MIT and I work on AI safety. Quick recap: METR benchmarks frontier AI models based on how long it takes human experts to complete tasks that the AI can complete with 50% reliability.
At the beginning of 2020, it was about 4 seconds. At the beginning of this year, it was 14 hours. Back in March, Anthropic gave METR access to its new unreleased model, Claude Mythos.
And yesterday, METR released the result: at least 16 hours. But with a footnote. Above 16 hours, METR doesn't trust its measurement anymore. Why is that? Because they don't have enough tasks that are that long. Think about that again. The benchmark didn't say the model failed. It said, "We can't measure it anymore."
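To make the "at least 16 hours" footnote concrete, here is a toy sketch of how a 50% time horizon can turn into a lower bound when the task suite runs out of long tasks. This is not METR's actual procedure. The function name `horizon_estimate`, the bucketed input format, and the log-time interpolation are all assumptions made for illustration, and the numbers in the usage lines are made up.

```python
import math

def horizon_estimate(results):
    """Toy 50% time-horizon estimate (hypothetical, not METR's method).

    results: list of (human_hours, success_rate) pairs, one per
    task-length bucket. Returns a (kind, hours) tuple.
    """
    results = sorted(results)
    # Look for the first place where success drops from >= 50% to < 50%.
    for (t0, p0), (t1, p1) in zip(results, results[1:]):
        if p0 >= 0.5 > p1:
            # Interpolate in log-time between the two neighbouring buckets.
            frac = (p0 - 0.5) / (p0 - p1)
            log_t = math.log(t0) + frac * (math.log(t1) - math.log(t0))
            return ("estimate", math.exp(log_t))
    longest_t, longest_p = results[-1]
    if longest_p >= 0.5:
        # The model still passes the longest tasks available, so the
        # suite can only give a lower bound.
        return ("at_least", longest_t)
    # The model is below 50% on every bucket, so only an upper bound exists.
    return ("at_most", results[0][0])

# Made-up success rates: the crossing lies inside the measured range.
inside = horizon_estimate([(1, 0.9), (4, 0.7), (16, 0.3)])
# Made-up success rates: no crossing, so only a lower bound is available.
saturated = horizon_estimate([(1, 0.9), (4, 0.7), (16, 0.6)])
```

In the second call the model never drops below 50%, so the only honest answer is a bound at the longest task length, which mirrors the situation described above.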
The doubling trend has been steady for 6 years, and officially, the measuring instruments are starting to give out before the trend actually does.
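The two figures quoted in the recap (about 4 seconds at the start of 2020, 14 hours at the start of this year) imply a doubling time. The back-of-the-envelope sketch below works it out, assuming exactly six years between the two points and a clean exponential in between. The variable names are illustrative, and the inputs are the rounded numbers as spoken, not METR's published data.

```python
import math

# Figures as quoted in the transcript (rounded, spoken values).
start_seconds = 4.0          # time horizon at the start of 2020
end_seconds = 14 * 3600.0    # time horizon at the start of this year (14 hours)
elapsed_months = 6 * 12      # assumption: exactly six years apart

# Number of doublings needed to go from the first figure to the second.
doublings = math.log2(end_seconds / start_seconds)

# Average time per doubling under a pure exponential trend.
doubling_months = elapsed_months / doublings

print(f"doublings: {doublings:.1f}")
print(f"months per doubling: {doubling_months:.1f}")
```

Under these assumptions the horizon doubled between 13 and 14 times, or roughly once every five months.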
And that's what I'll leave you with for day two.