METR's benchmark measures AI capability by how long human experts take to complete tasks that an AI can complete with 50% reliability. With Claude Mythos it has reached its measurement limit at 16 hours, showing that even a well-designed benchmark becomes unreliable once AI capabilities exceed what the instrument can measure.
Day 2: METR again
Yesterday, METR ran out of ways to measure the capabilities of the newest models.
Hi, I'm Ionut. I'm a freshman at MIT and I work on AI safety. Quick recap. METR benchmarks frontier AI models based on how long it takes human experts to complete tasks that the AI can complete with 50% reliability.
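To make the "50% reliability" idea concrete, here is a minimal sketch of one way such a time horizon could be estimated: fit a logistic curve of success against the log of human completion time, then read off where the predicted success rate crosses 50%. The task results below are made up for illustration, and the function name `fit_half_horizon` is hypothetical. This is not METR's code or data.

```python
import math

def fit_half_horizon(results, steps=5000, lr=0.1):
    """Estimate the task length (in seconds) at which success probability is 50%.

    results: list of (human_seconds, succeeded) pairs. Hypothetical input format.
    Fits p = sigmoid(a + b * x) by plain gradient descent, where x is the
    mean-centered log2 of the human completion time.
    """
    xs = [math.log2(seconds) for seconds, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]
    n = len(xs)

    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n

    # The curve crosses 50% where a + b * x = 0, i.e. x = -a / b.
    return 2 ** (mean_x - a / b)

# Made-up results: the model succeeds on short tasks and fails on long ones.
results = [
    (60, True), (60, True),
    (600, True), (600, True),
    (3600, True), (3600, False),
    (14400, True), (14400, False),
    (57600, False), (57600, False),
]

horizon = fit_half_horizon(results)
print(f"Estimated 50% time horizon: {horizon / 3600:.1f} hours")
```

Notice that the estimate is only pinned down because the made-up data includes tasks the model fails. If every task in the suite is shorter than the model's real horizon, there are few or no failures and the crossing point is poorly constrained.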
At the beginning of 2020, it was about 4 seconds. At the beginning of this year, it was 14 hours. Back in March, Anthropic gave METR access to its new unreleased model, Claude Mythos.
And yesterday, METR released the result: at least 16 hours. But with a footnote. Above 16 hours, METR doesn't trust its measurement anymore. Why is that? Because they don't have enough tasks that are that long. Think about that again. The benchmark didn't say the model failed. It said, "We can't measure it anymore."
The doubling trend has been steady for 6 years, and now the measuring instruments are officially starting to give out before the trend does.
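As a rough check on that trend, the two figures quoted above (about 4 seconds at the start of 2020, 14 hours at the start of this year) imply a doubling time. The sketch below assumes the two measurements are exactly 6 years apart, which is inferred from the "6 years" remark rather than stated as a date. Under that assumption it comes out to roughly 13.6 doublings, or one doubling about every 5.3 months.

```python
import math

start_seconds = 4          # time horizon at the beginning of 2020 (from the transcript)
end_seconds = 14 * 3600    # 14 hours, at the beginning of this year (from the transcript)
years_elapsed = 6          # assumption: the two measurements are 6 years apart

# How many times did the horizon double to go from 4 seconds to 14 hours?
doublings = math.log2(end_seconds / start_seconds)

# Average time per doubling over the whole period.
months_per_doubling = years_elapsed * 12 / doublings

print(f"Doublings: {doublings:.1f}")
print(f"Months per doubling: {months_per_doubling:.1f}")
```

This is an average over the whole period and says nothing about whether the pace was constant within it.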
And that's what I'll leave you with for day two.