Why benchmark scores lie: eval contamination + Goodhart's law
Added: 2026-04-29

486 views • 41:00 • AdamRosler • Original Release: 2026-04-27

Public AI benchmarks such as MMLU and HumanEval are unreliable guides to capability because of two structural problems. The first is data leakage: models train on internet-scraped content, so test questions end up in the training data. The second is Goodhart's law: once a metric becomes the goal, labs optimize for the metric rather than for true capability. As a result, a model can score highly on a public benchmark while performing poorly on an equivalent private test. The solution is to build small, private held-out evaluation sets from real user prompts drawn from your actual workload, and never to publish them. These are the only metrics that truly track the capabilities you care about.
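
To make the two ideas in the summary concrete, here is a minimal Python sketch, not taken from the video. It assumes a model is any callable from prompt to answer, that answers are graded by exact match, and that leakage is approximated by word n-gram overlap between a test item and a sample of training text. The names `ngrams`, `overlap_fraction`, `accuracy`, and `public_private_gap` are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, Iterable

Model = Callable[[str], str]          # assumption: a model maps a prompt to an answer
Examples = list[tuple[str, str]]      # (prompt, expected answer) pairs


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus sample.

    A value near 1.0 suggests the item may have leaked into training data.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def accuracy(model: Model, examples: Examples) -> float:
    """Exact-match accuracy of the model over (prompt, expected) pairs."""
    if not examples:
        return 0.0
    correct = sum(model(prompt).strip() == expected.strip()
                  for prompt, expected in examples)
    return correct / len(examples)


def public_private_gap(model: Model, public: Examples, private: Examples) -> float:
    """Public accuracy minus private accuracy; a large positive gap is a warning sign."""
    return accuracy(model, public) - accuracy(model, private)


if __name__ == "__main__":
    # Toy "model" that has memorized the public set and knows nothing else.
    memorized = {"2+2?": "4", "capital of France?": "Paris"}
    toy_model = lambda prompt: memorized.get(prompt, "unknown")

    public_set = list(memorized.items())
    private_set = [("3+5?", "8"), ("capital of Peru?", "Lima")]  # never published

    print(accuracy(toy_model, public_set))    # 1.0
    print(accuracy(toy_model, private_set))   # 0.0
    print(public_private_gap(toy_model, public_set, private_set))  # 1.0
```

The toy model is deliberately extreme: it scores perfectly on the public items it has seen and fails every private one, which is the gap the summary warns about. In practice the private set would hold real prompts from your own workload, and exact match would likely be replaced by a grader suited to the task.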
