
Detection vs Grounding Explained (Why Free-Form Language Changes Everything)
Added: 2026-05-23

670 views · 182:23 · LearnOpenCV · Original release: 2026-05-20

Object detection identifies a fixed set of object classes and returns a bounding box and class label for each instance (e.g., YOLO, RF-DETR). Visual grounding instead uses free-form language to locate objects, matching abstract words to physical reality. The term traces back to cognitive scientist Stevan Harnad, who used it in 1990 to describe stabilizing language by associating it with concrete visual elements.

[00:00:00]What is the difference between detection, or localization, and grounding? Now, when we do object detection, we have a class of objects that we want to detect, and we basically train the model, and we get a bounding box and the class label associated with it. YOLO and other models, they solve this problem, right?

[00:00:21]RF-DETR, YOLO, etc., they solve this problem. So, that is the detection problem. Now, grounding is a cousin of this problem, and in grounding, you are trying to actually do localization, but based on language.

[00:00:37]So, for example, you could say, "I want the red car in the crowd of cars," right?

[00:00:46]And that localization problem, that is grounding, because it is free-form text, it is matching language to visual reality, right? So, that task is called, you know, grounding. It comes from Stevan Harnad. He was a cognitive scientist, and in the 1990s, he came up with this paper where the word grounding was first used in this context. The idea was that grounding basically means stabilizing something, right? When you ground an electrical circuit, you make it more stable. You won't get shocked and stuff like that. Now, when you ground language with visual reality, you are associating words, which are abstract, with something real, right?

[00:01:27]Words are cooked-up things; a car is a real thing. It's a real physical thing. The word "car" is an abstract thing. So, that process is called grounding, and this comes from 1990s cognitive science, and later, in the 2010s or so, computer vision researchers also started using this phrase to explain how you take language and map it to visual reality, right? So, that's where the word grounding comes from. And that's the difference between object detection, which basically detects a fixed number of classes in a scene, and visual grounding, where you're trying to locate things based on free-form language, right? So, that's the difference. All right, thanks.
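
The contrast described in the transcript can be sketched with a toy example. This is not from the video: the scene, the `SceneObject` type, and the `detect` and `ground` functions are all hypothetical, and the word-matching rule stands in for what a real grounding model would learn from paired text and image features. It only shows the difference in interface: detection returns every object from a fixed class list, while grounding takes a free-form phrase and returns the region it refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    label: str   # class name, e.g. "car"
    color: str   # one visual attribute, used only by the toy grounder
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical scene contents (made-up values for illustration).
SCENE = [
    SceneObject("car", "blue", (10, 40, 110, 100)),
    SceneObject("car", "red", (130, 42, 230, 104)),
    SceneObject("car", "white", (250, 38, 350, 98)),
    SceneObject("person", "red", (360, 20, 400, 120)),
]

# Closed vocabulary: fixed before inference, cannot be extended by a prompt.
FIXED_CLASSES = {"car", "person"}


def detect(scene):
    """Detection: one (label, box) pair per object of a known class."""
    return [(obj.label, obj.box) for obj in scene if obj.label in FIXED_CLASSES]


def ground(scene, phrase):
    """Grounding: boxes of the objects a free-form phrase refers to.

    Toy rule (an assumption, not a real model): any label or color word
    found in the phrase must agree with the object.
    """
    words = set(phrase.lower().split())
    wanted_labels = words & {obj.label for obj in scene}
    wanted_colors = words & {obj.color for obj in scene}
    if not wanted_labels and not wanted_colors:
        return []  # nothing in the phrase could be tied to the scene
    return [
        obj.box
        for obj in scene
        if (not wanted_labels or obj.label in wanted_labels)
        and (not wanted_colors or obj.color in wanted_colors)
    ]


if __name__ == "__main__":
    print(detect(SCENE))  # all four objects, each with its class label
    print(ground(SCENE, "the red car in the crowd of cars"))
```

With the phrase from the transcript, `ground` returns only the box of the red car, whereas `detect` returns all three cars and the person without any way to single one out.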
