Object detection identifies a fixed set of object classes and returns bounding boxes with class labels (e.g., YOLO, RF-DETR). Visual grounding instead locates objects from free-form language by matching abstract words to physical reality. The term "grounding" in this sense traces to cognitive scientist Stevan Harnad, who used it in 1990 to describe stabilizing language by associating it with concrete things in the world.
Deep dive
Detection vs Grounding Explained (Why Free-Form Language Changes Everything)
What is the difference between detection (or localization) and grounding? When we do object detection, we have a fixed set of object classes that we want to detect. We train the model on those classes, and for each object it finds we get a bounding box and the class label associated with it. YOLO, RF-DETR, and similar models solve this problem. So, that is the detection problem. Now, grounding is a cousin of this problem: in grounding, you are still doing localization, but based on language.
For example, you could say, "I want the red car in the crowd of cars."
That localization problem is grounding: because the query is free-form text, the task is matching language to visual reality. The name comes from Stevan Harnad, a cognitive scientist, whose 1990 paper first used the word "grounding" in this context. The idea is that grounding basically means stabilizing something. When you ground an electrical circuit, you make it more stable, so you won't get shocked. When you ground language in visual reality, you are associating words, which are abstract, with something real. Words are made-up things. A car is a real, physical thing; the word "car" is an abstract thing.
That process of tying the word to the thing is called grounding. The term comes from 1990s cognitive science, and later, in the 2010s or so, computer vision researchers started using it to describe how you take language and map it to visual reality. So, that's where the word grounding comes from. And that's the difference: object detection detects a fixed number of classes in a scene, while visual grounding locates things based on free-form language. All right, thanks.
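To make the contrast concrete, here is a toy sketch in Python of the "red car in a crowd of cars" example. Everything in it is an assumption for illustration: the `Region` type, the hand-annotated `SCENE`, the `attributes` tags, and the `detect` and `ground` functions are hypothetical and do not come from YOLO, RF-DETR, or any real library. Real grounding models learn to match text to image regions; the word-overlap scoring below only stands in for that matching so the difference in interface is visible.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One localized object in a scene (hypothetical structure)."""
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    label: str                      # class name from the fixed vocabulary
    attributes: frozenset[str]      # made-up descriptive tags, e.g. {"red"}


# Hand-annotated toy scene; a real system would get this from an image.
SCENE = [
    Region((10, 40, 110, 100), "car", frozenset({"blue"})),
    Region((130, 42, 230, 102), "car", frozenset({"red"})),
    Region((250, 38, 350, 98), "car", frozenset({"white"})),
    Region((60, 10, 90, 40), "person", frozenset({"tall"})),
]

# Detection: the set of classes is fixed ahead of time.
FIXED_CLASSES = {"car", "person"}


def detect(scene: list[Region], wanted_class: str) -> list[Region]:
    """Return every region of one class; anything outside the fixed set is an error."""
    if wanted_class not in FIXED_CLASSES:
        raise ValueError(f"unknown class: {wanted_class!r}")
    return [r for r in scene if r.label == wanted_class]


def ground(scene: list[Region], phrase: str) -> Region | None:
    """Return the region whose label and tags best overlap a free-form phrase.

    Word overlap is a stand-in for the learned text-to-region matching
    that an actual grounding model performs.
    """
    words = set(phrase.lower().split())

    def score(region: Region) -> int:
        return len(words & ({region.label} | region.attributes))

    best = max(scene, key=score)
    return best if score(best) > 0 else None


if __name__ == "__main__":
    print(len(detect(SCENE, "car")), "cars detected")
    print("grounded box:", ground(SCENE, "the red car").box)
```

In this sketch, `detect` can only answer "where are the cars?" and returns all of them, while `ground` accepts an arbitrary phrase and returns the single region that best matches it. Asking `detect` for `"red car"` raises an error, because that phrase is not one of the fixed classes.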