LocateAnything: NVIDIA’s New AI Sees EVERYTHING: Run Locally
Added: 2026-06-02

1,526 views · 778:11 · fahdmirza · Original Release: 2026-06-01

LocateAnything is NVIDIA's 3-billion-parameter vision-language model for precise visual grounding: it locates specific objects in images and videos from natural language descriptions. Trained on 12 million images, it handles object detection, OCR, GUI interaction, and coordinate-based pointing, and runs locally on consumer hardware using approximately 8-12 GB of VRAM.

[00:00:01]Meet LocateAnything from NVIDIA. With a single click, I was able to identify all the pieces of sushi in this single image. I can do the same with videos, among various other tasks. In this video, we are going to install this model locally and test it out. This is Fahd Mirza and I welcome you to the channel.

[00:00:24]This new model from NVIDIA seems quite impressive. It's a 3-billion-parameter vision-language model that acts like a superpowered Where's Waldo engine for AI. Instead of just recognizing what's in an image, it pinpoints exactly where objects are.

[00:00:43]That could mean finding a specific car in a traffic jam, as you can see in the examples in this image, locating a submit button on a cluttered website, or reading text from a complex document.

[00:00:57]Trained on a massive dataset of 12 million images, this one is a generalist that handles everything from robotics and autonomous driving to automated data labeling, making it incredibly versatile for developers who need their AI to understand spatial relationships in the real world. Let's get the installation underway, and we will talk more about its architecture and what exactly this decoding paradigm evolution is.

[00:01:27]I'm going to use this Ubuntu system. I have one GPU card, an NVIDIA RTX A6000 with 48 GB of VRAM. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description, with a coupon code for 50% off a range of GPUs.

[00:01:46]I'm going to start by installing all the prerequisites and you can see that I'm already in my conda environment. It is going to take a couple of minutes.

[00:01:55]Everything is installed. Now I'm just going to launch this app.py, which is primarily a Gradio interface on top of the code they have shared in the Hugging Face model card. I will drop the link to it in the video's description. So, let me run this. It is going to download the model the first time.

[00:02:15]And the demo is now running. Let me open it in the browser.

[00:02:20]And it should be running on our localhost at port 7860.
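If the browser tab does not load, a quick way to confirm that the demo is actually listening is to try a TCP connection to the port. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is made up for illustration, and the defaults simply mirror the localhost address and port 7860 mentioned above.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` while app.py is running should return True; a False result suggests the server has not finished starting or is bound to a different port.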

[00:02:24]There you go. So, it is running at the moment.

[00:02:27]Let's do the detection first. There are various other tasks we can do, which I will go through one by one. Let me upload an image from my local system.

[00:02:37]So, let me click here somewhere.

[00:02:41]Just a second.

[00:02:45]That button wasn't working, so I have just dragged and dropped the image, and now it is working. So, first up, I'm just going to go with the image detection task type. I will run the inference. The first time you run this, it takes a bit of time as it loads the model.

[00:03:03]And you can see that the model size is just under 5 GB. There are two shards.

[00:03:10]And the VRAM consumption is just over 8 GB, not bad at all.
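To read the same VRAM figure from a script instead of a terminal, one option is to call `nvidia-smi` with its CSV query flags and parse the result. This is a sketch, not part of the demo: the function names are made up for illustration, and `query_gpu_memory` assumes an NVIDIA driver with `nvidia-smi` on the PATH.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' pairs (in MiB), one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for per-GPU memory use; requires an NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

With `nounits`, the values are plain integers in MiB, so a reading just over 8 GB would show up as a number a little above 8,000 in the first field.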

[00:03:15]Let's go here.

[00:03:16]And I didn't give it any category, so it has just detected everything in the image: all the cars, the bus, and everything else.

[00:03:27]So, for instance, I will just remove the rest of it. I will just say car here.

[00:03:33]Let's now run the inference. Let's wait for it.

[00:03:38]And it has detected all the cars in the front, not at the back.

[00:03:43]It should have done the back ones, too, but anyway. Let's do people. If you look at the left-hand side, there are people.

[00:03:54]Let's see if it is able to detect them.

[00:03:56]There you go, it is.

[00:03:58]On this side, too, but not these ones.

[00:04:00]These are not as visible.
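In the demo, the category is an input to the model, so "car" or "people" changes what the model looks for. The effect on the output can still be pictured as a filter over labeled boxes. The sketch below is purely illustrative: the `Detection` type, the pixel box layout, and `filter_by_category` are assumptions, not the model's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    # Assumed layout: (x_min, y_min, x_max, y_max) in pixels.
    box: tuple[float, float, float, float]


def filter_by_category(detections, category=""):
    """Keep detections whose label matches; an empty category keeps everything."""
    wanted = category.strip().lower()
    if not wanted:
        return list(detections)
    return [d for d in detections if d.label.lower() == wanted]
```

An empty category mirrors the first run, where everything in the scene was boxed; passing "car" mirrors the second run.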

[00:04:04]Okay, next up, let's do some grounding here. I'm again just going to drag and drop.

[00:04:10]So, in this image, and by the way, all of these images are AI generated.

[00:04:15]I'm going to select the task of grounding.

[00:04:18]So, there are all these tasks. Detection simply means it finds and draws boxes around all instances of a specific object.

[00:04:26]Grounding means it locates specific objects or regions based on a natural language description. Let's say I just ask it for the red car. There are two cars here, so let's see if it grounds just this red car.

[00:04:45]Let's wait. There you go. So, this is a red car.

[00:04:48]Let's see if it is able to do the silver car.

[00:04:54]It is, you see? So, this is the beauty of this model.

[00:04:58]You can also do OCR.

[00:05:00]In OCR, it just identifies and draws boxes around all text visible in the scene, regardless of what the word says.

[00:05:08]So, for example, let me just quickly show you the OCR.

[00:05:13]So, this is a handwritten text. I'm just going to run the inference on it to see if it is able to do the OCR. I haven't given it any description here.

[00:05:22]There you go. So, it has drawn boxes around each and every piece of text.

[00:05:27]And you see, not on this line, but only on the text.

[00:05:31]Which is pretty good.

[00:05:35]It has even labeled it. And there are various other options which it has given, but I'm not going to go into those.

[00:05:44]The next task is GUI. So, I'm just going to select it and run inference. I have already given it the text "search". I just want to find this search item in this GUI, and it has selected that. So, you can build a GUI agent on top of it by pinpointing that exact element on the screen, and then you can build your tooling around that.
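One small piece of such tooling is turning the box around a GUI element into a click position. The sketch below assumes the box is given as (x_min, y_min, x_max, y_max) in screen pixels, which is an assumption for illustration rather than the model's documented output format; the function names are also made up.

```python
def box_center(box):
    """Return the integer center of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return (round((x_min + x_max) / 2), round((y_min + y_max) / 2))


def click_target(box, screen_width, screen_height):
    """Center of the box, clamped so the click stays on the screen."""
    x, y = box_center(box)
    x = min(max(x, 0), screen_width - 1)
    y = min(max(y, 0), screen_height - 1)
    return (x, y)
```

The resulting coordinate could then be handed to whatever automation layer performs the click; the clamp guards against boxes that extend past the edge of the screen.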

[00:06:09]And similarly, we have this task type of pointing, where I have just selected this image, and instead of a box, it is going to predict a single precise (x, y) coordinate to pinpoint a location or object in the image. As you can see, I have just pointed it to the vegetables.
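A simple way to sanity-check a pointing result is to test whether the predicted point lands inside a reference box for the same object. This is a minimal sketch under assumed conventions (pixel coordinates, boxes as (x_min, y_min, x_max, y_max)); the helper names are made up for illustration.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside or on the edge of the box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def pointing_accuracy(points, boxes):
    """Fraction of predicted points that fall inside their reference box."""
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)
```

A point carries less information than a box, since it gives no extent, but it is often all that is needed to click, pick, or tap on something.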

[00:06:28]This also works on video, so I have just dragged and dropped one of my own videos, AI generated of course, and I am giving it the word lion. I just want to detect the lion in this video.

[00:06:41]Let's run the inference. I'm not sure if my GPU is able to hold both of these, but let's try it out and check. Let's quickly check the VRAM consumption. So, you see the VRAM consumption has jumped up. So, it is able to do that.

[00:06:55]There you go. So, this is a lion.

[00:06:57]Very nice. So, let's see if it can do a zebra.

[00:07:02]But still, I think you can easily run it in around 12 GB of VRAM, something like that.

[00:07:08]And it doesn't take too long.

[00:07:10]You see, it is consuming just under 12 GB of VRAM, and these are all the zebras, which it has done fairly well.

[00:07:19]So, look, that's it. I think NVIDIA has done well with LocateAnything at just 3 billion parameters. And as you can see on my channel, we have been covering these "anything" models and GUI agents for quite some time. NVIDIA has improved it a lot. Let me know your thoughts.

[00:07:37]Please follow me on X for any AI updates. And if you want to support the channel, please become a member because membership is what keeps the lights on.

[00:07:47]So many people have asked how to become a member.

[00:07:50]Very simple.

[00:07:51]You just go to the home page and there should be this join button. You can click on it to become a member. It is just $4 to $5 a month.

[00:08:00]We already have some members who are really, really great, and I really appreciate their support. So thank you so much and take care of yourself.
