LocateAnything is NVIDIA's 3-billion-parameter vision-language model for precise visual grounding: it locates specific objects in images and videos from natural language descriptions. Trained on 12 million images, it handles object detection, OCR, GUI interaction, and coordinate-based pointing, and it runs locally on consumer hardware with approximately 8-12 GB of VRAM.
Deep Dive
LocateAnything: NVIDIA’s New AI Sees EVERYTHING: Run Locally
Meet LocateAnything from NVIDIA. With a single click, I was able to identify all the pieces of sushi in this single image. I can do the same with videos, among various other tasks. In this video, we are going to install this model locally and test it out. This is Fahd Mirza and I welcome you to the channel.
This new model from NVIDIA seems quite impressive. It's a 3-billion-parameter vision-language model that acts like a superpowered Where's Waldo engine for AI. Instead of just recognizing what's in an image, it pinpoints exactly where objects are.
That could be finding a specific car in a traffic jam, as you can see in the examples on this image, locating a submit button on a cluttered website, or reading text from a complex document.
Trained on a massive dataset of 12 million images, this one is a generalist expert that handles everything from robotics and autonomous driving to automated data labeling, making it incredibly versatile for developers who need their AI to understand spatial relationships in the real world. Let's get the installation underway, and we will talk more about its architecture and what exactly this decoding paradigm evolution is.
I'm going to use this Ubuntu system. I have one GPU card, an NVIDIA RTX A6000 with 48 GB of VRAM. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description, with a 50% discount coupon code for a range of GPUs.
I'm going to start by installing all the prerequisites and you can see that I'm already in my conda environment. It is going to take a couple of minutes.
Everything is installed. Now, I'm just going to launch this app.py, which is primarily a Gradio interface on top of the code they have shared in the Hugging Face model card. I will drop the link to it in the video's description. So, let me run this. It is going to download the model the first time.
And the demo is now running. Let me open it in the browser.
And it should be running on our localhost at port 7860.
There you go. So, it is running at the moment.
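If the browser tab does not load, a quick way to confirm that something is listening on port 7860 is a plain TCP check. This is a minimal sketch using only the Python standard library; the helper name `is_port_open` is made up for illustration, and the check only tells you that a process accepts connections on that port, not that it is the Gradio demo.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a process accepts TCP connections on host:port.

    Hypothetical helper: 7860 is the port mentioned in the video;
    pass a different port if your app.py is launched elsewhere.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is listening at the address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` with no arguments checks `127.0.0.1:7860`, which is where the demo in this walkthrough is served.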
Let's do the detection first. There are various other tasks we can do, which I will go through one by one. Let me upload an image from my local system.
So, let me click here somewhere.
Just a second.
That button wasn't working, so I have just dragged and dropped the image, and now it is working. So, first up, I'm just going to go with the detection task type for this image. I will run the inference. The first time you run this, it takes a bit of time as it loads the model.
And you can see that the model size is just under 5 GB. There are two shards.
And the VRAM consumption is just over 8 GB, not bad at all.
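To read the same VRAM figure from a script instead of watching a terminal, one option is to query `nvidia-smi` and parse its CSV output. The sketch below is an assumption about how you might do it, not what the video uses; the function names are made up for illustration, and `query_gpu_memory` needs an NVIDIA driver with `nvidia-smi` on the PATH.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' pairs (in MiB), one GPU per line."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory of every GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_gpu_memory(result.stdout)
```

On a machine with a GPU, `for used, total in query_gpu_memory(): print(used / 1024, "GiB of", total / 1024)` prints one line per card. Running it before and after the first inference shows how much memory the model load adds.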
Let's go here.
And I didn't give it any category, so it has just detected everything in the image, like all the cars and the bus and everything.
So, for instance, I will remove the rest of it and just say "car" here.
Let's now run the inference. Let's wait for it.
And it has detected all the cars in the front, but not the ones at the back.
It should have caught the back ones, too, but anyway. Let's do people. If you look at the left-hand side, there are people.
Let's see if it is able to detect them.
There you go, it is.
On this side, too, but not these ones.
These are not as visible.
Okay, next up, let's do some grounding here. I'm again just going to drag and drop.
So, in this image, and by the way, all of these images are AI generated.
I'm going to select the task of grounding.
So, there are all these tasks. Detection simply means it finds and draws boxes around all instances of a specific object.
Grounding means it locates specific objects or regions based on a natural language description. Let's say I just ask it for "the red car". There are two cars here, so let's see if it grounds just this red car.
Let's wait. There you go. So, this is a red car.
Let's see if it is able to do the silver car.
It is, you see? So, this is the beauty of this model.
You can also do OCR.
In OCR, it just identifies and draws boxes around all text visible in the scene, regardless of what the words say.
So, for example, let me just quickly show you the OCR.
So, this is handwritten text. I'm just going to run the inference on it to see if it is able to do the OCR. I haven't given it any description here.
There you go. So, it has drawn boxes around every piece of text.
And you see, not on this line, but only on the text.
Which is pretty good.
It has even labeled it. And there are various other options which it has given, but I'm not going to go into those.
The next task is GUI. So, I'm just going to select it and run inference. I have already given it the text "search". I just want to find the search item in this GUI, and it has selected that. So, you can build a GUI agent on top of it by pinpointing that exact element on the screen, and then you can build your tooling around that.
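One piece of that tooling is turning a detected element into a click position. The sketch below assumes the box comes back as `(x1, y1, x2, y2)` in pixels of the screenshot you sent to the model; the video does not show the model's actual output format, so treat that as an assumption and adapt it to what the model card documents. `Box` and `click_point` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in screenshot pixels (assumed format, see above)."""
    x1: float
    y1: float
    x2: float
    y2: float


def click_point(
    box: Box,
    shot_size: tuple[int, int],
    screen_size: tuple[int, int],
) -> tuple[int, int]:
    """Map the box center from screenshot pixels to screen pixels.

    shot_size and screen_size are (width, height). They differ when the
    screenshot was resized before being sent to the model.
    """
    center_x = (box.x1 + box.x2) / 2
    center_y = (box.y1 + box.y2) / 2
    scale_x = screen_size[0] / shot_size[0]
    scale_y = screen_size[1] / shot_size[1]
    return round(center_x * scale_x), round(center_y * scale_y)
```

For example, a box of `Box(100, 40, 300, 80)` on a 1280x720 screenshot of a 2560x1440 screen maps to the click position `(400, 120)`. The resulting pair is what you would hand to whatever automation library drives the mouse.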
And similarly, we have the pointing task type, where I have selected this image, and instead of a box, it predicts a single precise (x, y) coordinate to pinpoint a location or object in the image. As you can see, I have pointed it at the vegetables.
This also works on video, so I have just dragged and dropped one of my own videos, AI generated of course, and I am giving it the word "lion". I just want to detect the lion in this video.
Let's run the inference. I'm not sure if my GPU is able to hold both of these, but let's try it out, and we will see. Let's quickly check the VRAM consumption. So, you see the VRAM consumption has jumped up. So, it is able to do that.
There you go. So, this is a lion.
Very nice. So, let's see if it can do a zebra.
But still, I think you can easily run it with around 12 GB of VRAM, something like that.
And it doesn't take too long.
You see, it is consuming just under 12 GB of VRAM, and these are all the zebras, which it has detected fairly well.
So, look, that's it. I think NVIDIA has done well with LocateAnything at just 3 billion parameters. And as you can see on my channel, we have been covering these "anything" models and GUI agents for quite some time. NVIDIA has improved it a lot. Let me know your thoughts.
Please follow me on X for any AI updates. And if you want to support the channel, please become a member because membership is what keeps the lights on.
So many people have asked how to become a member.
Very simple.
You just go to the home page, and there should be a Join button. You can click on it to become a member. It is just $4 to $5 a month.
We already have some members who are really, really great, and I really appreciate their support. So thank you so much, and take care of yourself.