The model’s unified architecture elegantly balances computational efficiency with structural accuracy, making it a highly practical tool for local multilingual document processing. However, its reliance on synthetic training data leaves a clear performance gap when faced with the unpredictable nuances of human handwriting.
Deep Dive
Nemotron OCR v2: Fast Multilingual OCR Model: Run Locally on CPU
While Qwen and Gemma were ruling the AI scene over the last couple of weeks, Nvidia released version two of their Nemotron OCR model, and that is what we are going to install and test in this video. It is their latest multilingual OCR model, which reads text from images across five languages: English, Chinese, Japanese, Korean, and Russian, all in a single unified model.
So, let's get right into it.
I'm going to use this Ubuntu system. I have this GPU card, an Nvidia RTX A6000 with 48 GB of VRAM. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute, with a coupon code for 50% off a range of GPUs.
So, our virtual environment is done.
The next step is to git clone the repo of this model, and I will drop the link to it in the video's description.
Let me git clone it. That is also done.
Let's install all the prerequisites with the build command from the root of the repo, and that is going to take a few minutes.
While that happens, let's talk more about the architecture of this model. So, look, under the hood, it's very simple.
It runs a three-component architecture.
The first one is a RegNetX convolutional backbone that, in simple words, detects text regions.
Then we have a transformer-based recognizer that transcribes those regions, and then a relational model that figures out the reading order and document structure, like columns and tables. We are going to test all of these shortly.
The backbone's feature maps are shared across all three components, by the way.
So, the expensive image processing happens only once. That's exactly where the speed comes from, which they have shared in some of their benchmarks.
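To make the "compute once, reuse three times" idea concrete, here is a minimal toy sketch. It is not the model's real code: the class and function names (`SharedBackbone`, `detect`, `recognize`, `order`) and the "feature map" itself are made up for illustration, and the only point is that the expensive step runs a single time per image.

```python
from dataclasses import dataclass


@dataclass
class SharedBackbone:
    """Hypothetical stand-in for the convolutional backbone; counts its runs."""
    calls: int = 0

    def extract(self, image: list[list[int]]) -> list[float]:
        self.calls += 1
        # Toy "feature map": the mean intensity of each image row.
        return [sum(row) / len(row) for row in image]


def detect(features: list[float]) -> list[int]:
    """Toy detector: rows with a mean intensity above 0 count as text rows."""
    return [i for i, value in enumerate(features) if value > 0]


def recognize(features: list[float], regions: list[int]) -> list[str]:
    """Toy recognizer: emits a placeholder token for each detected region."""
    return [f"token@{i}:{features[i]:.1f}" for i in regions]


def order(features: list[float], regions: list[int]) -> list[int]:
    """Toy relational step: top to bottom; a real one would use the features."""
    return sorted(regions)


def run_pipeline(backbone: SharedBackbone, image: list[list[int]]) -> dict:
    features = backbone.extract(image)  # the expensive step, done once
    regions = detect(features)
    return {
        "regions": regions,
        "text": recognize(features, regions),
        "reading_order": order(features, regions),
    }


backbone = SharedBackbone()
result = run_pipeline(backbone, [[0, 0], [3, 5], [0, 0], [2, 2]])
print(result["regions"], backbone.calls)
```

All three downstream steps read the same `features` list, so adding another head would not add another backbone pass.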
Finally, this multilingual variant has 84 million parameters and was trained on around 12 million synthetic images, which gives it pixel-perfect labels at the word, line, and paragraph level without any manual annotation.
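As a rough back-of-the-envelope check on how small 84 million parameters is, the sketch below computes the size of the raw weights at a few common precisions. Which precision the released weights actually use is not stated in the video, so the byte widths are assumptions, and real memory use will be higher once activations and runtime overhead are included.

```python
def weight_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


PARAMS = 84_000_000  # parameter count quoted in the video

# Assumed storage widths; the video does not say which one applies.
for label, width in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_mb(PARAMS, width):.0f} MB")
```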
And meanwhile, everything is installed.
Now, one prerequisite you should already have on your system is Docker. I have this version of Docker installed.
You can also run it without Docker, but this route is fairly easy. If you don't know how to install Docker, just search my channel. Then launch the Gradio demo from the root of the repo with the help of Docker.
The first time you run this, it is going to download all the layers, which takes a bit of time.
And then it is going to launch our Gradio demo. Meanwhile, please feel free to hype the video, like it, subscribe to the channel, and consider becoming a member, as that helps a lot. Please also follow me on X if you're looking for AI updates. Over the course of the last few years, we have covered hundreds and hundreds of OCR models from each and every provider, open source and closed source, but mainly open source. Not only that, we have covered a lot of tooling around OCR. So the channel is filled with OCR stuff, and that is what has brought you here. There's plenty of choice for every use case.
Check it out, and let us know how you go.
And the model is loaded. It's quite a small model, and now our demo is running on our localhost at port 7860. Let's quickly check out the VRAM consumption as the model is now loaded.
It's not consuming anything. It is a very lightweight model, consuming a decent amount of CPU but nothing on the GPU.
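If you want to script the same VRAM check instead of eyeballing it, something along these lines should work. It is a sketch that assumes `nvidia-smi` is on the PATH; the helper names are my own, and the query flags used are the standard `--query-gpu` and `--format` options.

```python
import subprocess


def parse_memory_used(csv_text: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi's headerless CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def gpu_memory_used_mib() -> list[int]:
    """Ask nvidia-smi for the memory currently in use on each GPU."""
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(completed.stdout)


# Example (needs an NVIDIA driver installed):
#     print(gpu_memory_used_mib())
```

Calling it once before and once after an inference run gives a quick view of how much the model actually puts on the GPU.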
So, Nemotron OCR is running. Let me upload an image from my local system.
I'm going to start with a multilingual OCR test, and for that, let me upload an image here.
I'm going with this welcome sign, maybe from an airport, which shows "welcome" in all five languages we mentioned earlier.
Then I'm going to select the multilingual model, and there are a few output modes here.
If you select layout mode, it returns the full document structure with bounding boxes, reading order, and paragraph and line grouping.
Word mode is just a bounding box plus text for each individual word.
So, if you want good accuracy on dense text, maybe this is the one.
Sentence mode returns sentences, which gives quite clean, readable output.
And paragraph mode groups everything into paragraphs, if you just want a clean, structured extraction. I'm going to go with layout, and the multilingual model is selected.
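One way to picture how these modes relate is as different roll-ups of the same word-level records. The sketch below uses a made-up record shape (`text`, `box`, `line`, `para`) and invented sample words; the actual output schema of the demo is not shown in the video, so treat every field name here as an assumption.

```python
from collections import defaultdict

# Hypothetical word-level records: text, box as (x, y, width, height),
# plus the ids of the line and paragraph each word belongs to.
words = [
    {"text": "Invoice", "box": (10, 10, 80, 20), "line": 0, "para": 0},
    {"text": "#42", "box": (95, 10, 30, 20), "line": 0, "para": 0},
    {"text": "Acme", "box": (10, 35, 50, 20), "line": 1, "para": 0},
    {"text": "Ltd", "box": (65, 35, 30, 20), "line": 1, "para": 0},
    {"text": "Total", "box": (10, 90, 50, 20), "line": 2, "para": 1},
    {"text": "due", "box": (65, 90, 30, 20), "line": 2, "para": 1},
]


def group_text(records: list[dict], key: str) -> list[str]:
    """Join word texts that share the same value for `key`, in first-seen order."""
    groups: dict[int, list[str]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["text"])
    return [" ".join(parts) for parts in groups.values()]


word_view = [(w["text"], w["box"]) for w in words]  # word: box plus text per word
line_view = group_text(words, "line")               # line grouping, as in layout
paragraph_view = group_text(words, "para")          # paragraph: coarsest grouping

print(line_view)
print(paragraph_view)
```

The finer views keep more geometry, while the coarser ones trade it for readability, which matches the trade-off described for the modes.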
It is generating it. Now, let's check the VRAM consumption.
Still, you can see that it was all running on CPU at first, and now it has gone to the GPU, but it is still just around 820 MB or so. You can easily run it on CPU. So, these are the detected regions, and I think it has detected everything. Detection is fairly easy here because everything is so well structured, and this is the extracted text.
Now, visually checking, I think it has done well, but if you speak any of these languages, please let me know, especially for Chinese, Korean, and Japanese.
Let me know what you think. The Russian looks quite good to me.
Anyway, that was a fairly easy task in terms of multilinguality. Let's try a few more.
The next one is a bit harder. I'm going to test it on a 100-year-old Russian advertisement with pre-revolutionary spelling that even modern OCR models choke on. If you're a native Russian speaker, please let me know about the results. I'm going with layout. There you go. How quick that one is.
Now, checking it visually, it seems to be a mixed bag here.
Some of it it has read correctly, including some of the pre-reform spelling.
And the number 10 is correct, as you can see. But for the rest, I think it has got some of it partially right and some of it wrong. What is very obvious here is that it handles clean, well-printed Cyrillic quite well.
But it struggles with decorative, stylized fonts where Cyrillic letters visually resemble Latin ones. The large artistic headers are the failure points here, not the small newspaper body text.
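The Cyrillic versus Latin confusion is easy to see at the code-point level: letters like Cyrillic "С" and Latin "C" look the same but are different characters. The sketch below flags words that mix both scripts and swaps Latin lookalikes back to Cyrillic. This is a possible post-processing heuristic of my own, not something the model or the demo does, and the lookalike table is deliberately small.

```python
import unicodedata

# Latin capitals mapped to Cyrillic capitals that look almost identical.
LATIN_TO_CYRILLIC = {
    "A": "\u0410", "B": "\u0412", "E": "\u0415", "K": "\u041a",
    "M": "\u041c", "H": "\u041d", "O": "\u041e", "P": "\u0420",
    "C": "\u0421", "T": "\u0422", "X": "\u0425",
}


def script_of(char: str) -> str:
    """First word of the Unicode character name, e.g. 'LATIN' or 'CYRILLIC'."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def mixed_script(word: str) -> bool:
    """True when a word contains both Latin and Cyrillic letters."""
    scripts = {script_of(c) for c in word if c.isalpha()}
    return {"LATIN", "CYRILLIC"} <= scripts


def repair(word: str) -> str:
    """In mixed-script words, replace Latin lookalikes with Cyrillic letters."""
    if not mixed_script(word):
        return word
    return "".join(LATIN_TO_CYRILLIC.get(c, c) for c in word)


# Latin "C" followed by Cyrillic "ОК": looks like the Russian word for juice.
suspicious = "C\u041e\u041a"
print(mixed_script(suspicious), repair(suspicious) == "\u0421\u041e\u041a")
```

A check like this only catches mixed-script words; a header read entirely as Latin letters would pass straight through.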
Still, I think it is quite good. Just imagine, such a small model, and it has extracted some of it quite correctly. It was a hard one.
Next up, I'm going to test this invoice because it has structured blocks: header, line items, total, footer.
Let's go with paragraph mode here because hopefully it is going to show cleanly whether the model can understand the structure or not. There you go. Very, very fast, very quick on the GPU.
Now, you see it has identified all the sections pretty well.
Hasn't missed a beat there.
And if I check the extracted text, I think it has done pretty well.
It didn't miss anything, even some of the exclamation marks. A pretty good result for an invoice, I would say.
Okay, the next one is this handwritten one, and you can read through it. It's quite a funny letter to the editor.
So, I'm just going to go with the layout. I just want everything here.
And by the way, these images which I'm getting are from Wikimedia.
So, let's run this. It has done pretty well in terms of the sections, so let me quickly read through the extracted text.
Reading through it, I think it has done pretty well for handwritten cursive text, but there are a few mistakes here. For example, I'll just go up.
This word "elbow", it has completely missed or ignored, maybe because it's quite a safe, moral, or ethical model and just didn't want to say it. It has also missed some of the other cursive words.
So, word-level recognition is solid.
Reading order, I think, is a failure.
Cursive with irregular line spacing throws off the model's line grouping here, too.
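To see why irregular spacing hurts reading order, consider a naive line-grouping heuristic that puts a word on the current line whenever its vertical center is close to the previous word's. The model's relational component is learned and certainly more sophisticated, so this is only an illustration of the underlying difficulty; the coordinates below are invented.

```python
Word = tuple[str, float, float]  # (text, x_left, y_center)


def group_into_lines(words: list[Word], tolerance: float) -> list[str]:
    """Group words into lines by vertical proximity, then sort each line by x."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Join the current line if this word sits close to the last one added.
        if lines and abs(word[2] - lines[-1][-1][2]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


# Printed text: two well-separated baselines.
printed = [("Dear", 0, 10), ("Editor", 50, 11), ("I", 0, 40), ("write", 20, 41)]
# Cursive: slanted baselines and tight spacing, so the rows drift together.
cursive = [("Dear", 0, 10), ("Editor", 50, 16), ("I", 0, 21), ("write", 20, 27)]

print(group_into_lines(printed, tolerance=6))
print(group_into_lines(cursive, tolerance=6))
```

With the printed layout the two lines come out intact, while the drifting cursive rows collapse into one line and the words get interleaved, even though each word was read correctly.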
Individual words it has read fine, I think, but let me know your thoughts, too.
I'm also going to test it with word mode. So, let's do word.
And you see it's the same result here.
Similar sort of thing.
They also have an English-only model. Let's check it out.
It is taking a bit of a time. Let's check the VRAM consumption.
Same VRAM consumption here.
And now the English-only model is better than the multilingual one for this English task, as you can see here.
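If you want to put a number on "better" rather than judging by eye, character error rate (CER) is a common choice: the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The strings below are made up for illustration and are not outputs from the video.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; 0.0 is a perfect match."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example strings, only to show how the comparison would be run.
reference = "my elbow hurts"
output_with_gap = "my hurts"
output_exact = "my elbow hurts"

print(round(character_error_rate(reference, output_with_gap), 3))
print(character_error_rate(reference, output_exact))
```

Running both models on the same image and comparing their CER against one reference transcription makes the English-only versus multilingual comparison repeatable.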
Pretty interesting.
Okay.
I think by now you know the limitations and strengths of this model, and you can judge whether you can use it for your own use case or not. Let me know in the comments.
That's it. Please like the video and subscribe and consider becoming a member. Please also follow me on X and thank you for all the support.