NVIDIA's Nemotron 3.5 ASR is a 600 million parameter streaming speech recognition model that supports 40 languages through a cache-aware fast conformer RNN-T architecture with language ID prompt conditioning, enabling real-time punctuated transcriptions with configurable chunk sizes as low as 80 milliseconds while consuming minimal memory (439 MB) and allowing deployment on CPU systems.
Deep Dive
NVIDIA Ships Nemotron 3.5 ASR Streaming 0.6b: Run Locally on CPU
I present to you Nemotron 3.5 ASR, a model from NVIDIA, which you just saw in action. Let me play the original audio and check.
So, this transcription was done instantly by this new ASR model from NVIDIA, a 600 million parameter streaming speech recognition model that handles 40 language locales from a single unified model.
In this video, we are going to show you how to install it, and we are going to test it out. This is Fahad Mirza, and I welcome you to the channel.
Recently, I have also started a free weekly AI newsletter, which you can subscribe to from the home page of the channel or at fahadmirza.substack.com. You will also find the link in the first pinned comment below this video.
Coming back to this model, this was just released. It uses a cache-aware fast conformer RNN-T architecture with language ID prompt conditioning to deliver punctuated, capitalized transcriptions in real time with configurable chunk sizes as low as 80 milliseconds.
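To make the chunk sizes concrete, here is a minimal sketch of how a streaming client might slice audio into fixed-duration chunks. The 16 kHz sample rate and the helper names `samples_per_chunk` and `iter_chunks` are assumptions for illustration only. They are not taken from the video or from NVIDIA's API.

```python
def samples_per_chunk(chunk_ms, sample_rate_hz=16_000):
    """Number of audio samples in one chunk (16 kHz is an assumed rate)."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples, chunk_ms, sample_rate_hz=16_000):
    """Yield consecutive fixed-duration slices; the last one may be shorter."""
    size = samples_per_chunk(chunk_ms, sample_rate_hz)
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


# One second of silent placeholder audio at the assumed 16 kHz rate.
one_second = [0.0] * 16_000
chunks_80ms = list(iter_chunks(one_second, chunk_ms=80))
print(samples_per_chunk(80), samples_per_chunk(320), len(chunks_80ms))
```

Under these assumptions, an 80 ms chunk holds 1,280 samples and a 320 ms chunk holds 5,120. Smaller chunks mean lower latency but more frequent calls into the model.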
At the moment, this is running on an Ubuntu system. I have one GPU card, an NVIDIA RTX A6000, with 48 GB of VRAM. You can see that the model is consuming just 439 MB. That's it. So, you can easily run it on your CPU. Let's do a few more samples.
Next, I have just uploaded this English file.
Definitely, it was not English. It was Arabic. If you speak that language, please let me know what you think. I think it has done wonderfully well in terms of not only speed, but also quality. I'm auto detecting the language, but you can also specify the language.
As shown here, it supports 40 languages from across the world.
Still, I believe the coverage is a bit low, but again, it is just 600 million parameters.
That's it. Let me upload an English one, and then we'll go from there.
So, this is an English audio, and all of these files are from Google's FLEURS dataset, which is used for ASR evaluation.
This particular audio is very low volume. So, please listen carefully. We're actually testing whether the model can figure this out. Let me play this.
>> However, due to the slow communication channels, styles in the West could lag behind by 25 to 30 years.
>> And it has done really well. Even with low volume and very low pitch, it was able to detect whatever was being said.
Okay, let me upload another language from my local system.
I'm just going to randomly select one, and then we will go from there.
So, this is uploading.
Let me transcribe this.
Let me play this.
Yeah, pretty good.
And let's do another.
Please let me know in the comments if you find any mistake.
That's very low.
It was French. I couldn't really hear it because the volume is very, very low, but I have already checked it out. It looks fine to me. So, the quality of this model is quite good.
Let's do it. I believe this one is German, but let's check.
And these are original, natural human voices. I didn't want to use robotic ones, as you have also mentioned a few times on the channel. That is why I'm going with these natural voices.
Cool. We will do a few more, but for now, let's go back and have a quick look at the architecture of this model, because I believe there is one great thing happening here.
First, have a look at this pipeline.
This is the overall flow. Audio in any of the 40 languages goes into a cache-aware fast conformer RNN-T core, and out comes properly punctuated text with an optional language tag. What makes this clever is the language ID prompt sitting inside the model itself. So, instead of running 40 separate models or a separate language detection component, you pass a single language ID in at inference time, and one model handles everything. Set it to auto, as we just saw, and the model figures out the language on its own, tagging each utterance in the output.
If you look at this architectural diagram, this is exactly how language conditioning works inside the model.
Audio goes through the fast conformer encoder, producing an acoustic embedding of shape D by T. In parallel, the language ID is encoded as a 128-dimensional one-hot vector and broadcast across every time step, which gives you a K by T tensor. These two are concatenated along the feature axis into a D plus K by T tensor. Then, a linear projection layer squashes it back to D before feeding the RNN-T decoder. So, it is an elegant and computationally cheap way to inject language identity at every single frame rather than just at the start.
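The concatenate-then-project step described above can be sketched in plain Python. This is only an illustration of the idea, not NVIDIA's implementation: the function names, the random weights, the toy sizes, and the time-major layout (a list of T frames, each with D features) are all assumptions made for the example.

```python
import random

NUM_LANG_SLOTS = 128  # K: size of the one-hot language vector quoted in the video


def one_hot(index, size=NUM_LANG_SLOTS):
    """Return a one-hot list with a 1.0 at the given language index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def condition_on_language(frames, lang_index, weight, bias):
    """Append the language one-hot to every frame, then project D+K back to D.

    frames: T frames of D features each (stand-in for the encoder output).
    weight: D rows of D+K columns, bias: D values (hypothetical linear layer).
    """
    lang_vec = one_hot(lang_index)
    out = []
    for frame in frames:
        fused = frame + lang_vec  # concatenation along the feature axis
        out.append([sum(w * x for w, x in zip(row, fused)) + b
                    for row, b in zip(weight, bias)])
    return out


D, T = 4, 3  # toy sizes, far smaller than a real encoder
rng = random.Random(0)
frames = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(D + NUM_LANG_SLOTS)]
          for _ in range(D)]
bias = [0.0] * D

conditioned = condition_on_language(frames, lang_index=7, weight=weight, bias=bias)
print(len(conditioned), len(conditioned[0]))  # prints: 3 4
```

Because the same one-hot vector is appended to every frame, the projection effectively adds a language-specific offset at each time step, which is why the language identity is present at every frame rather than only at the start.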
And then, the usual stuff: word error rate, or WER, across 15 transcription-ready languages at a 320 ms chunk size.
And you can see that the WER, especially for English and German, is in the 8 to 9% range.
There are some smaller languages where it is not really good, but most of it, I think, appears okay. The biggest gap for me appears on Ukrainian and Hindi, where auto detect added a few extra percentage points of WER, which suggests that these languages benefit most from explicit language hinting in production. Let's quickly test it out with Hindi.
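For readers who have not met the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference, divided by the number of reference words. Below is a minimal sketch using a standard edit-distance table. The function name and the example sentences are made up for illustration, and real evaluations usually normalize casing and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word ("the") and one extra word ("by").
reference = "styles in the west could lag behind"
hypothesis = "styles in west could lag behind by"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 2 edits / 7 words
```

On this scale, a WER of 8 to 9% corresponds to roughly one word error in every eleven or twelve reference words.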
So, I'm going to first do the Hindi with auto detect. Let's do transcribe.
Let me play this.
Okay. So, I can't really read this. So, if you are a Hindi speaker or reader, please let me know what you think. Let me reload this webpage and then actually select the language ID and run it again.
I'm just going to upload that Hindi file again from my local system as it happens.
Let's select this language ID.
This is the one, I guess. IN is India.
Let me transcribe and it's the same audio file.
What do you think?
Let me know in the comments, please.
Okay, let's do few more languages quickly.
And just to stretch the model, I have selected Urdu.
Urdu is not in the supported list, and I'm just going to go and do the auto detect.
Anyway, it has already done it.
I'll just say transcribe because Hindi was selected.
So, let's see if it is able to do some job while it doesn't really understand it. I don't think so. The transcription is way off and the script is Hindi.
You see?
It cannot really handle any language outside of its training dataset. Let's select a few more languages. I'm just going to randomly select another one.
I will click transcribe and then we will wait and check out.
Yeah, another one is, I believe, Japanese with a bit of noise. Let's see how the model does.
Let's do Brazilian Portuguese.
Yeah, not bad.
Let's play another one.
And let's do Russian.
And this is Vietnamese. Another interesting bit I found is that for a few of the languages, auto detect doesn't really work. This was the case with Vietnamese. Let me play this.
And now I am doing Thai, and you see that it has finished, but it didn't do anything here.
I'm just going to select Thai from here and do it again.
Interestingly enough, for Thai, it's not even doing it in auto detect or even the language ID. I have even tried with accuracy of most, but it didn't work. So, I think Thai language is not really there yet.
But look, other than that, I think given the size of the model, the performance is quite impressive. Let me know your thoughts in the comments and also the feedback about your own language.
Please follow me on X for any AI updates, and if you want to help out the channel, please become a member. Thank you for all the support.