Nanowhale-100m is a compact 110 million parameter language model that successfully implements the DeepSeek-V4 architecture on a single GPU, demonstrating that complex transformer architectures can be miniaturized while maintaining core design principles. The model uses Multi-head Latent Attention (MLA) for query compression, a mixture of experts layer with 4 routed experts plus 1 shared expert, hyper connections with Sinkhorn routing, and an extra multi-token prediction head. Despite its small size, it requires only about 1GB of VRAM and can run on commodity hardware, making it an excellent educational tool for understanding how large language models are constructed.
Deep Dive
Voraussetzung
- Keine Daten verfügbar.
Nächste Schritte
- Keine Daten verfügbar.
Deep Dive
Nanowhale-100m: Fascinating Implemention of DeepSeek-V4 ArchitectureHinzugefügt:
DeepSeek is often called as whale due to their logo plus the way they disrupted AI last year.
That is where someone came up with Nano whale which we are going to cover in this video. Nano whale 100 million is a compact 110 million parameter language model that quietly proves you can run a full DeepSeek V4 architecture on a single GPU card and that is exactly what we are going to do in this video.
This is Fahad Mirza and I welcome you to the channel.
Let me first set the stage by telling you a bit more about this model.
This model is built entirely from scratch with no borrowed weights from DeepSeek. It was first pre-trained for 5,000 steps on 2.6 billion tokens from fine fine web edu.
Then fine-tuned for 3,000 steps on 460,000 chat examples from small talk from hugging face. The result is this chat ready model that fits in a few hundred megabytes, uses the same 129 K token DeepSeek tokenizer as its big sibling and runs comfortably in um any of the commodity hardware which we will also test out shortly. I will be talking more around this architecture but for now, let's get this thing installed.
Uh I believe you can run it on even CPU with good memory but I'm going to use this Ubuntu system with this GPU card Nvidia RTX 6000 with 48 GB of VRAM and this is a URL which I will drop the link in video's description. As I said, you really don't need that much GPU. You can also try it out on your CPU.
I will create a virtual environment but if you're looking for a VM or CPU or GPU to rent for very very good price, you can find the link to Mass Compute in video's description with a discount coupon code of 50% for range of GPUs.
Let me go back to my terminal and let me install all the prerequisites which are torch and transformer and then I'm launching my Jupiter notebook. This is going to take couple of minutes.
While that happens, let's go back to the architecture of this model because I think more than the performance, what exactly this model can do, um it is mainly a research or maybe, you know, some ideas for you to start learning these architecture behind the scene.
So, as I said earlier, the architecture of this model primarily tries to shrink DeepSeek V4 uh and whatever tricks they have used into just eight layers and a 320-dimensional hidden size.
It replaces standard attention with multi-head latent attention or MLA that compresses queries for efficiency, adds a tiny mixture of expert layer with four routed experts, plus one shared expert which is top two routing.
It also swaps regular residual connections for hyper connections that use sinkhorn routing and includes one extra multi-token prediction head to improve training.
Most of its 110 million parameters actually live in the embedding table because of the huge vocabulary leaving the rest of the model surprisingly lightweight.
The real significance is educational which we will talk more about.
Let's quickly grab the model from Hugging Face.
The model is loaded. Let me quickly show you the VRAM consumption.
So, just over 1 gig of VRAM. You can simply run it on your CPU. So, this is my memory at the moment and this is my CPU consumption which you can see is very, very low.
Okay, so that all good. Let's now test it very quickly.
If you look at this code chunk, it's a very familiar standard inference code where we are wrapping our prompt in a chat message format and using the tokenizers chat template to structure it the way the model was trained to expect.
Then we are encoding that into token IDs, moving them to the GPU, and pass them into the model's generate function with some hyperparameters or sampling settings to control creativity.
Finally, we are decoding and decoding only the newly generated tokens back into readable text, stripping any special formatting characters before printing the response. So, let's run this.
And this is the response, which is a total rubbish. Now, the thing is the model's limitation is the actual story here. Um the output, I don't think so it matters because this is a 100 billion parameter model built on the similar architecture as DeepSeek. It's a fascinating experiment.
Um it's like fitting a whale into a thimble, seriously. The architecture is there, the instruction following ability simply isn't at that scale. I think base model is fine. That is why I was saying that uh even in this one, if you go to their hugging face card, I would suggest just picking up this uh I think they had it somewhere. Anyway, I'm not going to check during the video, but um just get their model if it is available, the checkpoint, and then go from there with your own experimentation or supervised fine-tuning.
But this is what the capability looks like at the absolute edge of uh miniaturism of this model if you are looking to just do some exercise in learning the architecture and how exactly these models are built.
So, I'm going to go easy on this model and I will just say once upon a time.
Let's run this.
And let's wait for it to come back.
Even the speed, you can see that it is a bit on the slower side of things.
The whole layers are loaded onto GPU, which is quite a modern one. And there you go.
So, you see that again, it is behaving like a pre-trained model, totally off the tangent, doesn't respond to what I'm asking. But again, as I said, the main story of this model is that how they have miniaturized it. And Nanoveil is actually a Nanoveil. So, don't expect any groundbreaking, earth-shattering performance. But if you're looking to learn the model architecture, I think uh good fascinating experience.
That's it. Let me know what do you think about this. Please follow me on X for any AI updates. And please consider becoming a member if you uh want to help out the channel. Thank you for all the support.
Ähnliche Videos
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
Are AI deceiving us? | Roman Yampolsky, Gleb Solomin #AI #science
shortsGlebSolomin
1K views•2026-06-02
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
AI Doesn't Create Bias — It Inherits It
UXEvolved
176 views•2026-06-01
Distributed Inference Challenges Explained #shorts
alexa_griffith
466 views•2026-05-31
[한글자막] OpenAI @ Replay 2026 | OpenAI는 Codex로 개발 방식을 어떻게 바꾸고 있을까요?
TechBridge-KR
1K views•2026-06-03
Starting & Test Driving JAKE'S Abandoned BUS from Subway Surfers | POV Restarting
RestartGaragePOV
4K views•2026-06-04
Building the Future of Voice-First Sovereign AI: Sarvam & NVIDIA
NVIDIA
3K views•2026-06-01











