ZAYA1-VL-8B is an 8-billion-parameter vision-language model built around two ideas: bidirectional attention for image tokens (all image patches attend to each other simultaneously, unlike traditional left-to-right processing) and vision-specific lightweight adapter parameters (dedicated parameters for visual processing rather than forcing visual information through the language weights). With these, it outperforms models like Molmo, DeepSeek-VL2, and Qwen3-VL at its active parameter count while being trained on only about 140 billion tokens, compared to the trillions used by competitors.
Deep Dive
ZAYA1-VL-8B: Efficient Open Visual Intelligence - Run Locally
Zyphra is on a roll. After their language model, they have just released this vision model, which is punching well above its weight, beating models twice its size, and doing it all on a fraction of the training data everyone else uses.
And this model is surely turning some heads in the open-source AI world. Let's install it and test it out. This is Fahd Mirza, and I welcome you to the channel. As you can see, we have already covered their language model in great detail, and it was pretty impressive. So, I'm quite curious to see what exactly this new model is all about.
It's an 8-billion-parameter vision-language model, meaning it can look at images and answer questions about them, read documents, understand charts, and even help with GUI and computer-use tasks. What makes it genuinely impressive is the efficiency story, if you look at this chart. While most competing models of this size are trained on trillions of vision-language tokens, ZAYA1-VL achieves comparable or better results having seen only around 140 billion tokens. That is a dramatically leaner training run. It is a mixture-of-experts model, so despite having 8 billion total parameters, only 700 million are actually active during inference, and that was the same story with their language model, as we saw.
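To put the mixture-of-experts numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the two figures quoted in the video (8 billion total, 700 million active); both are rounded, so treat the result as approximate.

```python
# Figures quoted in the video; both are rounded, so the result is approximate.
total_params = 8_000_000_000    # total parameters across all experts
active_params = 700_000_000     # parameters actually used per token at inference

active_fraction = active_params / total_params
print(f"Active share of parameters: {active_fraction:.2%}")
```

So under these figures, fewer than one in ten parameters does work for any given token, which is why the comparisons in the video are made at the active parameter count.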
Now, if you look at this chart, which shows the average performance, it tells you that the model has beaten models like Molmo, DeepSeek-VL2, and even Qwen3-VL at its active parameter count.
The great thing is that it is fully open source under the Apache 2.0 license, so you can use it commercially anywhere. Let's get right into the installation, and we will talk more about its architecture later. I'm going to use this Ubuntu system. My GPU card is an Nvidia RTX 6000 with 48 GB of VRAM. If you're looking to rent a GPU at a very good price, you can find the link to Massed Compute in the video's description, with a 50% discount coupon code for a range of GPUs.
Our virtual environment is done. Let's now install all the prerequisites. Make sure that you install Transformers from this ZAYA1 branch.
It is going to take a couple of minutes.
Everything is now installed. Let me launch my Jupyter notebook.
And now let's download our model.
And the model is being downloaded.
And the model is loaded. Let me quickly show you the VRAM consumption.
It is consuming just a touch over 26 GB of VRAM.
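As a rough sanity check on that reading, here is a minimal sketch of how much memory the weights alone would need. The 16-bit precision is an assumption, since the video does not say which dtype the model was loaded in. In a typical mixture-of-experts setup all experts are kept in memory even though only some are active per token, so the estimate uses the total parameter count.

```python
total_params = 8_000_000_000   # total parameter count quoted in the video
bytes_per_param = 2            # ASSUMPTION: 16-bit weights (bf16 or fp16)

# Memory for the weights alone, in GiB (1 GiB = 1024**3 bytes).
weight_gib = total_params * bytes_per_param / 1024**3
print(f"Estimated weight memory: {weight_gib:.1f} GiB")
```

Under that assumption the weights come to roughly 15 GiB. The rest of the observed usage would plausibly come from things like activations, caches, and the GPU runtime itself, but that breakdown is a guess and was not measured in the video.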
Not a bad VRAM consumption for a vision model. I have seen lower too, by the way. But anyway, let's check it out on some prompts.
Okay, so first up I'm going to do a hard OCR one. This is a real old newspaper, and I'm going to ask the model to extract just the headlines from this image. So, it should be extracting this one, two, three, four, and five. It should extract five headlines. It shouldn't be extracting any title or anything like that. Let's see how this works.
And the model is working through it. Let me scroll down. There you go.
The first one with the comma, the second, the third, and the fourth. I think it has done pretty well. So, Mr. MP is there.
Amazing stuff. Okay.
Let's do another one.
The next test is about a handwritten letter. So, this is the one.
And it's a very different sort of handwriting from the olden days.
So, not only am I asking the model to extract the text from this image, but also to identify the nature of the text. Let's see how it goes. Does it just say it's a letter to the editor, or something else? Should be fun to see.
The model has returned the response, but as you can see, it didn't really extract the text. All it has done is explain what the text is about, and then it talks about the nature of the text, which it describes as informal and critical.
It has done that part pretty well, and I think it has explained the text pretty well too, but it didn't extract it. So I would call it a fail. I'm going to change the prompt and see if it follows the instructions.
And I'll just clear it and then rewrite it.
So I have repeatedly told the model to just extract the text.
Don't do anything else.
And the model has extracted the text, and I have read through it: it has done perfectly well. It didn't even miss a comma here, which is quite good.
Maybe these lines are a bit longer, but that is nitpicking. Other than that, the model has done really, really well. Let's check whether it is multilingual.
Okay, so I'm going to give it an image and ask it to identify the languages used in the text in that image. I made an AI-generated airport sign with various languages on it.
So let's see if it is able to recognize these languages, and if one of these is your language, you're also welcome to tell me. English, Japanese, Korean, and Russian.
I think it has done pretty well.
Well done.
In the next one I have expanded this test. I'm giving it a lot of languages, as you can see here.
I'll just scroll down.
And if I go back, I'm also asking it to extract the text and then identify the languages.
So, this is what it has produced. Let me scroll down.
There you go. It says that this image contains multilingual text with translations of the English proverb "the early bird catches the worm." So, I think it has done well.
And if I check the languages, and please also help me out in the comments, what do you think?
These are all the languages.
Some of them it has totally missed.
It didn't even bother attempting them. The Southeast Asian languages it couldn't do.
I'm not sure about various other languages, so let me know in the comments.
Some of them it has done well, the well-known ones, but most of them I don't think it got. So, multilinguality might not be the forte of this model.
Before I do one last test around invoicing, let's talk about the architecture.
So, the architecture looks quite nifty.
There are two things that Zyphra is trying to do differently. The first problem Zyphra identified is that standard vision-language models process image tokens the same way they process text: left to right, causally, as if one patch of pixels needs to wait for the previous one before it can be understood. But images do not work that way. A city skyline does not have a reading order. So, ZAYA1-VL-8B uses bidirectional attention for image tokens, meaning every image patch can attend to every other image patch at the same time, the way vision actually works.
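The idea is easier to see as an attention mask. Below is a minimal sketch in plain Python, not the model's actual implementation: text tokens keep the usual causal mask, while image tokens can also see later patches in the same image. The function name `build_attention_mask` and the rule that each contiguous run of image tokens counts as one image are assumptions made for illustration.

```python
def build_attention_mask(is_image):
    """Return mask[i][j] == True when position i may attend to position j.

    is_image: list of bools, True where the token is an image patch.
    Text tokens stay causal. Image tokens additionally see every token in
    the same contiguous run of image tokens, including later ones.
    (Treating each contiguous run as one image is an assumption.)
    """
    n = len(is_image)

    # Give each contiguous run of image tokens its own span id.
    span_id = [None] * n
    current = -1
    for i, flag in enumerate(is_image):
        if flag:
            if i == 0 or not is_image[i - 1]:
                current += 1
            span_id[i] = current

    # Start from the standard causal mask: attend to self and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open up full attention inside each image span.
    for i in range(n):
        for j in range(n):
            if span_id[i] is not None and span_id[i] == span_id[j]:
                mask[i][j] = True
    return mask


# Two text tokens, three image patches, two text tokens.
layout = [False, False, True, True, True, False, False]
mask = build_attention_mask(layout)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the three image rows share the same pattern, since each patch sees the whole image, while the text rows still form the familiar causal triangle.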
The second innovation is vision-specific LoRA parameters.
Rather than forcing visual information through the exact same weights used for language, the model activates a dedicated set of lightweight adapter parameters whenever it is processing image tokens.
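Here is a toy sketch of what "adapter parameters that only switch on for image tokens" could look like. It assumes the adapters are low-rank (LoRA-style), meaning a small down-projection `A` followed by an up-projection `B` added on top of a shared weight `W`. The function names, the toy sizes, and the exact place the adapter is applied are all hypothetical, not taken from the model's code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapted_linear(x, is_image, W, A, B, scale=1.0):
    """Shared weight W for every token; low-rank update B(Ax) for image tokens only."""
    out = matvec(W, x)
    if is_image:
        low_rank = matvec(B, matvec(A, x))  # down-project, then up-project
        out = [o + scale * d for o, d in zip(out, low_rank)]
    return out


# Toy sizes: hidden size 3, adapter rank 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]           # shared weights (identity, for readability)
A = [[1.0, 1.0, 1.0]]           # rank x hidden (down-projection)
B = [[0.5], [0.0], [-0.5]]      # hidden x rank (up-projection)

x = [1.0, 2.0, 3.0]
text_out = adapted_linear(x, False, W, A, B)   # adapter skipped
image_out = adapted_linear(x, True, W, A, B)   # adapter applied
print(text_out)
print(image_out)
```

The same input gives different outputs depending on whether it is flagged as an image token, and the extra cost is only the two small matrices rather than a second full copy of the weights.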
You can think of it as having a specialist dedicated just to the visual work. These two changes together let the model integrate visual and textual understanding far more cleanly than the standard approach, and that is why they are claiming that it performs quite well, and we have seen in some of the examples that it does. Now, let me take you to the next and final example. So, I'm going to give it an image of a form.
I think I already have it there. This is a form.
And there are various fields in it. So, I'm asking it to extract the fields from this image and put them in a comma-separated list.
So, the output is comma-separated. It says type, default text, maximum length, and I think "run macro on" is where it missed, because this is the entry here.
And I also don't see the text format.
It's not perfect, but it is there. So, we are just stretching the limits here. But anyway, I think it's always good to see a new player in the market. Zyphra is active again after a year, which is good. We have already covered their language model, which was pretty good, as you can see in this video from just 2 days ago. That's it. Let me know what you think. Please like the video and subscribe, and consider becoming a member. Please also follow me on X if you're looking for AI updates.