This video brilliantly strips away the AI hype to reveal that modern intelligence is essentially just high-dimensional geometry. It correctly identifies that the battle for better AI is won or lost at the representation layer before a single calculation even begins.
Deep Dive
Vectors all the way down: AI Mental Model
Hello everyone, and welcome to the third lecture in this series, the great mental models of artificial intelligence.
Today we are going to cover the third mental model: vectors all the way down.
I'm very excited about this lecture. The more I look at generative artificial intelligence, and at how good we have become with text, images, audio and video, the more I see that beneath all of it there is only one trick: whichever media format you have, you turn it into vectors.
Whether it's text, image, audio or video, you first turn it into vectors, and only then does the modeling really begin. So this video is also my tribute to vectors. Vectors and matrices may seem simple, but they are at the heart of all the artificial intelligence revolutions we are seeing around us right now. Whether you interact with Siri, Alexa or Claude Code, or whether you're using Google Maps, any artificial intelligence system you use has vectors as its building blocks.
So today's mental model is a tribute to that. If you keep it in mind, new innovations won't seem as complex to you. You will say: okay, it's again vectors all the way down.
So in school you met vectors and matrices, right? You looked at them as little grids of numbers, and at that time it seemed like: okay, I'm just learning this in school, it's so boring, I just have to answer questions in an exam and move to the next grade. Nobody mentioned to you at that time that you were looking at the foundation of the entire modern artificial intelligence era. Every large language model, every image generator, every voice that talks back to you: what is left, all the way down, is just vectors and matrices.
So I want you to think back to your school or college days and recall what you thought about vectors and matrices. Was it just a boring subject, one of those many things in school that made you wonder why you were being taught it? Why is my mathematics professor teaching me about vectors and matrices?
What's the use of this?
Well, the use is that it's literally the building block of everything in artificial intelligence right now. So let's get started. First, people say that computers don't understand words, only numbers. And that's true. A machine really only speaks numbers.
This is the primary reason why vectors are so important in artificial intelligence.
But no one really explains what this means. What does it mean that a computer doesn't understand words, only numbers?
What it means is that computers run on processing chips, which you can think of as the engine of a computer, where all the calculations happen.
A processor is built out of transistors, and a transistor at its very core is just on or off. It's zero or one.
When you wire these transistors together, you can actually build logic.
You can build an arithmetic unit, and you get a machine which, as a simplified picture, does two things to numbers: it adds and it multiplies.
Everything else, subtraction, division, the entire tower, can be built up from simple operations like these.
That is all a machine can do. It has a bunch of transistors, which are inherently switches: on or off, zero or one. When you wire multiple switches together, you can build logic out of them.
So a machine, deep down, is a processor that does arithmetic on numbers.
That is the complete vocabulary of the machine: numbers in, then you add and multiply, and numbers out.
So transistors are wired into logic gates, the logic gates form what is called an arithmetic logic unit (ALU), and so on. But it's numbers in and numbers out.
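To make the "switches wired into logic, logic wired into arithmetic" idea concrete, here is a minimal sketch in Python. It is not how any particular chip is laid out; the function names (`AND`, `OR`, `XOR`, `full_adder`, `add_8bit`) are hypothetical, chosen only for illustration. It builds 8-bit addition out of nothing but one-bit gate operations.

```python
# Toy model: integer addition built only from one-bit logic gates.
# All names here are illustrative, not taken from any real hardware library.

def AND(a: int, b: int) -> int:
    return a & b

def OR(a: int, b: int) -> int:
    return a | b

def XOR(a: int, b: int) -> int:
    return a ^ b

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits and return (sum_bit, carry_out)."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out

def add_8bit(x: int, y: int) -> int:
    """Ripple-carry addition of two 8-bit numbers, one bit at a time."""
    result, carry = 0, 0
    for i in range(8):
        bit_x = (x >> i) & 1
        bit_y = (y >> i) & 1
        sum_bit, carry = full_adder(bit_x, bit_y, carry)
        result |= sum_bit << i
    return result  # any carry out of the top bit is dropped

print(add_8bit(19, 23))  # 42
```

The point is only that "numbers in, numbers out" can be assembled from on/off decisions; real arithmetic units use far more elaborate circuits.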
So if you want a computer or a machine to even do anything with poems, a face, a song or a film, you have no choice in the matter, you first must turn that media into numbers.
Why do you have to turn it into numbers?
Well, for the computations to actually proceed at the transistor level, you have this arithmetic logic unit, which can only deal with numbers.
So the same holds for any form of media. Let's say you have a song, and you want an AI to look at the lyrics and set them to another melody.
First the AI, or the machine, has to understand what that song means, right?
For that, the song has to be converted into numbers, because the machine can only process numbers and nothing else.
Similarly for videos: if you want AI to create new videos, or to analyze a video or an image, that video or image must first be converted into numbers. So everything must become numbers. Whether it's a page of text, a framed photo, an audio waveform or a film strip, all of it needs to be converted into numbers.
So we have text, we have image, we have audio and we have video.
These are four great rivers of human media.
Language models write, diffusion models paint, and newer models generate whole videos and even voices. This feels like four separate things, right? But one image sums it all up. We now live in an age of generative AI where it's actually difficult to know what is real and what is AI generated. There is a website called This Person Does Not Exist: every time you refresh it you get a new image of a face, and none of these images are real. All of them are generated by artificial intelligence.
Anyway, we now live in the age of generative AI, where AI can create text, audio and video, and it can even draw images. For all of this there are models: we have a model for language, a model for images, and models for audio and video. But the starting point of all of these models is the same. Everything that goes into a model becomes a vector. Before any clever model gets to do a single thing, every piece of media is first turned into a list of numbers. Only then can we do whatever we want with it. That's why I say vectors all the way down. Let's see how it happens in the different media forms around us.
If you look at language, it's text. Of course, computers cannot understand text. When we say "the cat sat on the mat", the computer cannot understand it, because it first needs to be converted into numbers. There is a whole field devoted to how you take text and convert it into numbers. The first thing you do is tokenize, which means you divide your text into smaller units. Take "the cat sat on the mat". You can do word-based tokenization, where every word is a token: "the" is one token, "cat" is one token, "sat" is one token, "on" is one token. Or take a word like "unhappiness": with word-based tokenization, this whole thing is one token.
In character-level tokenization, every character is one token. And you can even do subword tokenization, where parts of the word are each one token.
How do you find out where to cut, and what level of atomicity one token should have?
That's a huge area of active research, and there are algorithms like byte pair encoding (BPE) and the SentencePiece tokenizer.
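The three levels of tokenization can be sketched in a few lines of Python. This is a toy illustration, not a real tokenizer: the subword split of "unhappiness" is written by hand, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical. The last part shows a single merge step in the spirit of byte pair encoding, where the most frequent adjacent pair of tokens is fused into one new token.

```python
from collections import Counter

text = "the cat sat on the mat"

# Word-level: every whitespace-separated word is one token.
word_tokens = text.split()

# Character-level: every character is one token.
char_tokens = list("unhappiness")

# Subword-level: hand-written split for illustration only.
# A real subword tokenizer learns where to cut from data.
subword_tokens = ["un", "happi", "ness"]

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return max(pair_counts, key=pair_counts.get)

def merge_pair(tokens, pair):
    """Fuse every non-overlapping occurrence of `pair` into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# One BPE-style merge step, starting from characters.
chars = list(text)
best = most_frequent_pair(chars)   # ('a', 't'), from cat / sat / mat
after_merge = merge_pair(chars, best)

print(word_tokens)
print(best, len(chars), "->", len(after_merge))
```

Repeating the merge step many times over a large corpus is, roughly, how a subword vocabulary gets built; where to stop is one of the design choices the lecture is pointing at.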
Everyone finds modeling very cool, right? But a lot of underexplored research actually lies before the modeling stage, in tokenization. There is so much scope for brilliant research here. Everyone just looks at modeling, but I keep telling everyone enrolled in my classes: look at what's happening before modeling. That area gets far less attention, and maybe you can be the one to invent a new tokenization algorithm.
Anyway, the first step is tokenization. Let's say you have done word-based tokenization of "the cat sat on the mat". Tokenization still does not convert the text into numbers. It is just a way of splitting the text into atoms. These are called the atoms of language.
Every token can be looked at as an atom of language. And then you need to convert the tokens into vectors.
So what is usually done is that every token is given a token ID and that token ID is converted into a vector in a higher dimensional space.
The way it's usually done is that you keep a token embedding matrix, which is essentially a lookup table. You look up the token ID, and out comes the vector. This happens before the language model does anything else with your text. Go to ChatGPT and try to picture it. When I type something like "explain today's news simply" or "summarize key national news clearly and neutrally", the first thing that happens is tokenization: the prompt is converted into individual tokens. The second thing that happens is that those tokens are looked up in a lookup table, and each one is converted into a vector. That's what's happening underneath, before anything even goes into the model and you get this brilliant output.
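The token ID and lookup-table steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the vocabulary is built from one sentence, `embedding_dim = 8` is an arbitrary small number, and the embedding matrix is filled with random values, whereas in a real language model those rows are learned during training.

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Vocabulary: each distinct token gets an integer ID.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# Token embedding matrix: one row per vocabulary entry.
# Random here for illustration; a real model learns these rows.
embedding_dim = 8  # assumed; real models use hundreds or thousands
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

# The "lookup" is plain row indexing: token ID in, vector out.
vectors = embedding_matrix[token_ids]

print(token_ids)      # [4, 0, 3, 2, 4, 1]
print(vectors.shape)  # (6, 8)
```

Both occurrences of "the" map to the same ID and therefore to the same row of the matrix; anything that distinguishes them by position is added by the model later.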
Just open the black box. The same is true before you get any answer on any language platform. Let's say I go to Gemini, right?
Let's say I ask it for travel trip ideas. When I press enter, I see an output, but what is happening behind the scenes is that this prompt is converted into tokens, those tokens are converted into token IDs and then into vectors, and those vectors are then processed by the language model. So what I mean to say is that modeling is one thing, but the tokenization and vectorization that come before the modeling are equally important and underappreciated.
I hope that the next time you interact with ChatGPT, Claude or Gemini, you will realize that what you learned in school or college is what's actually happening under the hood. The text is tokenized and converted into vectors. I find this brilliant, because it's at the heart of all generative AI, not just language models.
What I mentioned above was for language models. You can apply the same logic to other types of media as well. Let's take a look at images now.
Again, a computer really does not understand images, right?
So how do you convert images into numbers? Well, an image is already almost numbers. It's a grid of pixels, and each pixel is just a few values. If you take any image and zoom in far enough, you'll see that it is just a bunch of pixels, and each of these pixels has some value.
So those are numbers. That's how computers process images in the first place. Every image is a grid of pixels, and every pixel has a value based on its intensity.
If it's an RGB image, each pixel has three values, one each for red, green and blue. If it's a black and white image, like this pixelated one, each pixel has a single value. It looks like an image, right? But the more you zoom in, the more you see that these are pixels, and the value of each pixel corresponds to its intensity. That's how you get an image, and that's how AI actually processes images.
So modern vision models take one more step. They cut the image into small square patches. They flatten each patch into a vector and treat those patch vectors as tokens exactly like words.
Now this is incredible. When I explain tokenization to people, it's very easy to explain for text: you take your text, you split it into individual tokens, and you convert those tokens into vectors. For images it's a bit harder to picture, but it's amazing, right? You take an input image and you break it into patches.
Let's say you break it into four patches. You have the pixel values of each of these patches.
You flatten each patch into a vector, and you lay those flattened vectors down side by side.
That's the model input for image models right now.
So take any modern vision model, or really any vision language model; Gemini and GPT, for example, are vision language models. When they process an image, this is broadly what they are doing: they cut the image into patches, each patch is just a block of pixels, and those pixel patches are flattened into vectors.
So you cut the image into small square patches, you flatten each patch into a vector, and you treat those patch vectors as tokens, exactly like words. An image becomes a sentence of patches. The machine does not see the image; it just sees a sequence of tokens as the model input.
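The patch idea can be sketched with NumPy. This is an illustrative sketch, not the preprocessing of any specific model: the function name `image_to_patch_vectors` is hypothetical, the image is a tiny fake 4x4 RGB array, and real vision models usually follow the flattening with a learned linear projection, which is left out here.

```python
import numpy as np

def image_to_patch_vectors(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Cut an (H, W, C) image into square patches and flatten each into a vector."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # Split height and width into (number of patches, pixels per patch).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Bring the two patch-grid axes to the front: (rows, cols, ph, pw, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, one column per pixel value in that patch.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(4 * 4 * 3).reshape(4, 4, 3)  # tiny fake RGB image
patch_vectors = image_to_patch_vectors(image, patch_size=2)

print(patch_vectors.shape)  # (4, 12): four patches, each a 12-number vector
```

The four rows of `patch_vectors` are the "sentence of patches": the model receives them as a sequence, just as it would receive a sequence of word vectors.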
That was images. For audio it's different. Sound is a wave: air pressure changing with time. And the first thing we do is convert that wiggle, that wave, into numbers.
But the modern trick is even more beautiful than that. A neural codec such as SoundStream or EnCodec runs the waveform through a convolutional encoder and compresses it down to a slow stream of latent vectors.
This is all sounding very complex right now, so let's start very simply.
Audio essentially starts as a wave, a wiggle in time, and you can sample its amplitude.
You sample it into numbers. That's the simple trick. Modern neural audio codecs add something called residual vector quantization. No need to get worried about it. Essentially, the audio passes through this more complex architecture and out come discrete audio tokens, each of which stands for a vector in a codebook. So the audio is also converted into vectors. Remember that when you give an audio sample to any model to process, it first passes through this big pipeline. I'm not going through this pipeline right now, but what comes out of it are vectors, similar to the vectors we got for images.
So even audio is converted into a vectorized format. You can look up "SoundStream: An End-to-End Neural Audio Codec".
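Both steps, sampling the wave and residual vector quantization, can be sketched with NumPy. This is a toy under loud assumptions: the "latent vectors" are just raw 16-sample frames of a sine wave, and the codebooks are random, whereas codecs like SoundStream and EnCodec use a learned convolutional encoder and learned codebooks. The names `nearest_code` and `residual_vector_quantize` are hypothetical.

```python
import numpy as np

# 1) Sampling: measure the wave's amplitude at regular time steps.
sample_rate = 16_000                    # samples per second (assumed)
t = np.arange(160) / sample_rate        # 160 time stamps = 10 ms
wave = np.sin(2 * np.pi * 440 * t)      # a 440 Hz tone as plain numbers

# Stand-in for encoder output: chop the samples into 16-number frames.
frames = wave.reshape(-1, 16)           # shape (10, 16)

def nearest_code(vectors, codebook):
    """Index of the closest codebook row for each input vector."""
    diffs = vectors[:, None, :] - codebook[None, :, :]
    return np.linalg.norm(diffs, axis=-1).argmin(axis=1)

def residual_vector_quantize(vectors, codebooks):
    """Quantize in stages: each codebook encodes what earlier stages missed."""
    residual = vectors.copy()
    reconstruction = np.zeros_like(vectors)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        chosen = codebook[idx]
        reconstruction += chosen
        residual -= chosen
        indices.append(idx)
    # One row per frame, one discrete token per quantizer stage.
    return np.stack(indices, axis=1), reconstruction

# 2) Three random codebooks of 32 entries each (learned in a real codec).
rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=s, size=(32, 16)) for s in (0.5, 0.25, 0.125)]

tokens, reconstruction = residual_vector_quantize(frames, codebooks)
print(tokens.shape)  # (10, 3): each frame becomes three discrete audio tokens
```

Each discrete token is an index into a codebook, and summing the indexed codebook rows gives back an approximate vector for the frame. That is the sense in which audio tokens "are" vectors.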
These are essentially modern encoders which take any sound wave and convert it into vectors. Anyway, what I'm trying to say is that before an audio model looks at any audio, the audio is first converted into vectors.
Similarly for video. Video looks like the most complex medium, right? So many things are happening at one time, and there are multiple images stacked together. So what is a video? A video is just a stack of frames over time.
Take a system like OpenAI's Sora. It's a video generator, and there are many video generators like it which create video purely from text.
Now the modern tokenizers, the kind behind systems like Sora, first compress the video into a smaller latent volume in space and time with a 3D encoder, then slice that volume into space-time patches, little cubes each spanning a patch of the picture and a few frames at once.
This is a space-time patch: it spans a few frames over time, and each such cube is then flattened into a vector.
So you take a video and pull some frames out of it. Just like we created patches in images, here too we create patches in each frame, but now we stack the matching patches from several frames together in time. You get a little cube like this, which is then converted into a vector.
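The cube-cutting step can be sketched with NumPy in the same way as image patches, with one extra axis for time. This is a simplified sketch: it works directly on raw pixels and skips the latent compression with a 3D encoder mentioned above, and the function name `video_to_spacetime_vectors` and the tiny 4-frame video are made up for illustration.

```python
import numpy as np

def video_to_spacetime_vectors(video, frames_per_cube, patch_size):
    """Cut a (T, H, W, C) video into space-time cubes and flatten each one."""
    t, h, w, c = video.shape
    assert t % frames_per_cube == 0, "frame count must divide evenly"
    assert h % patch_size == 0 and w % patch_size == 0, "frames must tile evenly"
    nt, nh, nw = t // frames_per_cube, h // patch_size, w // patch_size
    # Split time, height and width into (number of cubes, size within cube).
    cubes = video.reshape(nt, frames_per_cube, nh, patch_size, nw, patch_size, c)
    # Bring the cube-grid axes to the front: (nt, nh, nw, ft, ph, pw, c).
    cubes = cubes.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per cube, one column per pixel value inside that cube.
    return cubes.reshape(nt * nh * nw, frames_per_cube * patch_size * patch_size * c)

video = np.arange(4 * 4 * 4 * 3).reshape(4, 4, 4, 3)  # 4 frames of 4x4 RGB
vectors = video_to_spacetime_vectors(video, frames_per_cube=2, patch_size=2)

print(vectors.shape)  # (8, 24): eight cubes, each a 24-number vector
```

Each row covers a 2x2 patch of the picture across 2 consecutive frames, which is exactly the "patch stacked in time" described above.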
So in images, the image is cut into patches and each patch is converted into a vector, right? A video is a stack of images, so we take a frame and get its patches, but we aggregate those patches over time, and the resulting cube is then converted into a vector.
I don't want to go into too many details of audio tokenization, video tokenization and so on, because I don't want you to be confused or intimidated. What I want to mention is that no matter what form of media you take, text, image, audio or video, and we have now looked at all four, the first step is always to take that media and convert it into vectors. Only then does the modeling happen. Many people neglect this first step of vectorization, because all the cool stuff is in the modeling, right? But if you pay equal, or close to equal, attention to vectorization, you'll see that it's really worthwhile to spend time on it, because it might lead to open frontiers.
So at the end of this lecture, the mental model you need to carry with you is this. Everyone stares at the model: the architecture, the parameter count, the leaderboard. That's where the attention and the hype are. But remember that the model never actually touches your real data. It only touches the vectors. So it's very important to understand the process which converts the data into vectors.
What this means is that a huge part of the outcome is already decided before the model runs at all, in the tokenizer: how you chop up the data and how you convert it into vectors.
So when something isn't working, maybe the fault is not in the model itself.
Maybe the fault is in how the data is tokenized and vectorized. How is the text tokenized? Are the image patches the right size? Is the audio codec throwing away the very thing you cared about? These are the questions you should ask. So, in one picture: most people stare at the model, but maybe the real work is happening before the model, in the vectorization step.
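One way to act on "look before the model" is to measure what the tokenizer does to your data before training anything. The sketch below is a hypothetical check, not a standard tool: `tokenizer_report` is a made-up helper that counts how many words of some sample texts fall outside a fixed word-level vocabulary. A high unknown rate would suggest the problem sits in tokenization rather than in the model.

```python
def tokenizer_report(texts, vocab):
    """Rough health check for a word-level tokenizer with a fixed vocabulary."""
    total, unknown = 0, 0
    for text in texts:
        for word in text.lower().split():
            total += 1
            if word not in vocab:
                unknown += 1  # this word would collapse to an "unknown" token
    rate = unknown / total if total else 0.0
    return {"tokens": total, "unknown": unknown, "unknown_rate": rate}

# Toy vocabulary and sample data, made up for illustration.
vocab = {"the", "cat", "sat", "on", "mat"}
samples = ["The cat sat on the mat", "The dog sat on the rug"]

report = tokenizer_report(samples, vocab)
print(report["unknown"], "of", report["tokens"], "tokens are unknown")  # 2 of 12
```

The analogous checks for other media would be questions like how much detail survives a given patch size, or what an audio codec's reconstruction sounds like, before blaming the model.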
So the next breakthrough might come in the tokenizer and the vectorization stage, where almost no one is looking. I feel that vectors all the way down is one of the most important mental models of artificial intelligence. It's incredible how far we have come with vectors and matrices. Vectors and matrices even shape how GPUs are made.
Right? GPUs now run almost all modern AI models. The reason GPUs work so well and are so fast is that matrix multiplications are highly parallel operations, and GPUs are built to exploit that. Again it comes down to matrices and vectors. So the more you appreciate vectors and matrices, the better you will understand artificial intelligence. Look at the attention mechanism, for example. Let me just go to this image.
Yeah, if you look at the attention mechanism, it involves several matrix multiplications. If you are comfortable with vectors and matrices, you'll see that almost all modern generative architectures have vectors and matrix multiplications at their heart. They really are the building block of everything. So before even starting out to understand artificial intelligence, you should get comfortable with vectors, matrices and then tensors. I've seen that the people who are comfortable writing matrices out on paper on their own have the strongest foundational knowledge of artificial intelligence, because ultimately it all comes down to vectors and matrices. It's vectors all the way down.
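To see those matrix multiplications with their dimensions written out, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration under assumptions, not the slide from the lecture: the sizes (6 tokens, 8-dimensional embeddings, 4-dimensional heads) are arbitrary, the weight matrices are random rather than learned, and masking and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention, no masking."""
    Q = X @ W_q                                  # (n_tokens, d_head)
    K = X @ W_k                                  # (n_tokens, d_head)
    V = X @ W_v                                  # (n_tokens, d_head)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V, weights                  # (n_tokens, d_head)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 8, 4              # assumed toy sizes
X = rng.normal(size=(n_tokens, d_model))         # one vector per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)  # (6, 4) (6, 6)
```

Counting them up: three multiplications to form Q, K and V, one for the scores, and one to mix the values. Writing the shapes next to each line, as in the comments, is the paper-and-pencil exercise recommended here.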
So put some time into writing matrices out on a whiteboard or a piece of paper. It takes time, but I assure you it will lead to a much better understanding of the attention mechanism, and not just attention but many aspects of artificial intelligence. It will also help you read and understand research papers more easily. Take the DeepSeek paper, which became very famous. It's a very difficult paper to read. If you scroll down, all of these expressions are matrix multiplications, and if you're intimidated by them you will not be able to understand the paper. On the other hand, if you work out the dimensions, or ask a coding agent or AI model about the dimensions, and you write the matrices and the multiplications on a piece of paper, you'll see: okay, this is the latent KV cache, and so on. Then you'll start to understand it a lot better, and you'll also feel a lot more confident.
Thank you, everyone. I hope these lectures are useful to you. I'm trying to make them creative without going into too much technical detail. I want to keep this at the level of a great mental model, so that it helps when you are faced with a new problem, discovering something, or attempting to debug something. Here, for example, we found that maybe the fault is not in the model but in the tokenization or vectorization. I hope this mental model series helps you in a broad way: in research, in building products, in solving AI problems when you are stuck, in building projects for industry, and so on.
The next mental model which we are going to look at is compression. I'll see you in the next class.