The attention mechanism in language models captures relationships between tokens by computing attention scores through dot products of query and key vectors, where each token's query vector determines how much attention it should pay to other tokens' key vectors, with softmax normalization converting these scores into weights that are then applied to value vectors to produce context-aware representations; multi-head attention further enhances this by using multiple parallel attention heads to capture different semantic perspectives simultaneously.
Deep Dive
LLM Interview Series : What is Attention?
Hello everyone, and welcome to this video. Today I'm going to look at an interview question which is at the heart of language models, and at the heart of why the coding agents around us work so well: why ChatGPT, Claude, and Gemini work so well, and how they are able to capture the context of what humans are saying. The question is this: you go to an interview and the interviewer asks, "Can you explain to me what attention is?"
So the question being asked is: what exactly is attention? As we have covered many times in this interview series, there are multiple levels of depth at which you can answer this question. Today I'm going to cover two. The first level I will call conceptual, and the second I will call the matrix level.
First, let's try to understand the need for the attention mechanism at a conceptual level. I like to take some sentences for this so that we can illustrate it better. Take the example "Harry boarded the train and he ...". If you look at the token "he", how does the model know that "he" is related to "Harry"? Keep that example in mind. For the second example, say you talk about India, and later in the conversation you ask, "What is the capital of this country?"
Somewhere the context needs to be stored that we are talking about India, so that "country" refers to India. Or say you are writing a piece of code and someone says, "Make some changes in the first 10 lines" or "in the first 20 lines of code." Somewhere that context needs to be captured: the context of what comes at one point in a paragraph or sentence and what has come before it.
It's the way humans do it. When someone is speaking to me, I don't remember everything they say, but I keep some context in my mind, so if they later ask me a question based on what they just told me, I can make the link. When I refer to context, I mean that the model needs to capture the relationship between one word and its neighbors.
So we need to capture the relationship between "he" and the neighbors which come before it, all of those tokens. And not just that: once we capture that relationship, we should also give weightage to which tokens are more important and which are less important when you look at a specific word. Take another example: "The dog chased the ball. It could not catch it."
If you look at the first "it" and compare it with the past tokens to decide where it should pay more attention, it should pay more attention to "dog", whereas the second "it" should pay more attention to "ball".
So the first "it" refers to the dog and the second "it" refers to the ball. If I am a token and I look at the past tokens, I need to know which tokens to pay attention to and how much attention to pay to each of them.
The attention mechanism answers exactly these questions: which tokens should I pay attention to, and how much attention should I pay to the tokens which come before me? I hope that makes clear why we need the attention mechanism in the first place. Without it in the language model architecture, tokens would have no clue about their neighbors, so how would the context be captured?
Say you talk about a detective story, then you talk about something else, like geography, and then you again ask questions about the detective story. There needs to be some capturing of what you are asking now and what has come before.
So you cannot treat all the tokens as independent. There have to be links between tokens when you interact with the model. What the attention mechanism does is capture those links: the link between a given token and all the tokens which come before it.
Now, how does it do that? That's where the mathematics comes in, and all of you should be able to write this down on a board. So now I'm looking at the mathematics of attention.
The question is: when I'm looking at a particular token, how do I know how much attention needs to be given to the previous tokens? That's the question we are solving right now. But first, let us place the attention block in the context of the whole LLM architecture.
Let's say this is my whole LLM architecture.
Within this there are three blocks. First there is the input block. What does it contain?
You have the input tokens, say "Harry boarded the train", or just two words, "Harry boarded". You have the token embeddings and you have the position embeddings. These two are added to give the input embeddings.
That's the input block.
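The input block described above can be sketched in a few lines of NumPy. This is only an illustration: the toy vocabulary, the random embedding tables, and the names `vocab`, `token_emb_table`, and `pos_emb_table` are assumptions, and it uses the three tokens "Harry boarded the" with a four-dimensional embedding, as in the walkthrough that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration): 3-word vocabulary, 4-dim embeddings.
vocab = {"Harry": 0, "boarded": 1, "the": 2}
d_model = 4
context_len = 3

# In a real model these tables are learned; here they are random placeholders.
token_emb_table = rng.normal(size=(len(vocab), d_model))
pos_emb_table = rng.normal(size=(context_len, d_model))

tokens = ["Harry", "boarded", "the"]
token_ids = np.array([vocab[t] for t in tokens])

token_emb = token_emb_table[token_ids]            # one row per token, shape (3, 4)
pos_emb = pos_emb_table[np.arange(len(tokens))]   # one row per position, shape (3, 4)

# Input embeddings = token embeddings + position embeddings.
X = token_emb + pos_emb
print(X.shape)  # (3, 4)
```

Each row of `X` is still computed from one token alone, so no row knows anything about its neighbors yet.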
This input block then goes to my transformer block, or my processor. The transformer block consists of, first, a layer normalization. Then comes the part where the attention block sits, which is called multi-head attention. Then we have a dropout, then a shortcut connection, then another layer norm, then a feed-forward neural network, then a dropout, and then another shortcut connection.
That is the transformer block. Then we have the output block, in which we have a layer normalization and the logits. The result of all of this is the next token.
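To make the ordering of the transformer block concrete, here is a rough wiring sketch. It follows the order listed above (layer norm, attention, dropout, shortcut, then layer norm, feed-forward, dropout, shortcut). The function names are hypothetical, the layer norm has no learned scale or shift, and dropout defaults to a no-op, so treat it as a diagram in code rather than a real implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift parameters are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, feed_forward, dropout=lambda h: h):
    # First half: layer norm -> multi-head attention -> dropout -> shortcut.
    shortcut = x
    h = layer_norm(x)
    h = attention(h)
    h = dropout(h)
    x = shortcut + h

    # Second half: layer norm -> feed-forward network -> dropout -> shortcut.
    shortcut = x
    h = layer_norm(x)
    h = feed_forward(h)
    h = dropout(h)
    return shortcut + h

x = np.random.default_rng(0).normal(size=(3, 4))
# Placeholder sublayers, just to show that the shape is preserved.
out = transformer_block(x, attention=lambda h: h, feed_forward=lambda h: h)
print(out.shape)  # (3, 4)
```

The `attention` argument is the slot that the rest of the walkthrough fills in.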
So when someone asks you what exactly attention is and whether you can explain the attention mechanism, what they are really asking is: can you explain what is happening within the multi-head attention block? For the subsequent analysis, we assume there is some input coming to the multi-head attention block. Call that input X, and say it is "Harry boarded the", where each token is a four-dimensional vector. So this is my input matrix, and it is a 3x4 matrix.
Remember that before we come to the attention block, the tokens have no relation to each other. They have not captured any information about each other. So "boarded" has no information that it should relate more to "Harry" than to "the". All the tokens are processed individually up to this point in the LLM architecture. In fact, the attention block is the only place where tokens learn the relationships between each other.
So we have X, and then come the mechanics of the attention mechanism. Right at the start you have three trainable matrices: the trainable query weight matrix, the trainable key weight matrix, and the trainable value weight matrix.
What do I mean by trainable weight matrices? It means the values in these matrices are not known beforehand. We are outsourcing them to gradient descent, since we do not know what the values should be. Now, what are the dimensions of these matrices? The input dimension is constrained, because it has to match the embedding dimension of X, which is four. So here they are going to be 4x4.
The output dimension of these trainable weight matrices can actually change, but here I'm assuming it to be four, the same as the embedding dimension of the input.
When you multiply X by these three weight matrices, what results is the query matrix, the key matrix, and the value matrix. The query matrix will again be 3x4, so it has three rows and four columns.
Why are all of these matrices 3x4? Because that's how matrix multiplication works: the result takes the number of rows of the first matrix, X (three), and the number of columns of the second, the weight matrix (four).
So all of these matrices are 3x4.
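The shape bookkeeping above can be checked directly. In this minimal sketch the weight matrices are random stand-ins for trained values, and the names `W_q`, `W_k`, and `W_v` are just illustrative labels for the query, key, and value weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(3, 4))    # rows: "Harry", "boarded", "the"

# Trainable weight matrices (random here; gradient descent would set them).
W_q = rng.normal(size=(4, 4))
W_k = rng.normal(size=(4, 4))
W_v = rng.normal(size=(4, 4))

Q = X @ W_q   # (3, 4) @ (4, 4) -> (3, 4)
K = X @ W_k
V = X @ W_v
print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```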
Now look at this in terms of meaning. There are three rows in each of these matrices, and each row corresponds to a token: the first row corresponds to "Harry", the second row to "boarded", and the third row to "the". The query matrix is where we start. Remember, in the examples at the start I always said, "when you look at a token, how do the previous tokens matter?" This "when you look at a token" is my query.
The current token which I'm looking at is my query vector, and there are three query vectors here: the first is for "Harry", the second is for "boarded", and the third is for "the". For each of these query vectors I need to find out how much attention to give to the other tokens.
How will I do that? You take the query and you take the keys, and you take their dot product.
I'm not going to show the dot product of the queries and the keys transpose at the matrix level first. I'm showing it at the level of individual vectors.
First take "Harry". The query for "Harry" is a 1x4 vector. I multiply it with the keys transpose: the keys matrix is 3x4, so the keys transpose is 4x3, and when you multiply 1x4 by 4x3 you get a 1x3.
This is the attention scores vector for "Harry", and now we can see what each value represents. The first is the attention score between "Harry" and "Harry". The second is the attention score between "Harry" and "boarded", and the third is the attention score between "Harry" and "the".
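The single-row computation for "Harry" can be sketched as follows. `Q` and `K` are random placeholders standing in for the query and key matrices, so the printed scores carry no meaning. Only the shapes and the pairing of values with tokens matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["Harry", "boarded", "the"]

Q = rng.normal(size=(3, 4))   # placeholder query matrix, one row per token
K = rng.normal(size=(3, 4))   # placeholder key matrix, one row per token

q_harry = Q[0:1]                  # query vector for "Harry", shape (1, 4)
scores_harry = q_harry @ K.T      # (1, 4) @ (4, 3) -> (1, 3)

# Each entry is the dot product of Harry's query with one token's key.
for tok, score in zip(tokens, scores_harry[0]):
    print(f"Harry -> {tok}: {score:.3f}")
```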
So far these matrices, WQ, WK, and WV, are not trained. But once the language model training finishes, the attention score between "Harry" and "boarded" should be the highest, and the scores between "Harry" and "Harry" and between "Harry" and "the" should be lower.
There is one thing I have not mentioned yet, and that is causality. Technically, tokens should only refer to tokens which come before them. So if "Harry" is my query, "Harry" should not peek into the future, and we should not calculate those two attention scores, because it's like cheating. When we pre-train language models we are doing the next-token prediction task, so for a given token we can only look at that token and the past tokens. So for "Harry" we can really only get the attention score between "Harry" and "Harry". Now let's look at "boarded".
When you look at "boarded", that's again a 1x4 vector. You can write this along with me. You multiply it with the keys transpose, which is 4x3, and out comes the attention scores vector for "boarded", which is 1x3.
Each value here again represents the attention score between "boarded" and its neighbors. The first is the attention score between "boarded" and "Harry", the second is between "boarded" and "boarded", and the third is between "boarded" and "the". Remember, we cannot peek into the future, so we cannot compute the third score, but we can compute the first two. In a fully trained language model, we expect the attention score between "boarded" and "Harry" to be very high.
And the attention score between "boarded" and "boarded" is also expected to be high, because those two are literally the same token.
Finally we have the last token, which is "the". "the" will also be a 1x4 vector, which is multiplied with the keys transpose, a 4x3, and the result is 1x3.
These are the attention scores between "the" and all of its neighbors, which are "Harry", "boarded", and "the". In this case, since "the" is the last token, we can technically find all of these attention scores. It's not cheating.
So what I have shown you is how we get these attention scores for each vector separately. We get one vector of scores for each token.
The way people usually demonstrate this is that you take the queries and the keys and multiply the entire query matrix with the keys transpose. That is 3x4 multiplied with 4x3, and the resultant matrix is a 3x3 matrix.
But each row of this 3x3 matrix is what we have already computed before. The first row is the attention scores for "Harry", the second row is the attention scores for "boarded", and the third row is the attention scores for "the".
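A quick check that the matrix form and the row-by-row form agree, again with random placeholder `Q` and `K` matrices (an assumption for illustration, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # placeholder query matrix
K = rng.normal(size=(3, 4))   # placeholder key matrix

# Matrix form: all query-key dot products at once.
scores = Q @ K.T              # (3, 4) @ (4, 3) -> (3, 3)

# Row-by-row form: one (1, 3) score vector per query token, stacked.
rows = np.vstack([Q[i:i + 1] @ K.T for i in range(3)])

print(scores.shape)               # (3, 3)
print(np.allclose(scores, rows))  # True
```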
But you need to understand where these three rows come from, and that's why I showed you the operations as they happen for every single row. This whole matrix is called the attention score matrix.
If the interviewer asks you what the attention score matrix means, you should be able to go through it row by row. The first row belongs to the first query vector, and each value tells me how this query attends to the keys, that is, how "Harry" attends to "Harry", "boarded", and "the". The first value is the attention between "Harry" and "Harry", the second is between "Harry" and "boarded", and the third is between "Harry" and "the".
Similarly, in the second row, the first value is the attention between "boarded" and "Harry", the second is between "boarded" and "boarded", and the third is between "boarded" and "the".
In the third row, the first value is the attention between "the" and "Harry", the second is between "the" and "boarded", and the third is between "the" and "the".
That's how you interpret the attention scores matrix. Now remember what we have seen before: which of these values can actually be computed? We cannot peek into the future. So technically we can only compute the values on and below the diagonal. We can draw a triangle around them, and we cannot compute the values above it. This triangle is what is called causal attention.
The reason it is called causal attention is that we only look at the tokens which have a causal effect: the past has a causal effect on the present, whereas the future does not. So we cannot peek into the future. We can only look at the past. All the tokens can only attend to themselves and the tokens which come before them.
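The triangle can be written as a lower-triangular mask over the 3x3 score matrix. In this sketch the scores are random placeholders, and filling the blocked entries with negative infinity is an assumption borrowed from common practice (it makes those entries vanish under a later softmax). The video itself only says these entries are not computed.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 3))   # placeholder attention score matrix

# True on and below the diagonal: a query may see itself and earlier tokens.
mask = np.tril(np.ones((3, 3), dtype=bool))
print(mask.astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]

# Block the "future" entries above the diagonal.
masked_scores = np.where(mask, scores, -np.inf)
```

Row 0 ("Harry") keeps one score, row 1 ("boarded") keeps two, and row 2 ("the") keeps all three, matching the walkthrough above.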
The language to use is this: each query token can only attend to the keys at or before its own position. So this is the second way to explain attention. If you look at the question "What is attention?", it can be explained in two ways. First, it can be explained conceptually: attention captures the links between tokens. Second, you definitely need to be able to explain attention in a mathematical manner like this. Now, there are several offshoots which can follow from this question. The interviewer may ask you how to get the attention weights from the attention scores. Then you might explain softmax.
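If that follow-up comes, the step from attention scores to attention weights, and then to context vectors, could be sketched like this. The matrices are random placeholders, and the division by the square root of the key dimension is the standard scaled dot-product convention, which the video does not cover, so it is marked as an assumption in the code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # placeholder queries
K = rng.normal(size=(3, 4))   # placeholder keys
V = rng.normal(size=(3, 4))   # placeholder values

d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)   # scaling by sqrt(d_k): assumption, not in the video

# Causal mask: future positions get -inf so softmax gives them weight 0.
mask = np.tril(np.ones_like(scores, dtype=bool))
scores = np.where(mask, scores, -np.inf)

weights = softmax(scores)   # attention weights, each row sums to 1
context = weights @ V       # context vectors, shape (3, 4)

print(weights.sum(axis=-1))  # each row sums to 1
print(context.shape)         # (3, 4)
```

Because "Harry" can only attend to itself, its weight row is `[1, 0, 0]` and its context vector is simply its own value vector.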
Then they may ask what softmax is, and you explain that, and so on. The interviewer may also ask you what multi-head attention is and where it fits into what you have shown so far.
At a broad level, and maybe we'll have this as a separate question, otherwise this one video will become too long, here is what you can answer. What I showed you so far was the workflow for single-head attention. In multi-head attention, you take the trainable query, key, and value weight matrices and split them. If you want to have two heads, you split each into two matrices. So you have WQ1 and WQ2.
This is what I'm showing for two heads. You have WK1 and WK2, and you have WV1 and WV2.
You essentially split each matrix into two parts and then proceed with the exact same calculations for these separate matrices. As a result we will have two query matrices, Q1 and Q2, two key matrices, K1 and K2, and two value matrices, V1 and V2.
This means that Q1 times K1 transpose leads to one attention score matrix, whose size is again 3x3, and Q2 times K2 transpose leads to another attention score matrix.
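A two-head version of the score computation might look like the sketch below. The video does not say along which axis the weight matrices are split. Splitting along the output (column) dimension, so that each head gets a 4x2 matrix, is the usual choice and is assumed here. All values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # input matrix for "Harry boarded the"
W_q = rng.normal(size=(4, 4))    # placeholder query weight matrix
W_k = rng.normal(size=(4, 4))    # placeholder key weight matrix

num_heads = 2

# Split each weight matrix column-wise into one (4, 2) matrix per head
# (the split axis is an assumption, see the note above).
W_q1, W_q2 = np.split(W_q, num_heads, axis=1)
W_k1, W_k2 = np.split(W_k, num_heads, axis=1)

Q1, Q2 = X @ W_q1, X @ W_q2      # each (3, 2)
K1, K2 = X @ W_k1, X @ W_k2      # each (3, 2)

scores_1 = Q1 @ K1.T             # head 1 attention scores, (3, 3)
scores_2 = Q2 @ K2.T             # head 2 attention scores, (3, 3)
print(scores_1.shape, scores_2.shape)  # (3, 3) (3, 3)
```

The value matrices would be split the same way. They are left out because only the score matrices are discussed at this point.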
Both of these are 3x3, the same size as what we got in single-head attention, but the idea is that we now have two attention score matrices. That's the main purpose of multi-head attention: instead of one attention score matrix we have two. Why is that done? Because each attention score matrix can capture a different perspective. Let me give you an example of what I mean by a different perspective. Say the sentence is "The artist painted the portrait of a woman with a brush."
What does this sentence mean? Does it have only one interpretation?
In one interpretation, the artist used a brush to paint the portrait of a woman. In the second interpretation, the artist painted the portrait of a woman who had a brush in her hand.
So there are two interpretations. If you just had one attention score matrix, then maybe the attention score between "brush" and "artist" would be high.
That's not very good, because we need to capture both interpretations. If you have two attention score matrices, as in multi-head attention, there is provision for the other matrix to capture the other variation, where the attention score between "brush" and "woman" is high.
So one attention score matrix captures one perspective, and the other attention score matrix captures another perspective.
As you can see, after this point the interview can diverge into many different places. It can diverge into multi-head attention, into softmax, into how to get the context vector matrix, into multi-query attention, grouped-query attention, multi-head latent attention, and so on. But this is the foundational starting point for answering such questions. You need to answer with passion and with depth. The first part is the conceptual part, which you definitely need to mention, and that comes from the joy you have for the subject. The second part is the mathematical part, which I just explained, and you should be able to explain it like I did, on a blackboard or on a piece of paper. That conveys very strong domain knowledge. And when you answer something, try to ground it in a bird's-eye view. Don't directly start exploring the attention mechanism.
Show where the attention mechanism sits in the whole bird's-eye view. Then start explaining the mechanics, and then stop. Once you stop there, the interviewer will naturally ask you what happens after this: what are the attention weights? What is softmax?
Then it's like you are leading the interview, rather than just answering something quickly and waiting for the interviewer to ask the next question. This conveys depth and it conveys passion.
You need multiple perspectives when answering interview questions. That is what I want to convey in videos like this, and it will only happen when you actually write this down on a piece of paper. I strongly believe that in the age of artificial intelligence, the more you write these concepts down, the more you think about them. When someone asks about multi-head attention, one way is to just say that there are multiple heads: you split the queries, keys, and values into multiple heads, and that's it. A person who has thought about examples themselves would probably have thought about the actual conceptual need for multi-head attention, and would probably come up with an example like this one.
So independent, original thinking also improves interview answers. It's not that you start preparing one month before an interview. When you start studying the subject itself, as you are taking walks, as you are alone, you can think about what it all means, and that will pay off in the interview. Just one last month of preparation will lead to quick knowledge, but it won't form patterns or analogies in your mind. For these analogies to form, you need to start thinking when you learn the subject itself, and for that you need to have a passion for the subject. I hope all of you found value in this video. I'll continue making videos like this to enrich your interview preparation.
Thank you, everyone, and I look forward to seeing you in the next video.