Subham Kharwal delivers a remarkably lucid breakdown that strips the mathematical intimidation from the attention mechanism, making it accessible without sacrificing technical rigor. It is a masterclass in clarity for those who prefer functional logic over abstract academic jargon.
Deep Dive
How Transformers Work - Attention Explained Step by Step | Chapter 06
On my screen you can see the very popular Transformer architecture. This is from the paper "Attention Is All You Need". Today my goal is to help you understand this whole architecture and why Transformers are the secret sauce behind LLMs and all of the frontier models that you see today.
By the end of this lecture you should be able to identify and understand each block that you see in this architecture, and that is the whole goal of this video. From our past video we understand that in order to predict the next token we provide text as input to the LLM. That input text is converted into tokens. Then unique token IDs are assigned to those tokens, and based on the token IDs a lookup is done on the embedding matrix and a vector embedding is produced. To those embeddings we add another embedding, the positional embedding, so that each token carries its position in the text.
Once this is done we feed this into the Transformer, and the Transformer helps us identify what the next token, or the output, would be. Today we are going to cover this Transformer. We are going to understand exactly what is inside it that helps you predict the next token.
So, hello everyone and welcome back to chapter six of Gen AI for data engineers. Today we are going to cover Transformers in great detail.
Transformers are known to be scary at first, but believe me, once I explain everything it will be much easier for you to understand how they work. So, please make sure that you watch this video carefully, till the end. If you get stuck somewhere, please rewatch that portion to understand the concept. At the end of this video I'll also provide some references that you can check to understand more about Transformers. Before we start, it is important that you understand how vector embeddings and tokenization work. So, if you have not seen our past video, I would recommend you go back and watch that video first.
So, without any further delay, let's begin with transformers. Till now, we know that each token that we have is converted into something called an embedding.
Now, it is important to note that an embedding can be written in a row format or in a column format.
Both of them are the same. Okay? You might see an embedding written across a row, like 1.2, -0.9, 0.8, or you might see the same values 1.2, -0.9, 0.8 stacked in a column. You will find both being used for embeddings. One of these formats is mostly used for calculations, while the other is what you would see in documentation or papers.
Okay? But both of them represent the same thing. Each value in an embedding represents one dimension.
And we already know from our previous video what exactly dimensions are. If you're still not sure, please go ahead and watch the previous video on tokenization and vector embeddings.
Now, transformers are math machines.
Everything that we know, neural networks or transformers, is a math machine. It means some computation or math is always happening inside.
In transformers, basically matrix multiplications are done.
Okay? And we already know matrices from our primary education. A 3 x 2 matrix has three rows and two columns. So, it can hold the values 1, 5, -2, -1, 0, 2, arranged in three rows and two columns. Now, in order to understand transformers well, you have to do a lot of math calculations. I'll try to help you understand how the math works, but if you are confused, you can ignore the math part and you will still be able to understand how transformers work. Just keep one thing in mind for matrix multiplication: if you have one matrix of size m x n and another matrix of size n x p, then your output will be of size m x p.
So, if your matrix is 3 x 2 and the other matrix is 2 x 1, then your output will be 3 x 1.
Okay? So, this is how matrix multiplication works. You just have to keep this in mind for this whole course.
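To make the shape rule concrete, here is a minimal sketch in plain Python. It assumes the six values of the 3 x 2 matrix above are read row by row, and the 2 x 1 matrix `b` is a made-up example, not something from the lecture.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix, giving an m x p matrix."""
    m, n, p = len(a), len(b), len(b[0])
    # The inner sizes must agree: columns of a == rows of b.
    if any(len(row) != n for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(p)]
        for i in range(m)
    ]

# The 3 x 2 matrix from the text, assuming its values are read row by row.
a = [[1, 5], [-2, -1], [0, 2]]
# A hypothetical 2 x 1 matrix, chosen only for illustration.
b = [[2], [1]]

c = matmul(a, b)
print(len(c), "x", len(c[0]))  # rows x columns of the result
print(c)
```

The result has as many rows as the first matrix and as many columns as the second, which is the m x p rule.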
And in this whole course, we are going to take example of GPT-2 because it is easier with smaller numbers to understand transformers. And GPT-2 has 768 dimensions that we already know from our previous video. And this is all that you need to know before you can jump into transformers.
This whole chapter rests on one simple idea, or you could say one big idea.
Just read this statement on my screen.
It says, "I left my phone on the bank."
Right now, we are not sure what this bank represents. Whether it can be a river bank or it can be a bank of money.
Right? We are not sure until we can see what is there in the statement. So, as soon as I say, "Right next to the river," we understand that this bank represents river bank.
Right? So, in order to enrich the meaning of this bank, we have to look at all the other words that are there in the sentence. And that is what is called attention.
You look at all of the other words, or rather tokens, to enrich the meaning of a particular token or word.
Now, right now, this bank represents a river bank. So, think of this. We have embeddings created from tokens, right?
Before we feed those embeddings into the transformer, they have generic values. For example, an embedding would have values like 0.2, 0.9, or say -1.2, and so on, across 768 dimensions.
Because we are talking about GPT-2, the total number of values inside it would be 768. Okay? So, this embedding would be a generic one which only represents a bank; it would not represent a river bank or a money bank.
So, in order to enrich this one with the meaning of river, we have to add some values to it.
Okay? And then it would become an embedding which points towards river bank.
So, let's consider this in a 3D plane.
So, if this is my 3D plane, so this is Y, this is X, and this is Z.
Consider this is your bank on the 3D plane; this is the embedding representing bank. Now, if this is the embedding for river bank, then we need to add something to bank. So, this is the extra portion that we add to the embedding of bank to make sure that it points towards river bank. And this is how we enrich each of the embeddings within transformers, to make sure that they point in the correct direction based on the meaning they pick up from the whole statement. And this is the one simple idea on which the transformer is built.
It performs a lot of steps in order to enrich each token with the meaning that it picks up from the other tokens within the same statement or context. And this very simple idea will help us understand how transformers work.
The first transformer that was built in around 2017, that was used for only one purpose, that was translation.
And translation splits into two parts.
The first part is understanding and the second part is generation. It means first you have to understand the whole statement and then you have to generate the words one by one. And if you notice in our transformer, we had two blocks.
The first block was encoder and the second block was decoder. The first block was used for understanding. Where it used to understand the complete meaning of the input statement. And the second block was used for generation. It generated the complete output statement one word at a time.
So, this was the whole transformer architecture where it took an input which is the cat is black in English and then it converted it into French. There are two blocks. The first one is encoder which is used for understanding and the second block decoder was used for generation.
Let's talk about encoders.
The main purpose of the encoder was to take some input vectors and make them rich and context-aware. And they can only do that by looking both to their left and to their right. Let's take the same example, which is the cat is black. So, I'll just write the cat is black. Now, each word or token that you see here is input as a vector embedding, right? So, consider these are all embeddings. In this case, how would the embedding for cat understand that it is black? The only way to do that is by looking to its right. Similarly, to get more information, it also has to look to its left. Consider the word embedding black. How would it understand that it describes a cat which is black?
Only by looking to its left. So, each word in the encoder was allowed to look both left and right. Okay? Now, when I talk about words, you can think of them as embeddings. So, each input vector embedding was allowed to look both to its left and to its right in order to become rich and context-aware.
Just try to remember the first example that we used. The only way the word bank can understand that it represents river bank is by looking to its right.
Similarly, river has to look to its left to understand that we are only talking about the river bank, nothing else about the river. So, in order to make each word rich and context-aware, the words have to look both left and right. And that was done in encoders. And this is why encoders were good for understanding.
Now, let's talk about the second part, which is decoder.
Now, in case of decoders, they are used to generate one token or one word at a time in order to complete a statement.
So, take the same statement, which is the cat is, and I'll leave the last token empty to be predicted. In this case, each token can only look to its left in order to generate the next token. There's no point in looking to its right, because nothing has been generated there yet. So, the only way to generate the next token is by looking to the left. And this was the basic difference between decoder and encoder.
Where encoder allows each of the vector embedding to look both left and right in order to make them rich and context-aware, decoder allows each vector embedding to look on its left in order to identify or predict what would be the next token. And that is the basic difference. And this is why decoders were used for generation.
Now, there's a big realization that happened. Both encoders and decoders are built with the same building block, which is called a transformer block.
They are just stacked on each other. So, you can think of an encoder as multiple transformer blocks stacked on each other. And that is the same for the decoder. The only difference between the two is the direction in which the embeddings inside the transformer block are allowed to look. Okay? In the case of encoders, the embeddings are allowed to look both left and right. And in the case of decoders, the embeddings are allowed to look only to the left. And that is the sole difference between encoders and decoders. Both are built with the same Transformer block.
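The left-and-right versus left-only rule can be sketched as a visibility table. This is only an illustration: the helper names `visibility_mask` and `visible` are made up here, and the sketch assumes a token can also see itself, which is how causal masking is usually set up.

```python
def visibility_mask(num_tokens, causal):
    """Return mask[i][j] = True when token i is allowed to look at token j."""
    return [
        [(j <= i) if causal else True for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

tokens = ["the", "cat", "is", "black"]
encoder_mask = visibility_mask(len(tokens), causal=False)  # left and right
decoder_mask = visibility_mask(len(tokens), causal=True)   # left only (plus itself)

def visible(mask, i):
    """List the tokens that token i is allowed to look at."""
    return [tokens[j] for j in range(len(tokens)) if mask[i][j]]

print(visible(encoder_mask, 1))  # what "cat" sees in an encoder
print(visible(decoder_mask, 1))  # what "cat" sees in a decoder
```

The block itself is the same in both cases; only the table of who may look at whom changes.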
So, if we are able to understand how each Transformer block works, then we should be able to understand how encoders and decoders work. And that is what we are going to do next. We are going to understand how each Transformer block works. What are the steps involved inside a Transformer block that make sure that each of the embedding is rich and context-aware.
Before we jump into understanding a Transformer block, there is one important point to note. All frontier models that we know today, whether it is GPT or Claude, are built with a single portion, which is the decoder. That is because the only job that GPT and Claude do today is generation, which is predicting the next token. And that is the sole job of the decoder. So, that was the biggest realization. Since both are built with the same Transformer blocks stacked upon each other, we can very well use only decoders to generate the next token. We don't even need encoders. And because of that, there are three families of models. The first one is encoders, which consist only of encoders. The second one is decoders, which are used for generation, like Claude and GPT. And the third one is Transformers, which use both encoder and decoder.
And the important use of the full Transformer is translation.
So, if you want to understand something, models like BERT use the encoder. If you want to generate something, models like GPT use only decoders. And if you want to do translation, you use a Transformer which consists of both blocks, the encoder and the decoder. And these are the three families of models that are available today.
So, if we look at the whole big picture, it would look something like this. You have an input text which says the cat sat on the mat. We are going to generate the next token here. We do a tokenization and embedding which is our very first step, right? So, we do embedding which is tokenization plus embedding and then we also add positional embedding to it. And then one vector per token is done. So, each token is converted into one single vector embedding.
And then those are fed into transformer for generating the next token. Now, within transformer there are multiple transformer blocks which are stacked on each other to make sure that each of the embedding becomes rich and context aware. So, you can see there is one block here, second block here, and similarly there are multiple blocks which are stacked on each other. For GPT-2, around 12 blocks are stacked on each other. For GPT-3, that count is 96.
And we are going to talk about this in detail in a few minutes. For now, you can just think of GPT-2 as having 12 transformer blocks stacked on each other, and GPT-3 as having 96. And once the vectors come out of all of the transformer blocks, they are in the same shape. So, if we have 1, 2, 3, 4, 5, 6 vectors going in, there would be six vectors coming out. The only difference is that they are now very rich, which means they point in the correct direction in the dimensional space. So, each vector is now rich and context-aware. And then that is fed into an output head, which converts those vectors into word scores.
And that allows us to predict our next word, which is the period (.). So, the output head converts the last vector into scores, and those scores help us determine the probabilities which give us our next token. So, this is how the whole flow works. We are going to talk about one transformer block to understand how it works. And that same transformer block is repeated multiple times in different frontier models. For GPT-2, it is repeated 12 times. For GPT-3, it is repeated 96 times. So, we have basically three steps. The first step is embedding. The second step is the stack of transformer blocks. And the third step is the output head, which actually gives us our next token.
And the same process is repeated again and again in order to generate our whole text. So, predicting the next token is the only work that decoder does.
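As a rough sketch of that three-step flow, the toy code below uses a placeholder block that only copies its input, so it shows the shapes and the order of the steps, not real attention. The two-dimension embeddings, the three-word vocabulary, and the `output_head` weights are all made-up values for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def transformer_block(vectors):
    # Placeholder only: a real block runs attention and a feed-forward layer.
    # What matters here is the shape: one vector out per vector in.
    return [list(v) for v in vectors]

def predict_next_token(embeddings, num_blocks, output_head, vocab):
    vectors = embeddings                  # step 1 has already produced these
    for _ in range(num_blocks):           # step 2: the stack of blocks
        vectors = transformer_block(vectors)
    last = vectors[-1]                    # step 3: score only the last vector
    scores = [sum(x * w for x, w in zip(last, weights)) for weights in output_head]
    probs = softmax(scores)
    return vocab[probs.index(max(probs))], probs

# Hypothetical toy values: 3 tokens in, 2 dimensions each, 3 words in the vocabulary.
vocab = [".", "cat", "mat"]
embeddings = [[0.1, 0.3], [0.5, -0.2], [0.9, 0.4]]
output_head = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one weight vector per word

token, probs = predict_next_token(embeddings, 12, output_head, vocab)
print(token, [round(p, 3) for p in probs])
```

To generate a whole text, the predicted token would be appended to the input and the same function called again.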
And that is what GPT does as well because those are based on decoders. And now, we are going to understand the transformer block. For the very first time, you might find transformer block a little difficult, but make sure you watch this video or that portion again to make your understanding clear.
We are here at the most important part of this chapter in order to understand transformer. We are going to understand one of the transformer block and how it works.
On my screen, you can see a lot of steps written. Don't get overwhelmed with it.
We are going to cover each one of these in detail. Before we start with the transformer block, I just want to remind you of one more thing. Each token that we have is converted into an embedding.
And before the embedding is fed into the transformer, it has a generic meaning.
If you remember our previous example, which was bank, before this embedding is fed into the transformer, it contains generic embedding values, right?
So, it has different values, and each value represents a dimension.
And from our previous chapter, we already know that dimensions are used in order to capture meaning for each of the word. And that is how meanings are captured within the vector embeddings.
So, before feeding this bank into transformer, it has a generic meaning.
The transformer adds some more values to it to make sure that it becomes rich and context-aware.
So, what does it mean?
It means this bank would now be either river bank or money bank, based on the statement or the context we provide. Okay? So, this bank might be pointing in some direction in the dimensional space. After putting it into the transformer and making it rich, the values of the embedding change, but remember that the number of dimensions remains the same.
So, earlier it was 768 for GPT-2, and the output will still be 768.
But now it is richer, so it points in a different direction, towards river bank or money bank depending on the statement or context that we provide. In our example, we left our phone on the river bank. So, now this embedding for bank would point towards river bank. Okay? So, this is how the whole thing works. Now, how do we move from bank to river bank? The only way we do this is using attention.
There, every embedding within the statement or context is allowed to look at the other embeddings in order to enrich itself with the meaning that is required. So, that is how the whole thing works. And we are going to cover, step by step, how the transformer does it.
So, from this list we have already covered input embedding because this is what is being fed into transformer. Till now, it has generic meaning. So, it is not pointing to the correct direction.
Now, once the embedding is there, it is normalized. We already know that normalization means scaling the numbers down into a sensible range. Your embedding might have numbers like this: one is 800, one is 0.9, one is 1.2, and one is, say, 900. Okay?
If you notice, some dimensions have very big numbers, like 900 and 800.
So, they would just overpower all the other dimensions in this embedding. Normalization is the step which scales all of the dimensions into a similar range. For example, it can be roughly from -1 to 1. Everything is brought into that range so that no dimension overpowers or overshadows the others.
So, that step is normalization. Once we have our generic embeddings, they are normalized before the attention step.
This is the step which says layer normalization. It simply means that we are normalizing the values into a range to make sure that no dimension overpowers any other one. And after normalization, the next step is attention, where the magic happens. Now, if you notice, I have written something called multi-head, and we are going to discuss that next.
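Here is a minimal sketch of that normalization step, using the four example numbers from above. Standard layer normalization subtracts the mean and divides by the standard deviation, so the result has a mean near 0 and a variance near 1; the values land close to, but not necessarily strictly inside, -1 to 1. Real implementations also apply a learned scale and shift, which this sketch leaves out, and the `eps` value is an assumption.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one embedding to mean ~0 and variance ~1 (no learned scale/shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    # eps avoids division by zero when all values are identical.
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

raw = [800.0, 0.9, 1.2, 900.0]   # the example values from the text
normalized = layer_norm(raw)
print([round(x, 2) for x in normalized])
```

After this step, 800 and 900 no longer dwarf 0.9 and 1.2; all four values sit on a comparable scale.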
In order to understand this part, attention is all you need. So, you have to make sure that you listen very carefully to understand this part. If you are able to understand this part, then everything is done for you in the transformers.
So, let's take an example where we have a text the tired cat slept.
So, we already know that we convert the tokens into embeddings. So, there are four embeddings that would get converted. And we are talking about GPT-2 here. Okay? So, any number that I talk about next would be for GPT-2. And this stays true for all of the frontier models.
I will take example of GPT-2 because the numbers are pretty small like 768. So, it is pretty easy to understand.
So, the vector embedding for each token is represented here in 768 dimensions. And right now, these are all normalized.
So, we are done with our second step.
So, all of the vector embeddings right now are normalized.
Now, how would each embedding enrich itself? For example, how would cat enrich itself to say that I am tired or I slept? Okay? We are talking about encoders here, so we will look both left and right. In this case, cat looks around at all of the other embeddings that are present to work out which ones are relevant to it.
So, cat looks at tired and sees that it is very relevant because it describes the cat. The cat is tired.
Then, it looks at slept and says that this is what I did. Once I was tired, I slept. And then, it looks at the, and the is not so relevant to it. Here, tired and slept are very relevant to cat.
So, once cat has understood the relevance of all the other embeddings, it enriches itself, and now the meaning of cat is a very tired, sleepy cat.
Okay? So, it enriches its embedding to represent a very tired, sleepy cat.
So, now the generic embedding of cat points in a more specific direction in the dimensional space. It is no longer a generic one.
Rather, it is a more specific one which represents a very tired, sleepy cat. So, cat updates itself by blending in a lot of tired, some of slept, and very little of the. Okay? So, cat adds values to its embedding which take a lot from tired, a little less from slept, and very little from the.
Okay? This is how the embedding gets updated. And this is how attention works.
Now, I'm going to talk about attention in much detail and how it works actually inside the transformer. So, I want you to pay very close attention to it.
Now, for GPT-2, we know that each embedding is represented with 768 dimensions. It simply means that we have an embedding where we have around 768 numbers. Each number is representing a different meaning or a different dimension. And we have already understood this with an example in our previous video.
Now, enriching the dimensions can be done separately, because we can treat the dimensions as not correlated with each other. For example, remember the example from our previous video where we represented kitten, cat, shark, and tiger in four dimensions. One of the dimensions was size and another one was dangerous.
Now, size does not have any correlation with dangerous. We can enrich the number which represents size separately, and we can enrich the number which represents danger separately. Since the dimensions do not have any correlation between them, we can process them separately. And this is what is done using multi-head. In each head, we process some of the dimensions and try to enrich their numbers.
In GPT-2, we break down the 768 dimensions across 12 heads so that all the heads can be processed in parallel. If we divide 768 by 12, each head processes 64 dimensions. So, they can enrich 64 dimensions each, separately, on different heads. To break this down, consider that one head might be responsible for identifying nouns and enriching the numbers for nouns.
One head might be responsible for identifying actions and updating the numbers for actions.
Some other head might be responsible for understanding articles and updating the values for articles. When I say article, I simply mean a and the, okay? So, different heads have the responsibility to update different sets of dimension numbers and enrich them. Once all of the heads complete their computation and enrichment on the embedding vectors that they have, we combine all of them to again make a 768-dimension embedding vector. Okay? And this happens for all of the words that go into the transformer.
Here in this example, we have four tokens going into the transformer. So, all of them would be broken down across 12 heads.
Each head would have 64 dimensions to process. And if you notice, since we are processing four of them, we have a context of four. Now, context length is the maximum number of tokens that can be processed at a single time. Okay? Some of the frontier models can process around 1 million tokens at a time and do all of this.
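The arithmetic of splitting 768 dimensions across 12 heads and combining them again can be sketched like this. It only shows the slicing and the recombination; in the usual GPT-2 implementations the slices are taken from the query, key, and value projections rather than from the raw embedding, so treat this as a picture of the bookkeeping, with made-up stand-in values.

```python
D_MODEL = 768                       # GPT-2 embedding size
NUM_HEADS = 12                      # GPT-2 attention heads
HEAD_DIM = D_MODEL // NUM_HEADS     # 64 dimensions per head

def split_heads(vector):
    """Cut one 768-dimension vector into 12 consecutive 64-dimension slices."""
    return [vector[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(NUM_HEADS)]

def merge_heads(heads):
    """Join the per-head slices back into one 768-dimension vector."""
    return [x for head in heads for x in head]

embedding = [float(i) for i in range(D_MODEL)]  # stand-in values, not a real embedding
heads = split_heads(embedding)
print(len(heads), len(heads[0]))
print(len(merge_heads(heads)))
```

Each head can work on its 64 numbers independently, which is what makes the parallel processing possible.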
Now, let's talk about the math behind it. How does cat actually understand that it has to take more of tired, less of slept, and very little of the?
To understand how cat works out that tired is more relevant, slept is a little less relevant, and the is barely relevant to it, consider that cat is at a networking event of word embedding vectors.
In that networking event, cat asks a question, which is a query.
Who is relevant to cat? There are other word embeddings at that networking event, which are tired, slept, and the.
Now, tired has a key which says I describe a noun.
Slept has a key which says I am an action. The has a key which says I am just an article. So, cat looks at the keys of all the different word embeddings and understands that tired is a great match.
Slept is a fairly decent match, but the does not mean much and is a weak match. So, based on the relevance, cat takes some of the value from each of the word embeddings. It understands that tired is more relevant.
So, it takes a lot of tired's value.
Okay? Then it adds some of slept's value. Okay? And very little of the's value. And once it gets that output, it adds that value to its own embedding.
Okay? So, if that is the embedding of cat, it adds that value to itself in order to make itself more relevant, or more context-aware.
So, if you still think this is complicated, let me give you one more example. Consider I want to search for something on Google. My query is I want to learn LLM. Now, different pages come up in the results. Each one has a title, which is the key. Consider one page says, "I teach LLM." Another page says, "I teach transformers." The third page says, "I teach Python." If you look at the three page titles, or the keys of the pages, you understand that the first one is the most relevant. So, you open that page and read most of its content, because it has more value. Then comes the second page, where you see transformers, and you know it is somewhat relevant. So, you go into that page, read its content, and take some of its value. But the third one, Python, is less relevant. So, you open that page and take only a very little of its value. So, in order to build your understanding, you take more value from the first page, some value from the second, and very little from the third. This is how query, key, and value work. And this is how the transformer works.
So, in each head, where for GPT-2 we process 64 dimensions, we have some pre-trained weights for the query, which we write as WQ. Okay? We have some pre-trained weights for the key. And we have some pre-trained weights for the value. Now, your embedding, which is written as E, is multiplied with each of these weights to convert the embedding into three pieces: one is the query of the embedding, one is the key of the embedding, and one is the value of the embedding.
These come from multiplying the embedding by the weights for the query, the key, and the value, which are already present in the head and are pre-trained. And this happens for all 12 heads; we are talking about just one of them. So, once we multiply the weight matrices with the embedding, it gets converted into three vectors. One is the query, which is your "Who is relevant to cat?"
One is the key, which represents whether it describes a noun, an action, or an article. And one is the value, which is what it has to offer.
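The step of turning one embedding into a query, a key, and a value can be sketched with tiny made-up numbers. The names `W_K` and `W_V` simply follow the WQ convention from above, and the 4-dimension embedding with 4 x 2 weight matrices is an assumption for readability; the weights in a trained model are learned, not hand-written like these.

```python
def vec_matmul(vector, matrix):
    """Multiply a 1 x n row vector by an n x p matrix, giving a 1 x p vector."""
    return [
        sum(vector[k] * matrix[k][j] for k in range(len(vector)))
        for j in range(len(matrix[0]))
    ]

# Hypothetical 4-dimension embedding for "cat" and hypothetical 4 x 2 weights.
e_cat = [1.0, 0.0, 2.0, -1.0]
W_Q = [[1, 0], [0, 1], [1, 1], [0, 0]]   # query weights
W_K = [[0, 1], [1, 0], [0, 0], [1, 1]]   # key weights
W_V = [[1, 1], [0, 0], [0, 1], [1, 0]]   # value weights

q_cat = vec_matmul(e_cat, W_Q)   # "who is relevant to me?"
k_cat = vec_matmul(e_cat, W_K)   # "this is what I am"
v_cat = vec_matmul(e_cat, W_V)   # "this is what I have to offer"
print(q_cat, k_cat, v_cat)
```

The same three weight matrices are applied to every token's embedding, so each token ends up with its own query, key, and value.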
So, for our example, we have four embeddings. The first one is the, the second one is tired, the third one is cat, and the fourth one is slept.
Each one of the embeddings is converted into three vectors: a query, a key, and a value. So, the has its own query, its own key, and its own value. Similarly, tired has its own query, its own key, and its own value. For cat, we have a query, a key, and a value. And for slept, we have the same.
Now, since we are looking for cat's relevance, we use the query of cat and multiply it with the keys of all the different embeddings. So, we multiply cat's query with the's key and with tired's key, okay?
And also with cat's own key, and with slept's key. Based on this multiplication, a score is created for each of them, representing its relevance. So, multiplying cat's query with the key of the, for example, we get a relevance score of, say, 1. Tired, since it is very relevant, gets a higher score. For example, let's take 88. For cat, since it is cat's own key, we are not going to consider it here.
And for slept, we get a score of, say, 56. Now, all of these are passed into softmax.
If you remember, softmax converts all of the scores into percentages, and the total of the percentages is 100. So, consider that the gets 0.1, tired gets 0.7, and slept gets 0.2, okay? The total here is 1, which is 100%. These are just example numbers, not the exact softmax of 1, 88, and 56. Now that we have a weight for each key, telling us how relevant it is, we multiply each weight with the value of that particular embedding. So, to get the vector which I have to add to cat's embedding to make it more aware, I use 0.1 (10%) of the value of the, okay? Plus 0.7 (70%) of tired's value, okay? Let me write that as TIRV. Plus 0.2 (20%) of slept's value, okay? Which is SV. Whatever I get, I am going to add to cat's embedding to make it more context-aware. So, now you know how this whole thing is calculated, and this is from one head.
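The score, softmax, and weighted-sum steps for a single head can be sketched as follows. All query, key, and value numbers are made up, and cat's own key is left out to match the walkthrough above. One detail not mentioned in the lecture: the original paper divides the scores by the square root of the key dimension before softmax, which this sketch includes.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2-dimension query, keys, and values for one head.
q_cat = [1.0, 2.0]
keys = {"the": [0.1, 0.0], "tired": [1.0, 1.0], "slept": [0.5, 0.6]}
values = {"the": [1.0, 0.0], "tired": [0.0, 4.0], "slept": [2.0, 2.0]}

names = list(keys)
scale = math.sqrt(len(q_cat))                       # scaling from the original paper
scores = [dot(q_cat, keys[name]) / scale for name in names]
weights = softmax(scores)                           # relevance of each token to cat

# Weighted sum of the values: a lot of tired, some slept, very little of the.
update = [
    sum(w * values[name][d] for w, name in zip(weights, names))
    for d in range(len(values["the"]))
]
for name, w in zip(names, weights):
    print(name, round(w, 3))
print([round(x, 3) for x in update])
```

The `update` vector is what this head contributes towards the vector that gets added to cat's embedding.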
We are going to collect this from all 12 heads, combine them back into one 768-dimension vector, and then add it to the embedding.
And once we add that, if you think of it in three dimensions, cat is right now pointing here, and once we add this vector, it will point to a tired, sleepy cat.
Okay? So, this is how you enrich the word embeddings. Once you collect the value that has to be added to the original embedding, it is added and a new embedding is created. So, this is your original embedding, plus the knowledge that you got, to make it richer. Your new embedding points to the tired, sleepy cat. And this is how the embeddings are made rich and context-aware. So, if I scroll up to where we were discussing attention, we have seen input embedding, we have seen layer normalization, and now we know how attention works. The next step that you see is the residual connection, which is where the knowledge that you got is added to your original embedding to make it richer, right? So, this was your original embedding, you added something to it, and this is your richer, context-aware embedding. That is the residual connection, where you add the value to the original embedding.
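The add-to-the-original step is just element-wise addition, which a short sketch makes visible. The three-dimension vectors below are made-up stand-ins for the 3D picture; a real GPT-2 embedding has 768 values.

```python
def add_residual(original, update):
    """Add the attention output to the original embedding, element by element."""
    if len(original) != len(update):
        raise ValueError("vectors must have the same number of dimensions")
    return [o + u for o, u in zip(original, update)]

cat = [0.2, 0.9, -1.2]               # hypothetical generic embedding for "cat"
attention_update = [0.5, -0.4, 0.3]  # hypothetical output of the attention step
enriched = add_residual(cat, attention_update)
print([round(x, 2) for x in enriched])
```

Because the update is added rather than swapped in, the original meaning of cat is kept and only nudged towards the tired, sleepy cat.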
And once you have your rich, context-aware embedding, we again do something called normalization here.
Okay? So, if you remember, normalization is where we scale the numbers so that no single number overpowers or overshadows the others. So we do normalization, and then we pass the result into the feed-forward layer. So, the next thing that we will discuss is the feed-forward layer.
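As a reminder of what normalization does to a single embedding, here is a minimal sketch. It rescales one vector to zero mean and roughly unit variance; the learned scale and shift parameters that real layer normalization also applies are left out, and the input numbers are invented.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one embedding to zero mean and (approximately) unit variance.

    Simplified: the learned gain and bias of a real layer norm are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Made-up embedding where one number (40.0) would dominate the others.
embedding = [2.0, 40.0, -3.0, 1.0]
normed = layer_norm(embedding)
```

After normalization the large entry is still the largest, but all entries sit on a comparable scale.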
During the attention step, if you have noticed, each embedding is looking to its left and right in order to make itself rich and aware. So, its query vector is multiplied with the keys of the other embeddings, and based on the attention scores, the values are weighted, and then the sum of the weighted values is added to the embedding to make it richer and more context-aware.
But, in feed-forward layer, the embeddings are not looking at each other. Rather, they are working on themselves. So, if you remember our example, the tired cat slept.
We have four embeddings.
In attention step, all of the embeddings are looking into each other.
Now, we are talking about encoders here.
So, this is why they are looking both left and right. The only difference in decoders is that they don't look to the right. They only look to the left. That is the only difference. So, in the attention step, they actually look at each other to understand their meaning and become context-aware. Once the embeddings are rich and context-aware, they are then fed into the feed-forward network.
So, this is feed-forward layer.
Now, the feed-forward layer allows them to look within themselves. They do not look at each other now. They look within themselves. To understand this with an analogy: first they discuss among themselves to understand their meaning, and then they do a private reflection.
In this, they try to reduce the noise by suppressing the unwanted signals.
That simply means that they have already understood what is important for them and what is not. So, they are going to suppress the unwanted signal in this.
But they are not looking at each other this time. They are just looking at themselves. So, here the dimension size is 768. We are talking about GPT-2 now.
So, this is expanded by a factor of four, which for GPT-2 means 768 grows to 3,072, okay? Then the unwanted signals are suppressed, and the vector is shrunk back to the same size, which is 768. So, after the feed-forward layer, the output would still be four token embeddings, which are more enriched. Okay?
Because they have done a private reflection. They understand what is important and they have already suppressed the unwanted noise. So, you will have four embeddings, each with 768 dimensions here. So, here the output is more enriched.
So, now you understand what exactly happens in feed forward layer.
After the discussion, it is time for private reflection. Each embedding looks within itself to work out what is unwanted signal and tries to reduce the noise. And whatever you get as the output of the feed-forward layer is the same size as the input, which in our case is four embeddings, each with 768 dimensions. And those are more enriched.
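A minimal sketch of this "private reflection" is below. It expands each embedding to four times its size, applies a nonlinearity, and projects back down. The dimension of 8 is a stand-in for 768, the weights are random placeholders rather than trained values, and GELU is used as the nonlinearity because that is what GPT-2 uses.

```python
import math
import random

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weight, bias):
    """Apply a linear layer; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def feed_forward(x, w_up, b_up, w_down, b_down):
    hidden = [gelu(h) for h in linear(x, w_up, b_up)]  # d_model -> 4 * d_model
    return linear(hidden, w_down, b_down)              # 4 * d_model -> d_model

random.seed(0)
d_model = 8              # stand-in for 768
d_hidden = 4 * d_model   # stand-in for 3072

def rand_matrix(rows, cols):
    # Random placeholder weights; a trained model would have learned values.
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

w_up, b_up = rand_matrix(d_hidden, d_model), [0.0] * d_hidden
w_down, b_down = rand_matrix(d_model, d_hidden), [0.0] * d_model

# Four token embeddings, as in "the tired cat slept".
tokens = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(4)]

# Each token goes through the same layer on its own; no token sees another.
outputs = [feed_forward(t, w_up, b_up, w_down, b_down) for t in tokens]
```

The list comprehension on the last line is the key point: the same function is applied to every token separately, so the output has the same shape as the input (four vectors of `d_model` numbers each).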
Now, we have covered the feed-forward layer. The output of the feed-forward layer is again a delta that has to be added to the embedding, which is again the residual connection. So, you add whatever you get from the feed-forward layer to the embedding that went in. Now, remember that in each step the incoming embedding is never overwritten. We always add the delta to it to create our new enriched embedding. So, if I go back to the diagram where we discussed the complete flow, we have completed one transformer block. And now this transformer block is repeated again and again, n times. In the GPT-2 model we are using as our example (768 dimensions), this is repeated 12 times. And whatever you get out of the 12th transformer block is very rich. And now you understand what happens in each transformer block. So, let's understand the third portion, which is the output head.
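The order of operations inside one block (normalize, attend, add back, normalize, feed forward, add back) can be sketched as follows. The `toy_` functions are hypothetical placeholders standing in for the real learned layers, so only the wiring is meaningful here, not the numbers.

```python
def add(a, b):
    """Element-wise sum of two lists of token vectors (the residual addition)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transformer_block(x, norm, attention, feed_forward):
    # Attention sub-layer: tokens exchange information, the delta is added back.
    x = add(x, attention([norm(t) for t in x]))
    # Feed-forward sub-layer: each token is processed alone, the delta is added back.
    x = add(x, [feed_forward(norm(t)) for t in x])
    return x

# Toy placeholders (hypothetical, not real learned layers).
def toy_norm(t):
    return list(t)

def toy_attention(tokens):
    # Every token receives the average of all tokens as its "context".
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    return [mean for _ in tokens]

def toy_feed_forward(t):
    return [0.5 * v for v in t]

x = [[1.0, 2.0], [3.0, 4.0]]
one_block = transformer_block(x, toy_norm, toy_attention, toy_feed_forward)

# Stack 12 blocks. A real model has separate weights in every block;
# here the same toy functions are reused only to show the loop.
stacked = x
for _ in range(12):
    stacked = transformer_block(stacked, toy_norm, toy_attention, toy_feed_forward)
```

Note that `x` itself is never modified; each step builds a new list from the old embedding plus a delta, which mirrors the point that the update is always added rather than written over the input.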
Once you have the rich vectors, what happens at the output head?
Till now, we have been discussing where the embeddings look both left and right to enrich their meaning.
So, this is the property of encoders.
Now, for decoders, the embeddings can only look to their left to enrich their meaning. That is the only difference.
And this property, where we do not allow it to look to its right, is called causal masking.
We mask the right side so that it cannot look to the right and can only look to the left to predict what the next token would be. So, consider this example where we are trying to predict the next token. We have input the text "the cat sat on the", and we are trying to predict the token that comes next.
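One common way to implement the masking just described is to set the scores for all positions to the right to negative infinity before softmax, so they end up with a weight of exactly zero. The sketch below assumes a small square matrix of made-up attention scores, where row i holds the scores of token i against every token.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask every position to the right of the current token, then softmax each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Made-up raw scores for a 3-token input.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.3, 4.0],
    [0.2, 0.1, 0.9],
]
weights = causal_attention_weights(scores)
```

The first token can only attend to itself, so its row becomes `[1.0, 0.0, 0.0]`, while the last token still sees everything to its left.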
Now, we have done the embedding for all of the tokens. We have passed them through the 12 transformer blocks. Once all the transformer blocks have run, we get rich embeddings of the same dimension, which is 768, one for each input token (five for this prompt, if each word is a single token).
Now, each one of them is very rich. But, the last embedding in decoder actually has the complete meaning in order to predict the next token.
And then this last enriched embedding is taken and multiplied with a weight matrix of size 50,000 × 768.
Now, why 50,000 into 768? Now, if you remember for GPT-2, we have approximately 50,000 vocab, right?
So, this was the vocab size for GPT-2.
So, this is why the weight size is 50,000 × 768, the vocab size by the number of dimensions. Now, the size of the embedding is 768 × 1. So, if we do a matrix multiplication, we get something of size 50,000 × 1. Now, if you notice, this is nothing but a matrix with 50,000 rows and one column. These 50,000 rows represent the vocab, so you have one score stored in each row. And these scores are called logits.
And these scores are nothing but the scores of the vocab tokens. These scores are then passed into softmax.
And once they are passed through, you get something in percentages.
Some of the percentages would be higher, some would be lower. Now, the token with the highest percentage is the one that would be predicted. So, if you remember from the very first discussion, in the last step the LLM predicts the token which has the highest probability, right? So, here the logits are the scores that are calculated like this: whatever enriched vector we got, we multiply it with a weight matrix of the vocab size by the dimension, and the output is a simple matrix with the number of rows equal to the vocab size and only one column.
And each row holds a score, which is passed into softmax to convert it into a probability, so that the numbers total one.
So, the total of all these numbers would be one. And whichever token has the highest probability here would be the output token. So, now you understand how transformers work, from the very input to the output.
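The output head can be sketched with a toy vocabulary. The five-token vocabulary, the three-dimensional vector and every number in the weight matrix below are invented for illustration; a GPT-2 sized model would use roughly 50,000 rows and 768 columns instead.

```python
import math

# Hypothetical 5-token vocabulary (stand-in for ~50,000 tokens).
vocab = ["mat", "dog", "sky", "roof", "."]

# Hypothetical output weight matrix, shape [vocab_size][d_model] = 5 x 3.
w_out = [
    [0.9, 0.1, 0.4],
    [0.2, -0.3, 0.1],
    [-0.5, 0.2, 0.0],
    [0.6, 0.0, 0.3],
    [0.1, 0.1, -0.2],
]

# Made-up enriched embedding of the last token (stand-in for 768 dimensions).
last_hidden = [1.0, 0.5, 2.0]

# Matrix multiplication: one score (logit) per vocabulary entry.
logits = [sum(w * h for w, h in zip(row, last_hidden)) for row in w_out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Pick the token whose probability is highest.
next_token = vocab[probs.index(max(probs))]
```

With these particular numbers the highest logit belongs to "mat", so that is the token selected.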
We feed in text, which is converted into tokens and then into token IDs. Then a lookup on the embedding matrix converts them into embeddings, which carry a generic meaning and are then passed into the transformer blocks to be enriched.
Now, each transformer block consists of attention and a feed-forward layer, with a few other layers in between, and those transformer blocks are stacked on top of each other multiple times. The output embedding that we get is richer and carries the meaning it has to represent in this context. And in order to generate our next token, we take the last embedding and multiply it with a weight matrix whose size is the vocab size by the dimension.
Then we get something called logits, which has as many rows as the vocab size and one column, where each number is a score. These are passed into softmax to convert them into probabilities, and the row with the highest probability is your output, the next token. So, this is the complete picture of how transformers work.
We have taken examples of the encoder to make you understand, but you just have to keep one thing in mind for decoders: we mask the right side, so each embedding only looks to the left to enrich itself and predict what the next token would be.
From the very beginning, we already know that in order to generate a complete statement, we have to run everything in a loop, where you generate the next token, append it, and feed the result back in as input in order to get the token after that.
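This generate, append, repeat loop looks roughly like the sketch below. The function `toy_next_token` is a hypothetical stand-in for the whole Transformer plus output head; it just looks up the last token in a tiny hand-written table, and `<eos>` is an assumed stop marker.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for Transformer + output head."""
    table = {"the": "mat", "mat": "."}
    return table.get(tokens[-1], "<eos>")

def generate(prompt_tokens, next_token_fn, max_new_tokens, stop_token="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)   # predict from everything generated so far
        if nxt == stop_token:
            break
        tokens.append(nxt)            # append, then feed the longer input back in
    return tokens

result = generate(["the", "cat", "sat", "on", "the"], toy_next_token, max_new_tokens=10)
```

Here `result` ends up as the prompt followed by "mat" and ".", after which the toy model returns the stop marker and the loop ends.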
For this example, I have "the cat sat on the". We put this into the transformer and the output head, and it generates the next token, which is "mat". Again, this "mat" is appended at the end, and then we feed this in as input again and get another token, which is a full stop. So, this loop goes on. Now, if you remember the Transformer block, each embedding is enriched, right? Consider a case where you want to generate a text of 500 words.
Now, before generating the 500th word, it has to generate 499 words. And every time, it has to append and pass everything through the Transformer plus the output head in order to get the next token.
Since the previous tokens that are already generated are not going to change, there is no point in wasting computation on processing these 499 tokens again. We only have to compute the next token. Correct? So, in order to avoid that, something called the KV cache is added.
KV simply means key-value. What it does is, once a token has been processed, it stores that token's keys and values from the attention step. So, it does not have to redo the computation for the 499 earlier tokens. It only has to do the computation for the 500th word.
And this is why you see the first token takes time to generate. And once first token is generated, the next tokens are streamed pretty quickly. Because it only has to predict the next token, there is no computation required for the previous tokens. Those are all collected from the KV cache.
So, the delay to this first token is known as TTFT, which stands for time to first token.
And this KV cache allows everything to go faster. And this is why all the frontier models take time to generate the first token, and all other tokens are streamed pretty quickly.
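A minimal single-head sketch of the idea is below. Each new token is projected into a query, key and value once; the key and value are appended to the cache, and attention for the new token runs against everything stored so far. The `toy_project` function is a hypothetical placeholder for the learned projections, and a real model keeps one such cache per layer and per head.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

projection_calls = 0

def toy_project(x):
    """Hypothetical stand-in for the learned query/key/value projections."""
    global projection_calls
    projection_calls += 1
    return list(x), [2.0 * a for a in x], [a + 1.0 for a in x]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

def step_with_cache(x, cache):
    q, k, v = toy_project(x)   # only the NEW token is projected
    cache.keys.append(k)       # earlier keys and values are reused as-is
    cache.values.append(v)
    return attend(q, cache.keys, cache.values)

# Made-up token vectors, fed in one at a time as they would be during generation.
tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
cache = KVCache()
outputs = [step_with_cache(t, cache) for t in tokens]
```

After three steps `projection_calls` is 3, one per token. Without the cache, every step would have to project all earlier tokens again, so the work per step would grow with the length of the text.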
And now, if I come back to the whole workflow that we discussed from the beginning till end, you understand each piece of it. You understand what is tokenization and embedding, how Transformer block works, what are output heads, and how the next token is generated. So, let me go back to the paper from where we started, which is Attention is All You Need. Now, on the left-hand side, whatever you see, is encoder.
And on the right hand side, whatever you see, is decoder.
Since this is the complete transformer architecture, you see both pieces here, the encoder and the decoder. Now, notice that the output of the encoder is fed as an input into the decoder. If you only talk about the decoder, this right piece is the only thing that we have to look at.
And GPT implements only this right piece, which is the decoder.
If you look at all of the pieces on the right side, you would now be able to understand what all of them do.
Now, here in the bottom, it says output embedding. It is because whatever is the output from the decoder, we again put this as input, right? So, this is why it says whatever was the output embedding, that would again go as input. We add something to the input embedding, which is positional encoding. Once that is done, we do a masked multi-head attention. Now, if you remember, for decoder, we mask everything on the right hand side to make sure that it only looks on the left hand side. So, this is why this is masked multi-head attention.
We do addition of the values to the actual embedding. We do normalization.
Now, you again see a piece of multi-head attention and normalization, because this one takes its input from the encoder. Here is the feed-forward layer, where each embedding does a private reflection.
And then again, addition and normalization. And once everything is done, we convert them into logits, and then we do a softmax in order to get the output probability. Whatever is the output probability, based on that, we select our next token.
Now, if you remember temperature, top P from our previous discussion, those are applied at the output probabilities to change or to add randomness into the next token prediction.
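As a rough sketch of how these two settings add randomness, the code below divides the logits by the temperature before softmax (a common way to apply it), keeps only the most likely tokens whose probabilities add up to at least `top_p`, and then samples from what is left. The function names and the example logits are illustrative assumptions, not taken from any particular library.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution, higher temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    probs = softmax_with_temperature(logits, temperature)
    filtered = top_p_filter(probs, top_p)
    indices = list(filtered)
    return rng.choices(indices, weights=[filtered[i] for i in indices], k=1)[0]

# Made-up logits for a 3-token vocabulary.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
token_index = sample(logits, temperature=0.8, top_p=0.9)
```

With a very small `top_p` only the single most likely token survives the filter, so sampling behaves like always picking the highest probability.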
So, this is all about transformers. I tried to make it as simple as possible.
I would request you to go through this video once more. I'm going to add some of the references in the description.
You can go ahead and look at the references to understand Transformers more clearly.
I hope now you are not scared of this diagram and you would be able to understand how Transformers work and how LLMs understand whatever we input as text and generate the next token and repeat this cycle again and again to generate a complete statement.
This was all for today and now we have completed the complete understanding of how LLMs work. In our next video, we are going to talk about prompt engineering.
Till then, keep learning, keep growing and keep sharing.