Large Language Models (LLMs) are trained using next-token prediction, where the model learns to predict the next word in a sequence given previous tokens. The transformer architecture processes tokens through multiple layers containing attention mechanisms (which allow tokens to attend to previous tokens) and MLP layers (which process and transform information). Key components include token embeddings (vectors representing each token), RMSNorm for numerical stability, RoPE for positional encoding, multi-head attention for parallel processing, and causal masking to prevent future token leakage. The model generates logits for all possible tokens, which are converted to probabilities via softmax, and trained using cross-entropy loss to minimize prediction error.
Deep Dive
LLM Transformer Explained From Scratch - Beginner Course
Let's learn transformer theory from scratch. We're going to focus on the modern decoder-only, GPT-style LLM.
We have some input, "the cat sat on the", and our transformer needs to predict "mat", the next word. During training the model learns at every position: given "the" it needs to predict "cat", and given "the cat" it needs to predict "sat". So during training it gets this whole sentence, but we reuse it multiple times, once for each prefix. This is how we get a lot more out of the data we have.
And by trying to predict the next word during training, the LLM learns general patterns about language.
This is how it's able to learn to write the way ChatGPT does. Usually X is the current tokens that we have, one slice of the data, and Y is the next token that needs to be predicted.
If we split this example, we have "the cat" and try to predict "sat". That's X, that's Y: from X, predict Y. Usually we will have multiple independent conversations or texts, called a batch, and each will have some number of tokens. For now, let's consider tokens to be just words.
So we have multiple conversations, each with some number of tokens: those are the batch and token dimensions. Each token is then represented by an array of numbers, a vector. There are maybe 200,000 or 300,000 possible tokens depending on the model; ChatGPT and Claude, for example, may have different vocabulary sizes. And in each token's array of numbers, each number means something.
It can be like how green something is, how alive something is, how heavy something is. But these numbers are learned by the neural network; they're not determined by humans. So imagine the word "cat" has some array of numbers, maybe 7,000 numbers. Each position in that array is a dimension, and a given dimension represents the same thing for every possible token. So the first number represents the same property for every single token.
It can be like how much fur something has or how green something is. (In practice individual dimensions are rarely this cleanly interpretable; this is just the intuition.)
So we have some number of independent conversations, each conversation has some number of tokens, and each token is represented by a vector. When the LLM generates tokens, it has these blocks, or layers. The tokens first go through the first block, then the second, the third, and so on, N times. The blocks may or may not all be the same.
For example, the first blocks may use dense, computation-heavy attention, while later blocks may use a lighter, faster computation. Maybe the early blocks learn the semantic meaning of the words and the later blocks try to work out what comes next, for example. But this is all learned by the neural network.
As humans, we can try to understand what neural networks learn, but we don't determine how these blocks behave. We determine their architecture, but we don't determine what features they learn, and we don't really know what they learn.
After all of these blocks, the shape is the same: (batch, tokens, d_model) going in stays (batch, tokens, d_model) coming out.
So this is the whole workflow, which I hadn't spelled out yet. In the beginning we have some input text. All of the text gets converted into token embeddings: each word is replaced by its corresponding embedding vector. Then it's passed through the transformer blocks, maybe 20 or 30 of them, and in the latest large models maybe 50 to 100 or even more. At the end, after all of that processing, there is a normalization. This is just to make the numbers in the embedding vectors more stable. If we have an array with the numbers 1, 0.5, 0.3, then 10,000, then 1 and 2, that 10,000 will cause issues later, especially in the softmax. So we want to bring the vector into a controlled range. RMSNorm also has learnable parameters that can rescale each dimension of these token vectors, making them more uniform or even more spiky.
So the norm just keeps the numbers stable, and all of the blocks are there to process context, mix it, and understand what's going on. Then the LM head generates the next token. The LM head converts the processed representation of our text into a score (a logit) for every possible token in the vocabulary, maybe 300,000 of them, and softmax turns those scores into a probability distribution over the vocabulary.
Then we can pick from this probability distribution in any way we like. We cover all of this in even more detail in the school community: neural networks, math, PyTorch, transformers, automated AI research. You can join the school community below the video. We have daily posts, like five research ideas, and everybody learns to become an AI researcher together.
Now the notation. B is the batch size. T is the context length, the number of tokens in the current conversation.
D model (d_model) is the hidden size: the size of the vector that represents a single token. It's also the width of the transformer, because this vector is the main thing that gets processed. Hidden size, model size, and embedding size all refer to the same number. Vocab size is the number of possible token IDs, the entire vocabulary of all possible tokens.
So let's understand the residual stream.
It's just a stream: the token embeddings start in the first block, get processed, and flow through all of the blocks.
Each token's vector is the stream flowing through these blocks. A copy of it is split off and processed by the attention mechanism or the feed-forward network, but the unprocessed version always keeps flowing, and the processed result is then added back onto it. So at the end of each step we have a slightly more processed vector.
So you have a stream; it splits, gets processed, and the result is added back to the stream.
That's why the stream is modified gradually.
In the beginning X is just conversations, each conversation has tokens, and each token is represented by a vector. So at the start it's just token embeddings, but after every block it becomes more contextual.
It takes in more context from the tokens around it.
So let's take a look at the transformer block. This is one of those blocks. It has attention and an MLP, which is usually a SwiGLU MLP. This is the input X I was mentioning. X is processed by attention, but an unprocessed copy of X also skips around attention, and the two are added together.
Next, the same thing happens with the MLP: the stream splits, one copy goes through the MLP, and the result is added to the unprocessed copy.
When they're added, the stream receives the result of the attention or SwiGLU MLP processing. That's how it flows: unprocessed plus processed, first for attention and then again for the MLP.
So the X that enters the block, the X on the skip path, and the X that attention's output is added to are all the same X.
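The split-process-add flow described above can be sketched in a few lines of plain Python. This is only an illustration of the residual structure for a single token vector: `toy_attention` and `toy_mlp` are hypothetical placeholders, not real attention or SwiGLU, and the pre-norm placement of `rms_norm` is an assumption based on the description of normalization being applied before each sublayer.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (no learned weight here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def toy_attention(x):
    # Placeholder only: real attention mixes information across tokens.
    return [0.1 * v for v in x]

def toy_mlp(x):
    # Placeholder only: a real SwiGLU MLP transforms each token's vector.
    return [0.05 * v for v in x]

def transformer_block(x):
    """Each branch works on a normalized copy; its result is added back to the stream."""
    x = [a + b for a, b in zip(x, toy_attention(rms_norm(x)))]  # x + attention(norm(x))
    x = [a + b for a, b in zip(x, toy_mlp(rms_norm(x)))]        # x + mlp(norm(x))
    return x

x = [1.0, -2.0, 0.5, 3.0]
out = transformer_block(x)
print(len(x), len(out))  # the shape going in and coming out is the same
```

Because each branch only adds to `x`, the original information is never overwritten; it is nudged a little by every block.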
RMSNorm just stabilizes the numbers, as I said, if you have big outliers or a bad distribution. Attention then distributes information: every token gets information from the previous tokens.
For example, "sat" gets the information that it was the cat that was sitting. So "sat" refers to "cat".
It doesn't refer to "dog". Attention mixes in the information from all of the previous tokens.
But "sat" will not look at the next token.
It only looks at the previous tokens, and the same goes for every other token.
After attention, the MLP processes the result. Attention just adds context; it doesn't process the token in any other way. The MLP then processes it, maybe changes it a bit, and adds more information, such as facts or knowledge about the world. So imagine attention as mixing information and the MLP as processing that mix and adding more facts. Attention and MLP both happen in every block, so each block has one attention and one MLP. Maybe the whole thing should be called a transformer layer, with attention and MLP as the blocks inside it, but that might be confusing. So let's say one transformer block, or one transformer layer, has both attention and MLP, one of each, and the processed information then goes into the next block's attention.
After each block, all of the tokens have more context from previous tokens and more information processed by the MLP. And when you have X plus something new, it adds something new to X. This type of attention, where tokens can only look at previous tokens, is called causal attention. Tokens can look left, at what comes before them, and at themselves, but not at future tokens. So with "the cat sat", "the" can see "the"; "cat" can see "the cat"; and "sat" can see all three, "the cat sat".
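The "who can see whom" rule of causal attention can be written down as a small lower-triangular mask. This is a minimal sketch; the helper name `causal_mask` is just for illustration.

```python
def causal_mask(num_tokens):
    """mask[i][j] is True when token i may attend to token j (itself or earlier)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]

tokens = ["the", "cat", "sat"]
mask = causal_mask(len(tokens))

for i, tok in enumerate(tokens):
    visible = [tokens[j] for j in range(len(tokens)) if mask[i][j]]
    print(f"{tok!r} can see {visible}")
# 'the' can see ['the']
# 'cat' can see ['the', 'cat']
# 'sat' can see ['the', 'cat', 'sat']
```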
Now if we just put "the cat sat" into the transformer, attention by itself has no way of knowing the order of the words. It will process them, but it doesn't know their order. That's why we use RoPE: we change these vectors a little bit to indicate their position.
These are all vectors, and this is the first position, the second, the third. Originally, to indicate position, we would add a position embedding to each token embedding.
But it's better to use RoPE (rotary position embedding), a newer method that works by rotating vectors. We split each vector into pairs of numbers, two numbers, two numbers, two numbers across the whole vector, and rotate each pair a little bit. The idea is that later pairs in the vector rotate more or less than earlier ones, and tokens at later positions are also rotated more or less, depending on the design. The transformer can then work out positions from how much the pairs are rotated. So the token embedding describes the content of the token, and RoPE helps the attention mechanism know where the tokens are relative to each other.
It's relative: this token is two positions away from that token, and that one is two positions away from the next. And RMSNorm is a highly performant, lightweight normalization layer.
In newer models, and in my own experiments, I see RMSNorm applied both before and after these sublayers. Adding an extra RMSNorm around them performs better in my experience, especially in smaller language models, and I see that Gemma 4 did it as well; Google did it for Gemma 4. So this is worth investigating.
RMSNorm makes the vector's scale easier for the next layer to work with.
It gets rid of huge numbers and very tiny numbers that cause issues, because they may not be representable in FP16 or FP8, or because multiplying large numbers gives even larger numbers, and so on.
Okay, and then the MLP. After the attention mechanism we have the multi-layer perceptron, the feed-forward network that processes the vector and adds more information. Usually it's not a simple ReLU MLP; usually it's the more complex SwiGLU, which has a gate in addition to the usual projection and activation. Some people have experimented with a very simple ReLU or even ReLU squared, which is very good for small neural networks, but in both small and big LLMs I've mostly seen people use SwiGLU. In SwiGLU we project the input two times: one projection creates the value, and the other creates the gate, which decides how much of that value is passed on. You can think of the gate as roughly a number between zero and one that scales down the value. (Strictly, the gate goes through the SiLU activation, so it isn't limited to that range.) The gate controls which values pass through, and different dimensions of the vector can be multiplied by different gate values.
So the transformer learns not only which value to create but how much of it is needed. Remember: attention mixes information from previous tokens, and the SwiGLU MLP processes and transforms this mixed information and maybe adds more knowledge.
Now, training versus generation. In training you take the whole input sequence and shift it by one: "the cat sat" becomes "cat sat on". So "the" predicts "cat", "the cat" predicts "sat", and "the cat sat" predicts "on". You see how it's just shifted by one, and that's how you can easily get a prediction target at every position.
But in generation, the model just generates the next word. It doesn't need to predict at every position.
Generation takes the whole context and gets the next token. This is after the model is trained, when we are running inference.
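To make the generation side concrete, here is a minimal sketch of a greedy generation loop. The function `toy_next_token_logits` is a hypothetical hand-written stand-in for a trained model (it simply hard-codes which word should follow), and picking the highest-scoring token each step is only one of many possible ways to choose from the distribution.

```python
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_next_token_logits(tokens):
    """Hypothetical stand-in for a trained model: one score per vocabulary word."""
    last = tokens[-1]
    if last == "the":
        target = "mat" if "on" in tokens else "cat"
    else:
        target = {"cat": "sat", "sat": "on", "on": "the"}.get(last, "the")
    return [5.0 if word == target else 0.0 for word in VOCAB]

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(tokens)  # model looks at the whole context
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        tokens.append(VOCAB[next_id])  # the new token becomes part of the context
    return tokens

generated = generate(["the", "cat"], max_new_tokens=4)
print(" ".join(generated))  # the cat sat on the mat
```

Unlike training, only the prediction at the last position is used at each step; the loop then feeds the extended sequence back in.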
So that's the checklist for this first lesson. If you are in the school community, you can post what you did here.
Make a post and report your progress.
Let's go to the next lesson. The first thing in lesson two is to understand that there is a pad token at the beginning of the sequence.
Imagine that every token has its own index, so we map tokens to indices. Take "the cat sat on the mat": "the" is number one.
We replace these tokens with their actual indices.
In this case that's 1 2 3 4 1 5.
"The" is 1 both times. This is what happens as the first step. Then, for the X inputs and Y targets, we shift by one. Given these tokens, we set X to be everything except the last token and Y to be everything except the first. So given this, we predict this; given this, we predict this. You have some tokens, and you predict the next token.
That's the whole training idea.
That's what you optimize for. That's what you train for.
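As a small worked example of the index mapping and the shift by one, the sketch below builds a toy vocabulary from the sentence itself. Starting the indices at 1 and leaving 0 free for a special token is an assumption made here to match the numbers quoted above.

```python
words = "the cat sat on the mat".split()

# Assign each new word the next free index, starting at 1 (0 left free by assumption).
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab) + 1

ids = [vocab[word] for word in words]
print(ids)  # [1, 2, 3, 4, 1, 5]

x = ids[:-1]  # inputs: everything except the last token
y = ids[1:]   # targets: the same sequence shifted left by one

for i in range(len(x)):
    print(f"given {x[: i + 1]} predict {y[i]}")
```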
Context length is how many tokens the LLM sees at once.
If you have context length four, it sees four tokens.
To predict the next token for each of them, you need five tokens of data: to train, you need one more token at the end.
LLMs have a maximum context length, maybe 200,000 or 1 million tokens. That's the maximum number of tokens they can fit; they cannot process more.
There can be fewer than that but not more. Also, we don't train on one sequence at a time; we train on multiple at once. Here you have a batch of two conversations, the first predicting these outputs and the second predicting those outputs.
So this is a batch of two independent conversations.
Here the shape of X is (2, 4): two conversations, each with four token indices. Y is also (2, 4).
This is example code for how you can implement this. Starting from some index, it picks a block of context-length tokens; that's X. For Y it shifts by one token: from index + 1 up to index + context length + 1. So that's X and that's Y at an arbitrary position. Then, in the model, once we have these conversations of token indices, we swap each index for its embedding vector.
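The code shown in the video is not part of this transcript. A minimal sketch of the same idea in plain Python might look like the following; the function name `get_batch`, its parameters, and the use of `random.Random` are assumptions for illustration.

```python
import random

def get_batch(data, batch_size, context_length, rng):
    """Sample random windows of token ids; y is x shifted by one position."""
    xs, ys = [], []
    for _ in range(batch_size):
        # The latest valid start still leaves room for the one extra target token.
        start = rng.randrange(len(data) - context_length)
        xs.append(data[start : start + context_length])
        ys.append(data[start + 1 : start + context_length + 1])
    return xs, ys

data = [1, 2, 3, 4, 1, 5, 2, 3, 4, 1]  # a toy stream of token ids
xb, yb = get_batch(data, batch_size=2, context_length=4, rng=random.Random(0))

print(len(xb), len(xb[0]))  # 2 4  -> X has shape (2, 4)
print(len(yb), len(yb[0]))  # 2 4  -> Y has shape (2, 4)
```

In a real training setup the rows of `xb` would then be looked up in an embedding table, turning the (2, 4) grid of indices into a (2, 4, d_model) grid of vectors.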
Your LLM will have some configuration.
This is an example configuration. Vocab size is how many possible tokens there are, maybe 200,000 or 300,000 in large models.
Context length is the maximum number of tokens.
The model dimension is the dimension of the vector that represents each token, the embedding vector.
Number of layers is the number of transformer blocks that we mentioned earlier. Each attention block is divided into heads, which I haven't mentioned yet.
We can split attention into heads, and each head learns to pay attention to different things, but I'll talk about that later. The hidden multiplier applies inside the feed-forward network, the MLP: you have the input, then a hidden layer with four times more neurons, then an output that goes back to the model size. This is where all of the facts and processing live.
Usually it's 4 times or 3.5 times.
Here it's four times the model dimension, the token embedding dimension.
It's written here as well.
The model dimension needs to be divisible by the number of heads. In this case the vector is split into four parts, four heads, each with eight dimensions, eight numbers. Each head works independently and learns different things.
There are some optimizations on top of this, but we'll talk about those in different videos; it's an advanced topic. So you need to assert that the model dimension is divisible by the number of heads. When you go through the transformer blocks, each of which contains attention and an MLP, the shape in and out of the block is always the same. The shape may change inside the block, for example expanding in the MLP, but in and out it's always the same.
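The configuration shown in the video is not included in the transcript, so here is a hedged sketch of what such a config could look like. The class name `ModelConfig`, the field names, and the small `vocab_size` and `n_layers` defaults are all illustrative assumptions; only the 32-dimensional model, four heads, and 4x hidden multiplier come from the explanation above.

```python
class ModelConfig:
    """Hypothetical hyperparameter container for a tiny GPT-style model."""

    def __init__(self, vocab_size=6, context_length=4, d_model=32,
                 n_layers=2, n_heads=4, hidden_mult=4):
        # Same check as the assert mentioned above, raised as an explicit error.
        if d_model % n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        self.vocab_size = vocab_size          # number of possible token ids
        self.context_length = context_length  # maximum tokens seen at once
        self.d_model = d_model                # size of each token vector
        self.n_layers = n_layers              # number of transformer blocks
        self.n_heads = n_heads                # attention heads per block
        self.hidden_mult = hidden_mult        # MLP expansion factor

    @property
    def head_dim(self):
        return self.d_model // self.n_heads

    @property
    def mlp_hidden(self):
        return self.hidden_mult * self.d_model

cfg = ModelConfig()
print(cfg.head_dim)    # 8
print(cfg.mlp_hidden)  # 128
```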
Now let's look at the attention mechanism.
How does a token take information from other tokens? In "the cat sat", how does the token "sat" take information from the token "cat"? It uses queries, keys, and values.
Just remember that each token has a query, a key, and a value. The query is what information I'm looking for. The key is a description of the information I contain. The value is the actual information the token gives. So remember: the description of the information and the information itself are different things.
Some people are experimenting with combining them, maybe DeepSeek or some other new architectures, but for now they're separate, and they have been since the beginning.
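To show how queries, keys, and values interact, here is a minimal single-head sketch in plain Python. It assumes the standard scaled dot-product formulation (scores divided by the square root of the head dimension, then softmax), which the transcript does not spell out; the input numbers are made up.

```python
import math

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """One vector per token for a single head; token i only sees tokens 0..i."""
    head_dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Match this token's query against the keys of itself and earlier tokens.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted mix of the corresponding values.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attended = causal_attention(q, k, v)
print(attended[0])  # the first token can only see itself: [1.0, 2.0]
```

Changing the value of a later token never affects the outputs of earlier tokens, which is exactly the causal property.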
We will talk more about query, key, and value, but first the shapes. We start from batch size (the number of conversations), each conversation having some tokens, each token represented by its embedding vector of size d_model. We split d_model into heads and head dimensions. You'll see that we also swap the token and head axes, because we want the heads to be completely independent of each other.
That's why we swap.
In our case we have two conversations, four attention heads, four tokens, and eight dimensions per head. RoPE rotations are applied only to the query and key. We don't need to apply them to the value, because position matters where queries are searching for keys. As I said, the SwiGLU MLP has a 4x multiplier. The model dimension is 32 and the MLP hidden dimension is 128, so it goes from 32 to 128 and back to 32. That's the MLP layer: expand 4x, then project back to the same size. At the end of the MLP, the output is added to the residual, to the same tokens.
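The 32 to 128 to 32 SwiGLU MLP can be sketched for a single token vector as follows. The weight names (`W_gate`, `W_value`, `W_out`) and the random initialization are assumptions for illustration; the gate uses the SiLU activation, as in the usual SwiGLU formulation.

```python
import math
import random

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_mlp(x, W_gate, W_value, W_out):
    gate = [silu(g) for g in matvec(W_gate, x)]    # how much of each value to let through
    value = matvec(W_value, x)                     # the candidate information
    hidden = [g * v for g, v in zip(gate, value)]  # gate applied per hidden dimension
    return matvec(W_out, hidden)                   # project back down to d_model

rng = random.Random(0)
d_model, hidden_dim = 32, 128

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W_gate = rand_matrix(hidden_dim, d_model)
W_value = rand_matrix(hidden_dim, d_model)
W_out = rand_matrix(d_model, hidden_dim)

x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
mlp_out = swiglu_mlp(x, W_gate, W_value, W_out)
x_next = [a + b for a, b in zip(x, mlp_out)]  # add the result back to the residual

print(len(x), len(mlp_out), len(x_next))  # 32 32 32
```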
This is how we can keep existing information and also add and process new information.
Let's go to lesson five, next-token cross-entropy loss. Basically we want to measure how much probability the model assigned to the correct token ID. We already know what the correct next token is. If the model assigned a lot of probability to it, the loss is low; if it assigned little probability to it, the loss is high. For that we use the negative log-likelihood formula, loss = -log(p), where p is the probability assigned to the correct token. It turns a big probability into a small loss, because a high probability should give a small error, and a small probability into a big loss.
This is an example of a token generation script for after the model is trained. It just generates the next tokens.
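Going back to the negative log-likelihood loss described above, a small numeric sketch shows how a high probability on the correct token gives a low loss and a low probability gives a high loss. The logits here are made-up numbers for a four-token vocabulary.

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log of the probability assigned to the correct token."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

logits = [2.0, 0.5, -1.0, 0.0]  # model scores for 4 possible tokens

confident_loss = cross_entropy(logits, target_id=0)  # correct token got the highest score
wrong_loss = cross_entropy(logits, target_id=2)      # correct token got the lowest score

print(round(confident_loss, 3), round(wrong_loss, 3))
```

With all logits equal, every token gets probability 1/4 and the loss is log(4), which is a useful sanity check for an untrained model.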
It's in the next lesson. RMSNorm: this is the normalization layer that we talked about. It's applied to X before it goes through attention and before it goes through the MLP, although as I mentioned, it's now also being applied after attention and after the MLP. The RMS is computed by squaring the vector's entries, taking the mean, and taking the square root: RMS(x) = sqrt(mean(x^2)). Once we have the RMS, we divide X by it, and the output is the normalized vector times a weight. This is a learned weight that can rescale the normalized vector.
So this just gives the model more tools to stabilize the numbers.
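The RMSNorm computation described above fits in a few lines. This is a minimal sketch for one vector in plain Python; the small `eps` term is a common numerical-safety addition and is an assumption here, and the weight is set to all ones rather than learned.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then scale each dimension by a weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

x = [1.0, 0.5, 0.3, 10000.0, 1.0, 2.0]  # the outlier example from earlier
weight = [1.0] * len(x)                 # would be learned during training

normed = rms_norm(x, weight)
print([round(v, 4) for v in normed])
```

After normalization the 10,000 outlier is still the largest entry, but the whole vector now has a root-mean-square of about 1, so later layers see numbers in a predictable range.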
This is an example implementation of RMSNorm.
Queries, keys, and values. Each of them is generated by projecting the token's embedding vector through a linear layer: one linear layer generates the query, another the key, and another the value. Each projection has its own separate weights, and those weights are the same for every token. So one set of weights converts every token into its query, a different set produces the keys, and a third produces the values.
The query is what this token is looking for, the key is what this token contains, and the value is the actual information. So: the description of the information, and the information.
Let's see RoPE. Imagine we have an embedding vector. We split it into pairs of dimensions, treat each pair as a 2D vector, and rotate each one. Depending on the pair's position within the vector, the rotation gets a bit bigger or smaller each time, depending on the design. The amount of rotation also depends on the token's position in the context. This is the formula for the rotation frequency. It's a bit complex. This is what we apply to rotate.
This is the piece of code that applies RoPE. It's a bit interesting.
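The RoPE code from the video is not in the transcript. A minimal sketch, assuming the common formulation where pair number i is rotated by the angle position * base^(-2i/d) with base 10000 and adjacent entries forming a pair, could look like this; some implementations pair dimensions differently.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of entries by an angle that depends on position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # later pairs rotate more slowly
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # standard 2D rotation
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same distance between query and key (2 positions), different absolute positions.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True: the score depends only on relative position
```

This is the sense in which RoPE tells attention where tokens are relative to each other: rotating both the query and the key leaves their dot product dependent only on the distance between them.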
It's short, but it takes some time to understand. All of the lessons after this are just more detail on the same things: RMSNorm, the transformer block, the MLP, attention, multi-head attention, causal attention. You can go ahead and check them out. This is the 14th lesson, the full tiny GPT model.
This is the initialization.
You see it has the config, which holds the hyperparameters, then the token embeddings, the layers, the final RMSNorm, and the final output head that generates the probability distribution over all possible tokens.
This is an example of the forward pass.
You may also tie the weights of the LM head to the token embedding weights. Sometimes this works better.
This is the training loop. At each step you do the model's forward pass, compute the loss, take the optimizer step, and then log. You also want to checkpoint your training, so if something happens you can just continue and don't need to start over.
Now the KV cache. Without it, for every next token you would need to regenerate the keys and values of every previous token, every time. The KV cache stores all of the previous keys and values, and as you generate the next token it just reuses them. This is why cached input is a lot cheaper in LLMs like GPT and Anthropic's models.
Join the school community if you want to write a research paper this week. We have a research paper challenge inside, we're going to help you out, and there are also community posts and courses that you can watch. So join if you want to become an AI researcher and write your research papers. See you in the school community.