Masked self-attention is a mechanism that enables decoder-only transformer models like GPT to train in parallel while preventing the model from 'cheating' by looking at future tokens during training. During inference, models generate text one token at a time autoregressively, but during training, processing the entire sequence simultaneously would allow tokens to see future words, destroying the model's ability to learn prediction. The solution uses a causal mask: a lower triangular matrix with zeros along the diagonal and below, and negative infinity in the upper triangular region. This mask is applied to attention scores before softmax, ensuring that each token can only attend to itself and previous tokens, while future tokens receive zero attention weight. This mathematical constraint allows parallel training speed while strictly enforcing causality, making it the fundamental mechanism behind all decoder-only LLMs.
Deep Dive
Masked Self-Attention Explained: The Causal Trick Behind Every GPT Model
Here's one of the biggest tricks behind GPT.
During inference, the model generates text one word at a time, but during training, it processes the entire sentence simultaneously.
Now, think about that for a second. If the whole sentence is visible, what stops the model from just looking ahead?
Nothing. And that's a problem.
It's like giving a student the answer sheet during an exam. Sure, they'll pass, but they learned absolutely nothing. The model would never actually learn to predict. It would just cheat.
So, how do we keep the speed of parallel training without letting the model peek at the future?
That single question leads to one of the most important mechanisms in modern AI, masked self-attention. Let's break it down.
To understand why the model generates one token at a time during inference, let's take a look at this Transformer architecture.
It has two parts, the encoder on the left and the decoder on the right. But generative models like ChatGPT and Gemini only use the decoder part of the Transformer. They completely remove the encoder and its cross-attention connections, leaving just the decoder.
This is the key insight. GPT models are decoder-only Transformers. No encoder, no cross-attention, just a decoder stack.
Now, to understand why these models generate one token at a time, let's walk through how the decoder produces tokens during inference.
The decoder kicks off with a special token, BOS, beginning of sequence.
Basically, a signal that says, "Start generating now." This BOS token passes through the decoder block. When it passes out of the decoder block, it gives us a probability distribution over the entire vocabulary, and we simply pick the word with the highest probability.
Now, in the next time step, we feed both the BOS token and the word I. The decoder processes them and generates love.
Next, we add love to the input list. The decoder processes all three tokens together and generates to.
Now, we add to to our sequence, again through the decoder, and we get eat.
Next, we feed the token eat into the input list. The decoder processes them and generates pizza. Finally, we give it the full sequence so far, and this time the decoder generates the EOS token, end of sequence. That means generation is complete.
And from this, we can clearly see that the decoder generates one token at a time during inference. At each step, it takes everything generated so far as input and produces just the next single token. And this is exactly how models like ChatGPT work. These types of models are known as autoregressive models, which generate the next token based on all previous tokens.
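The loop we just walked through can be sketched in a few lines. This is only a toy: `toy_decoder` and the `NEXT_TOKEN` lookup table are hypothetical stand-ins for a real decoder, which would score the whole vocabulary using all previous tokens rather than only the last one.

```python
# Toy stand-in for a decoder: maps the last token to the next one.
# A real model would compute a probability distribution over the whole
# vocabulary from ALL previous tokens and pick the most likely word.
NEXT_TOKEN = {
    "<BOS>": "I",
    "I": "love",
    "love": "to",
    "to": "eat",
    "eat": "pizza",
    "pizza": "<EOS>",
}


def toy_decoder(tokens):
    """Hypothetical decoder step: return the next token for this sequence."""
    return NEXT_TOKEN[tokens[-1]]


def generate(max_steps=10):
    tokens = ["<BOS>"]                    # generation starts from BOS
    for _ in range(max_steps):
        next_token = toy_decoder(tokens)  # feed everything generated so far
        tokens.append(next_token)         # the output becomes part of the input
        if next_token == "<EOS>":         # EOS means generation is complete
            break
    return tokens


generated = generate()
print(generated)
```

Notice that each pass through the loop produces exactly one token, and that token has to exist before the next pass can start. That dependency is why inference is sequential.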
Don't worry about fully understanding the decoder architecture right now. I just wanted to show you how a decoder-only model generates text one token at a time. We will learn how a decoder-only model actually works in our next video.
So, during inference, which we just saw, the model is forced to generate text sequentially, one token at a time. Think about it. Even you and I speak one word at a time. You don't blurt out the entire sentence instantly. Each word comes from what you just said. LLMs work the exact same way when generating text.
But, think about it. If we use the same sequential approach during training, it would be just as slow. We'd have to process each token one by one for millions of training examples. That is simply not viable.
So, instead, during training, we feed the entire sequence all at once and process all tokens in parallel. This is the key to making training fast. In this attention matrix, self-attention computes a score between every single pair of tokens simultaneously. The diagonal represents tokens paying attention to themselves, and the lower triangle represents the model paying attention to valid past context. But, here's the massive problem. Look at this upper triangle of the attention matrix.
When we process a sequence in parallel, the token I can already see love to eat and pizza, words that don't exist yet during generation. The model is straight-up cheating by looking at future tokens.
This is pure data leakage, and it completely destroys the model's ability to learn how to predict the next word.
So, we're stuck. During inference, generation is inherently sequential because each new token depends on the previous ones. But, during training, we want parallel computation for speed. The problem is parallel training leaks future tokens. So, how do we keep the speed of parallel training without letting the model peek at the future?
That is exactly what masked self-attention solves. So, our goal is simple.
When the model is looking at a word like eat, we want to make sure it can only see the words that came before it, not anything that comes after it. And to enforce that, during parallel training, we place a mask, a kind of block, over all the future words so they become completely invisible. Now, once that mask is in place, if we focus on the word eat, it can still look at I, love, and to to understand context, but pizza is gone, hidden behind the mask. It simply does not exist for this step.
This is how we keep the speed of parallel training without allowing any cheating from the future.
Now, let's dive into the mathematics of how this is done.
For this, we will use our same sentence, I love to eat pizza.
Before we can do masked attention, we first need to turn our words into numbers. Each word gets its own vector, a list of numbers that captures what the word roughly means.
These are our token embeddings. Each word now has its own unique set of numbers. We repeat this for every single word in our prompt, from X1 up to X5.
Stacking these vectors row by row produces our input matrix, known as X.
This matrix X is of shape 5 by 4. Five tokens, with each token mapped to a four-dimensional vector.
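Here is a minimal sketch of building that input matrix with NumPy. The embedding values are random placeholders, not the numbers from the video, and the names `vocab` and `embedding_table` are assumptions made for illustration.

```python
import numpy as np

tokens = ["I", "love", "to", "eat", "pizza"]

# Hypothetical tiny vocabulary: each word gets an integer id.
vocab = {tok: idx for idx, tok in enumerate(tokens)}

d_model = 4                            # each token maps to a 4-dimensional vector
rng = np.random.default_rng(0)
# Placeholder embeddings; in a real model these are learned during training.
embedding_table = rng.normal(size=(len(vocab), d_model))

token_ids = np.array([vocab[t] for t in tokens])
X = embedding_table[token_ids]         # stack x1..x5 row by row

print(X.shape)                         # (5, 4)
```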
Now, we take this input matrix X and project it into three distinct spaces.
These are queries, keys, and values.
Each matrix captures a different role that the tokens will play during attention. We use three separate weight matrices, WQ, WK, and WV. These are learned during training, and each one transforms X in a different way.
Let's see how this works.
To build Q, we multiply X by the weight matrix WQ. We multiply each row of X with columns of WQ to calculate the elements of the query matrix.
We repeat the same process for the rest of our sentence to build the complete query matrix.
Same process for keys and values.
Multiply X by WK and WV. Now, we have all three.
The query matrix asks, "What context am I looking for?" The key matrix says, "Here's the context I offer." And the value matrix holds the actual information that we need to move forward.
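The three projections can be sketched like this. The weights here are random placeholders standing in for learned matrices, and the names `W_Q`, `W_K`, and `W_V` are just the spoken WQ, WK, and WV written as identifiers.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 4, 4

X = rng.normal(size=(seq_len, d_model))   # placeholder input matrix (5 x 4)

# Placeholder weights; a real model learns these during training.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: what context am I looking for?
K = X @ W_K   # keys: here's the context I offer
V = X @ W_V   # values: the actual information to pass forward

# One element of Q is a row of X dotted with a column of W_Q.
eat = 3
q_eat_first = np.dot(X[eat], W_Q[:, 0])
print(Q.shape, K.shape, V.shape)
```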
With Q, K, and V ready, the next step is computing how much each token should attend to every other token.
Time to compute attention scores. We multiply Q by the transpose of K. We transpose K so the dimensions line up for matrix multiplication. This gives us a grid of raw scores, one number for every pair of tokens. Let's compute the similarity score for the row representing eat.
By running the same dot product for every other pair of words, we quickly fill up the rest of the score grid.
Here's our complete raw attention map.
Next, we scale these dot products. We divide the attention scores by the square root of the key dimension, DK, which is four in our simplified model, meaning we divide every score by two.
Without scaling, large scores would push softmax into regions where gradients basically vanish. But wait, if you look closely at these scores, you'll notice a critical issue. Our tokens can see into the future. In autoregressive generation, a token should only be able to see itself and the words that came before it. It cannot peek at words that haven't been generated yet. For example, look at the row for eat. It has calculated a positive similarity score with the word pizza. This is look-ahead cheating. To stop this, we apply a causal mask. Let's fix this by applying the mask.
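The score computation up to this point, before any masking, can be sketched as follows. Q and K are random placeholders here, so the numbers will not match the ones on screen, but the shapes and the scaling are the same.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 4

Q = rng.normal(size=(seq_len, d_k))   # placeholder queries
K = rng.normal(size=(seq_len, d_k))   # placeholder keys

raw_scores = Q @ K.T                  # one number for every pair of tokens
scaled_scores = raw_scores / np.sqrt(d_k)   # sqrt(4) = 2, so every score is halved

# The entry for (eat, pizza) is the dot product of eat's query and pizza's key.
eat, pizza = 3, 4
eat_vs_pizza = np.dot(Q[eat], K[pizza])

# Everything strictly above the diagonal is a token scoring a future token.
future_scores = np.triu(scaled_scores, k=1)
print(scaled_scores.shape)
```

Without a mask, `future_scores` is full of non-zero values, which is exactly the leak described above.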
Here's the causal mask. It tells us which tokens are allowed to look at which ones. The causal mask contains zeros along the diagonal and below, and negative infinity in the upper triangular region. Zero means allowed.
Negative infinity means blocked, it's in the future. Applying the mask adds negative infinity to these blocked positions, completely replacing future scores. For row eat, the score of 0.81 with pizza is substituted with negative infinity. Every cell in the upper triangle is now negative infinity. The future is completely shut off.
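Building and applying the mask takes only a couple of lines. This is a sketch with made-up scores; the variable name `allowed` is an assumption chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))   # placeholder scaled scores

# True on the diagonal and below: the present and the past.
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Zero means allowed, negative infinity means blocked.
mask = np.where(allowed, 0.0, -np.inf)

# Adding the mask leaves allowed scores untouched
# and turns every future score into negative infinity.
masked_scores = scores + mask
print(mask)
```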
Now we pass these masked scores through softmax to turn them into probabilities. Softmax exponentiates the values and normalizes them so they sum to one. But, what happens to negative infinity? E to the power of negative infinity is just zero. So, the attention on any future token drops to exactly zero. Look at pizza. Its weight is exactly zero. Eat can't see it at all.
The remaining weights all add up to one.
Eat gives 0.29 to itself, 0.21 to to, and so on.
Now, we apply softmax row by row to get the full attention weights matrix. All future positions are zeroed out. The mask did its job. Now, here's one interesting thing to note. Even after the masking, even though we set half the matrix to negative infinity, the normalizing still gives one. Every single row in this attention weights matrix still sums to exactly one.
Token I only sees itself. Its entire row collapses to 1.00.
Love splits between I and itself. And pizza, the last token, gets to attend to itself and everything before it.
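The row-by-row softmax can be sketched as below. The scores are random placeholders, and subtracting the row maximum is a standard numerical-stability step that is not mentioned in the video; it does not change the result.

```python
import numpy as np


def softmax(scores):
    # Subtracting the row max is a stability trick (an addition of this
    # sketch); the resulting probabilities are the same.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)              # exp(-inf) is exactly 0.0
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))    # placeholder scaled scores
scores[np.triu_indices(seq_len, k=1)] = -np.inf  # block the future

weights = softmax(scores)
row_sums = weights.sum(axis=1)
print(np.round(weights, 2))
```

Every row still sums to one, the first row is entirely 1.00 on itself, and the last row has no zeros at all.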
Finally, we multiply the attention weights by the value matrix to get our output Z.
We do this for the remaining words, giving us our fully contextualized output matrix Z.
This output matrix Z represents our context-aware embeddings. Each row is a representation of a token, now beautifully enriched with information from the words that came before it.
Let's look at the calculation for row eat in detail. We use these weights to blend the value vectors together. Each value vector of our past words gets scaled by its corresponding attention weight. But, future tokens, like pizza, are multiplied by exactly zero. Finally, we add these scaled vectors together to produce the final output vector for eat.
Since future tokens had zero weight, nothing from the future leaked into this vector. And that's how masked attention works. It's what lets the model learn to predict the next token in parallel without ever peeking ahead.
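The blend for the row eat can be written out explicitly. This sketch uses placeholder scores and value vectors, so the numbers differ from the video, but it shows that the future token contributes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["I", "love", "to", "eat", "pizza"]
seq_len, d_v = 5, 4

# Placeholder masked attention weights.
scores = rng.normal(size=(seq_len, seq_len))
scores[np.triu_indices(seq_len, k=1)] = -np.inf
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

V = rng.normal(size=(seq_len, d_v))   # placeholder value vectors
Z = weights @ V                       # context-aware output for every token

# Row "eat" by hand: scale each value vector by its weight, then add them up.
eat, pizza = tokens.index("eat"), tokens.index("pizza")
z_eat = sum(weights[eat, j] * V[j] for j in range(seq_len))

# Changing pizza's value vector cannot change eat's output.
V_changed = V.copy()
V_changed[pizza] += 100.0
z_eat_changed = (weights @ V_changed)[eat]
print(np.round(z_eat, 3))
```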
The result of this entire process is a brand new output vector for each token.
But let's look at what these vectors conceptually represent. We can see that the token I can only see itself having a weight of 1.00.
The word love can look at I and love only, perfectly ignoring all future tokens. Same thing for to, eat, and pizza. Each one can only mix with itself and what came before it. So, by using causal masking, we've perfectly solved the cheating issue. We now have the best of both worlds, high-speed parallel training while strictly ensuring that no token can ever see future tokens.
Take that attention weights matrix and turn it into a heat map. Bright means the token is paying attention, black means zero. Look at the shape you get.
This staircase pattern? That's the fingerprint of every decoder-only generative model. It doesn't matter if the sequence is five tokens or 5,000. The past is always lit up and the future is always pitch black.
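You can reproduce the staircase without a plotting library. This sketch prints `#` where a weight is non-zero and `.` where it is zero; the scores are random placeholders and `causal_weights` is a hypothetical helper name.

```python
import numpy as np


def causal_weights(seq_len, rng):
    """Masked softmax over placeholder scores for a sequence of this length."""
    scores = rng.normal(size=(seq_len, seq_len))
    scores[np.triu_indices(seq_len, k=1)] = -np.inf
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
weights = causal_weights(5, rng)

# "#" = the token is paying attention, "." = exactly zero.
pattern = ["".join("#" if w > 0 else "." for w in row) for row in weights]
print("\n".join(pattern))

# The same shape appears for a longer sequence.
longer = causal_weights(50, rng) > 0
```

For five tokens this prints a triangle that grows by one `#` per row, from `#....` down to `#####`.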
But think about this. If we set the future attention weights to exactly zero, why do we even bother calculating them? Why not just calculate smaller sliced matrices to save compute? It comes down to the GPU. GPUs are great at doing one big uniform operation on a ton of data at once. But ask them to handle variable, jagged, fragmented steps, they slow to a crawl. Keep the matrix uniform, slap a mask on it, and now you're computing attention for all tokens in one single shot. That's the trick that lets us train models with billions of parameters.
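To see that the mask really is a shortcut and not a different computation, here is a sketch comparing the two approaches: one uniform masked matrix versus a loop over jagged, sliced matrices. All inputs are random placeholders. It only checks that the results match; it does not measure GPU speed.

```python
import numpy as np


def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len, d_k = 5, 4
Q = rng.normal(size=(seq_len, d_k))   # placeholder queries
K = rng.normal(size=(seq_len, d_k))   # placeholder keys
V = rng.normal(size=(seq_len, d_k))   # placeholder values

# Option 1: one big uniform operation, then a mask.
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(allowed, Q @ K.T / np.sqrt(d_k), -np.inf)
Z_parallel = softmax(scores) @ V

# Option 2: a separate, differently sized slice for each token.
Z_loop = np.zeros_like(Z_parallel)
for t in range(seq_len):
    past_scores = Q[t] @ K[: t + 1].T / np.sqrt(d_k)   # shape (t + 1,)
    Z_loop[t] = softmax(past_scores) @ V[: t + 1]

print(np.allclose(Z_parallel, Z_loop))
```

Both options give the same output. The masked version just gets there with one matrix multiply instead of a loop of differently shaped ones.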
This causal constraint is what separates the decoder from the encoder. In an encoder, every token can see every other token, past and future. No restrictions.
Full context in both directions. In the decoder, that's completely locked down.
Each token can only look backwards. It cannot peek at words it hasn't generated yet.
Now, let's go through everything from top to bottom in one clean pass without skipping anything.
We start with our same sentence. I love to eat pizza.
Each word is mapped to a vector, which we stack to form our input matrix, X.
We run X through three separate linear projections to get Q, K, and V.
Query asks, "What am I looking for?" Key says, "What context do I offer?" And value holds the actual information. Same weights, same math, no difference here.
To get our scores, we multiply Q by K transpose and divide by root DK, same formula as before.
The result is a 5 by 5 matrix, where the cell in row i and column j tells you how much the i-th token wants to attend to the j-th token. And here's the problem. Look at that upper triangle.
Every value above the diagonal is a token attending to the future. Look, eat scores 0.81 against pizza. That word hasn't been generated yet. Pure look-ahead cheating. We have to kill that entire upper triangle.
Here's our weapon, the lower triangular mask. Zeros along the diagonal and below, that means past and present tokens, totally fine.
But above the diagonal, negative infinity. The future is completely shut out. One matrix, dead simple, and it works perfectly.
Now we apply the mask. We don't just zero out future scores, we slam them with negative infinity.
Why negative infinity and not just zero?
Because zero still survives softmax.
Negative infinity doesn't. Softmax, it exponentiates every score and then normalizes across the row. Each row sums to one. E to the negative infinity is exactly zero, not just approximately zero. So, every masked position gets wiped out completely. No cheating, no gradients, no information from the future.
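A quick sketch makes the difference concrete. The row below is a made-up set of scores for eat; only the 0.81 against pizza comes from the example, the other four numbers are assumptions.

```python
import numpy as np


def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()


# Hypothetical scores for "eat" against I, love, to, eat, pizza.
# Only the last value (0.81) is from the example; the rest are made up.
row = np.array([0.5, -0.3, 0.2, 0.4, 0.81])

zero_masked = row.copy()
zero_masked[4] = 0.0          # "mask" the future with zero
inf_masked = row.copy()
inf_masked[4] = -np.inf       # mask the future with negative infinity

w_zero = softmax(zero_masked)
w_inf = softmax(inf_masked)

print(np.round(w_zero, 3))    # pizza still gets a share, since exp(0) = 1
print(np.round(w_inf, 3))     # pizza gets exactly 0
```

With a zero, pizza even ends up with more weight than love here, because love's score is negative. Only negative infinity removes it completely.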
Look at this probability matrix. Upper triangle, all zeros. The diagonal and below, live attention weights summing to one per row.
Token I attends only to itself. Pizza attends to itself and everything before it.
Causality enforced mathematically.
Last step, we multiply the probability matrix by V. These weights decide how much of each value vector flows into the output.
Since everything in the future is zeroed out, only past and present tokens actually have a say in the output. What comes out? Each token now knows about everything before it, but nothing after it. That's the whole point. No peeking ahead, no cheating. Every token only knows what it should know.
Full pipeline in one shot. X in, Q, K, and V out, scores scaled by root DK, mask applied, softmax kills the future, weighted sum over V, and output out. We train in parallel. We respect causality.
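Here is the whole pipeline as one function, as a minimal single-head sketch under the same assumptions as before: random placeholder inputs and weights, and a hypothetical function name `masked_self_attention`. The check at the end changes the last token and confirms that no earlier output moves.

```python
import numpy as np


def masked_self_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention; returns (output, attention weights)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # X in; Q, K, and V out
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # scores scaled by root d_k
    seq_len = X.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # mask applied
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability only
    weights = np.exp(scores)                     # softmax kills the future
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                  # weighted sum over V


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                      # placeholder embeddings
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))

Z, weights = masked_self_attention(X, W_Q, W_K, W_V)

# Swap out the last token ("pizza") for something completely different.
X_changed = X.copy()
X_changed[4] = rng.normal(size=4)
Z_changed, _ = masked_self_attention(X_changed, W_Q, W_K, W_V)

print(np.allclose(Z[:4], Z_changed[:4]))         # earlier tokens are unaffected
```

Only the last row of the output changes, because it is the only row allowed to look at the last token.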
That's the deal.
That is masked self-attention.
And that's it. One mask, just a triangle of negative infinities, and we've solved the entire problem. Think about what we just achieved. We get the best of both worlds. Because we process the whole sequence at once, we can train our model massively in parallel, making it incredibly fast. But, because of the mask, we mathematically prevent the model from cheating and looking into the future, stopping data leakage in its tracks.
Next video, we're tackling the one question this series has been building up to.
How does a decoder actually go from a prompt to a full sentence word by word?
Next video, decoder architecture start to finish.
And see how masked attention fits into the full text generation pipeline.
Next up, we scale it all to a complete GPT pipeline.
If this clicked for you, hit like. It genuinely helps. Subscribe so you don't miss the next one. See you there.