This video provides a lucid decomposition of the mathematical elegance behind RoPE, making a complex architectural pivot feel intuitive. It is a masterclass in technical communication for anyone serious about understanding transformer internals.
Deep Dive
RoPE: Understanding Rotary Positional Embeddings in Transformers
Hello everybody. I'm Aritro, and today I'm going to try out something very different: I'm going to brain dump everything I have learned in this one day. Since Gemma 4 introduced something called pruned RoPE, I thought, why not work on RoPE first and make our way toward what pruned RoPE might do? So this video will only talk about RoPE and how I understand it. Let me know in the comments if you want the next video to cover pruned RoPE. So, let's begin.
The first bit is to understand why we need positional embeddings at all.
The term permutation equivariance is something we keep hearing, because attention and even MLPs are permutation equivariant. That means it's a set operation. So, in this code, we import torch and set the configuration: the batch size is one, the sequence length is five, and the dimension is two, just for the sake of brevity.
We also introduce X, the input, which is just a torch random normal tensor with the shape batch size, sequence length, and dimensions.
We build an attention projection. It's called an attention projection simply because it projects your inputs to a latent, which is then split into Q, K, and V. We'll talk about that later. And MHA is just multi-headed attention with the embedding dims the same as dimensions. We keep the number of heads as one.
And batch first is true because the first axis is the batch, right? Then we project our X into QKV, which is why the output dimension is three times dims. And then we split the tensor into chunks of size dims along axis two. The axes are 0, 1, 2, and the last one has size three times dims, so splitting it gives us Q, K, and V, each of shape batch size, sequence length, dimensions. Right? And then we pass the query, key, and value to our multi-headed attention layer.
The next part is something you might not have noticed. We create a random permutation index, permute our input matrix with it, send it through the attention projection, and then split as we did in the lines above.
We can assert that the original output, indexed with the permutation, equals the output of the permuted input.
So, if you permute the inputs, the outputs are permuted the same way.
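The check described above can be sketched without any dependencies. This is a minimal illustration, not the torch code shown in the video: it assumes a single head, no biases, and random projection matrices, and the helper names (`matmul`, `softmax`, `self_attention`) are made up for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information (assumed setup)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

random.seed(0)
seq_len, dims = 5, 2
x = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(seq_len)]
w_q, w_k, w_v = (
    [[random.gauss(0, 1) for _ in range(dims)] for _ in range(dims)] for _ in range(3)
)

out = self_attention(x, w_q, w_k, w_v)

# Shuffle the token order and run the same attention again.
perm = list(range(seq_len))
random.shuffle(perm)
x_perm = [x[i] for i in perm]
out_perm = self_attention(x_perm, w_q, w_k, w_v)

# Row i of the permuted output should match row perm[i] of the original output.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(equivariant)
```

The rows of the output move around with the tokens, but their values do not change, which is exactly the missing notion of position.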
Now, why is this bad, Aritro? This is bad because the attention mechanism does not know what position is. So, even if you swap every token with another one and the notion of position is lost, the output just mimics the permutation. It does not say, "Hey, the positions were swapped, so here is a different vector altogether, or here is a different representation altogether."
It does not know the positions, hence each output just changes its place, and that's it. That is bad because, as we know, LLM tokens, vision tokens, patches, or whatever they may be, are all representative of something because of the position they have and the context they carry. If you say "dog bites a dog", both these dogs are different. But if you change their places in the input without providing any positional information, the first dog is treated as the same dog as the one at the end of the sentence. So, that's bad. We know that permutation equivariance is a bad thing for the attention mechanism.
We need a notion of position to be added to the inputs before they go into the attention mechanism, so that this base is covered.
So, the first intuition that we have is attention mechanisms are permutation equivariant, which is bad.
We need to inject some notion of position into our inputs. With that, the most intuitive thing to do is to add the positions as is, in integer format, to the inputs and see where that takes us. The problem here, as you can notice, is that the norm of X with the positions added blows up, and the norm blowing up is a recipe for disaster. Once we train with these huge vectors, whose magnitudes are large, it's probably going to explode our gradients, which is very bad for our model.
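A small sketch of the norm problem, assuming the integer position is simply added to every dimension of a token vector. The token values and the positions chosen here are illustrative.

```python
import math
import random

random.seed(0)
dims = 2
token = [random.gauss(0, 1) for _ in range(dims)]   # one token embedding

def norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))

positions = [0, 10, 100, 1000]
# Add the raw integer position to every dimension of the same token.
norms = [norm([v + pos for v in token]) for pos in positions]

for pos, n in zip(positions, norms):
    print(pos, round(n, 2))
```

The same token ends up with a norm in the thousands at position 1000, so the position completely dominates whatever the embedding originally encoded.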
The next best thing is to represent these integers, which are in decimal format, as binaries. Now the problem with the norms is gone. But the problem now is that the changes in the binaries are very jumpy. They are discrete in nature, and they go from one value to the other very abruptly.
And notice how I made a mistake here.
I just realized while recording that I've made a mistake: both of these values should be one. Having said that, what I really want you to take away from this slide is that the least significant bit, or LSB, turns on and off frequently, and the most significant bit does not turn on and off that frequently. Keep this in mind, because it will come in handy in the later slides. Because binary positional embeddings are discrete in nature and the transitions are jumpy, what we need is something continuous. Continuous things are easy to model, and neural networks really like continuous things.
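To make the LSB and MSB behaviour concrete, here is a minimal sketch that writes positions 0 to 7 as 4-bit codes and counts how often each bit flips. The bit width and the helper names are assumptions made for the example.

```python
def binary_position(pos, n_bits=4):
    """Return the bits of `pos`, most significant bit first."""
    return [(pos >> shift) & 1 for shift in reversed(range(n_bits))]

codes = [binary_position(pos) for pos in range(8)]
for pos, bits in enumerate(codes):
    print(pos, bits)

def count_flips(column):
    """Count how many times a bit changes between consecutive positions."""
    return sum(a != b for a, b in zip(column, column[1:]))

columns = list(zip(*codes))                 # one column per bit, MSB first
flips = [count_flips(col) for col in columns]
print(flips)  # [0, 1, 3, 7]
```

The rightmost bit flips at every step, while the leftmost bit never flips in this range, and every change is a hard jump between 0 and 1.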
So, the jump from binary positional embeddings to something more continuous is done with sinusoids. And with sinusoids, we come to 2017, when these were introduced in the seminal paper called Attention Is All You Need.
The reason there are so many things at once in this slide is that I wanted to not only talk about sinusoidal positional embeddings but also compare them with what we had in the previous slide, the binary positional embeddings. Notice this chart, which plots the least significant bit, the most significant bit, and all the other bits.
Here, it's very similar. Notice that the least significant bit, so to speak, is continuous, but it also oscillates very quickly, while the most significant bit does not oscillate at all in this plot, though later down the line it will.
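A minimal sketch of sinusoidal positional embeddings, using the interleaved sine and cosine form from Attention Is All You Need with a base of 10,000. The dimension of 8 and the 16 positions are illustrative choices, and the helper names are made up for this example.

```python
import math

def sinusoidal_embedding(pos, dims=8, base=10000.0):
    """Interleaved sin/cos embedding for a single position."""
    emb = []
    for i in range(dims // 2):
        angle = pos / base ** (2 * i / dims)
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb

table = [sinusoidal_embedding(pos) for pos in range(16)]

def sign_changes(values):
    """Count how often a signal crosses zero between consecutive positions."""
    return sum((a < 0) != (b < 0) for a, b in zip(values, values[1:]))

fast = [row[0] for row in table]     # highest-frequency sine, the "LSB"
slow = [row[-2] for row in table]    # lowest-frequency sine, the "MSB"
print(sign_changes(fast), sign_changes(slow))

# Unlike raw integer positions, every embedding has the same bounded norm.
norms = [math.sqrt(sum(v * v for v in row)) for row in table]
print(round(min(norms), 6), round(max(norms), 6))
```

The fast channel crosses zero several times over 16 positions while the slow channel has barely moved, which mirrors the binary picture but with smooth transitions.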
So, this is interesting. We went from integers, where the norms were exploding, to binaries, where things were discrete, and we wanted something continuous. So we went to sinusoids, which are continuous, which we really like. But that also comes with a problem. When you add something to our inputs, the semantic information of the inputs gets hampered, because we're adding something on top of it. So the magnitude of the vector gets hampered.
The solution is to look at something which is multiplicative in nature.
And with that, let me move myself here. Yes, the idea here is that, for the lookup word, or query, in our dictionary, the keys nearby should have more influence than distant ones. So, if you are closer by, you should influence more.
If you are far off, you should influence less. And if you look at the dot product equation, you see that the magnitudes of A and B are decoupled from cos theta, where theta is the angle between the two vectors.
And that translates into what we do with QK transpose.
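The dot product identity mentioned above can be checked numerically. This is a small sketch with two hand-picked 2-D vectors; the values are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

a, b = [3.0, 0.0], [2.0, 2.0]

# Angle between the two 2-D vectors, in radians.
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])

lhs = dot(a, b)                              # a . b
rhs = norm(a) * norm(b) * math.cos(theta)    # |a| |b| cos(theta)
print(round(lhs, 6), round(rhs, 6))

# Scaling b changes its magnitude but leaves the angle untouched.
b_scaled = [10 * v for v in b]
theta_scaled = math.atan2(b_scaled[1], b_scaled[0]) - math.atan2(a[1], a[0])
print(math.isclose(theta, theta_scaled))
```

Magnitude and angle are separate knobs in the score, so changing only the angle leaves the magnitudes alone.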
And the research team gathered all their focus and said, "Hey, QK transpose is dot product, and we don't want to inject anything into our Q and K, but we want to kind of understand or feed or inject some positional information into this. What can we do?"
What we can do is look at cos theta, the angle between the two vectors, and rotate the vectors. Do you see where I'm going? First we injected, as in added. Now we look at something more multiplicative in nature, hence the rotation angles. But before that, let's talk about rotation a bit.
So, here I have plotted a vector, which is (1, 0), that is X is 1 and Y is 0. So, this is the vector.
And I want to rotate it by 45°, so the rotation angle in degrees is 45. I convert degrees to radians because most of the rotation matrix formulas that we will work with use radians and not degrees. Later in the slides I also convert back to degrees, because we as humans comprehend degrees better.
So, 45°, good. We change the degrees to radians, and then the rotation matrix is just cos of alpha, minus sine of alpha, sine of alpha, and cos of alpha, where alpha is the angle. We then do a matrix-vector multiplication of the rotation matrix with the vector. As you can see, the first vector, in red, is (1, 0), and we rotate it counterclockwise by an angle of 45°, which gives the green vector. So, in its entirety, what we do is take this vector (1, 0) and rotate it with the help of a rotation matrix. Keep this in mind, we'll use it later.
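The rotation described above can be sketched in a few lines. This is a plain-Python illustration rather than the code on the slide, and the function name `rotate` is made up for the example.

```python
import math

def rotate(vec, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle in degrees."""
    alpha = math.radians(degrees)            # rotation formulas expect radians
    rotation = [
        [math.cos(alpha), -math.sin(alpha)],
        [math.sin(alpha),  math.cos(alpha)],
    ]
    # Matrix-vector product: each output entry is a row of the matrix dotted with vec.
    return [sum(r * v for r, v in zip(row, vec)) for row in rotation]

rotated = rotate([1.0, 0.0], 45)
print([round(v, 4) for v in rotated])  # [0.7071, 0.7071]
```

The rotated vector still has length one, so rotation changes direction without touching magnitude.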
Finally, let's dive deep into rotary positional embeddings.
Let's make sense of this entire slide.
First starting with the diagram.
The diagram shows positions along this axis: 0, 1, 2, 3. These are the positions. So we can say that we have four tokens, token number 0, 1, 2, and 3, and they are lined up like so.
And each of these tokens has eight dimensions. We can see that there are four distinct orange boxes, inside each of which we have two distinct blue boxes. So, 4 times 2, that is eight dimensions, which is also signified by D is equal to eight here.
Now, rotary positional embeddings should make sense given the previous slide, where we had a vector with two values, 1 and 0, that is X and Y, and rotated it with a rotation matrix.
What we do here is divide these eight dimensions into sets of two, that is two here, two here, two here, and two here. And then we decide what the angle alpha will be for each rotation.
So, the first part is how rotation works.
Now that we know how rotation works with a vector and the rotation matrix, what we do here is divvy up each embedding into pairs so that this can be rotated, this can be rotated, and so on. So every pair can be rotated independently.
Let's talk about this formula now: omega k, which gives the radians of rotation. If we insert k and d into this formula, along with theta, which is 10,000, very similar to what was used for the sinusoidal positional embeddings in Attention Is All You Need, we see that the radians are these. And once we convert radians into degrees, we see that the least significant bit, so to say, the rightmost part, rotates very frequently, while the leftmost part rotates much less.
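A sketch of the per-pair rotation rates, assuming the formula on the slide is the usual RoPE form, omega_k = theta^(-2k/d) for k from 0 to d/2 - 1, with theta = 10,000 and d = 8. The exact indexing on the slide is not visible in the transcript, so treat the ordering here as an assumption.

```python
import math

def rope_frequencies(dims=8, theta=10000.0):
    """Radians of rotation per position step for each of the dims/2 pairs."""
    return [theta ** (-2 * k / dims) for k in range(dims // 2)]

omegas = rope_frequencies()
for k, omega in enumerate(omegas):
    # Show the rate in radians and in degrees, since degrees are easier to read.
    print(k, round(omega, 4), round(math.degrees(omega), 4))
```

With these settings the rates are roughly 1, 0.1, 0.01, and 0.001 radians per position, or about 57.3, 5.7, 0.57, and 0.057 degrees. In this indexing, pair 0 is the fast one. Which end of the embedding it appears on depends on how the pairs are laid out.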
Now does the point which which I made earlier in the slides connect to this?
So, think about integers, think about binaries, the least significant bit and the most significant bit, then sinusoids, and now rotation. Everything is connected through rotations and frequencies: how fast the rotations happen, how infrequently they happen, and so on. So, just to tie the entire thing up: we have embeddings. We divide each embedding into pairs, because two values is what we need to multiply with our rotation matrix. How much we rotate is decided by this formula, and the formula gives us radians. We can also view it in degrees, but radians is what we feed into the rotation matrix.
Now that we have the radians, it is time to insert them into the rotation matrix formula, and we get a tensor of shape 4 comma 4 comma 2 comma 2, which is overwhelming, but stay with me. The first four is the number of positions, so we have four tokens, and that's pos.
The second four is the dimension by two: eight dimensions divided into pairs, so 8 by 2 is 4. And each of these is a rotation matrix of cos, minus sine, sine, cos, so 2 by 2.
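The (4, 4, 2, 2) tensor can be sketched as nested lists. This assumes the angle for position `pos` and pair `k` is `pos * omega_k`, with omega_k in the usual RoPE form; the function name and defaults are illustrative.

```python
import math

def rope_rotations(num_positions=4, dims=8, theta=10000.0):
    """Build one 2x2 rotation matrix per (position, pair)."""
    tensor = []
    for pos in range(num_positions):
        per_pair = []
        for k in range(dims // 2):
            angle = pos * theta ** (-2 * k / dims)   # assumed: position times omega_k
            c, s = math.cos(angle), math.sin(angle)
            per_pair.append([[c, -s], [s, c]])
        tensor.append(per_pair)
    return tensor

rope = rope_rotations()
shape = (len(rope), len(rope[0]), len(rope[0][0]), len(rope[0][0][0]))
print(shape)  # (4, 4, 2, 2)
```

Position 0 gets the identity matrix for every pair, since a rotation by zero radians leaves the vector unchanged.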
Now, let's look at how it works in code. I did not want to put up code here because it was a little overwhelming for me to explain, so what I did instead is jot down the shapes of all the tensors and what we do with them in the code. The shape of the RoPE tensor starts with batch and heads. Batch is the batch axis, I'm pretty sure you know this. Head is the number of heads we want for our multi-head attention.
A good thing to notice is that the head axis is always one, because RoPE does not depend on which head the RoPE matrix goes into.
It's just broadcast, so we can simply swap it out with one.
The sequence length is the number of tokens that we have in the sequence.
Dimension by two and two cross two are something we already know.
Now, Q and K have the shape batch, head, sequence length, dim. But we want to use this Q and K later with the RoPE, or rotation, matrix and rotate the embeddings. So we need a way to line their dimensions up with RoPE. What we do is divide dims by two and then add, or expand, a dimension, with the second half given here, so that it aligns with the RoPE tensor.
And at the end of the day what we do is we basically do this.
That is, we just matrix-multiply this rotation matrix with our pair of values, A and B.
An easier way to do this is with just a multiplication and an add.
I'll leave this slide up for some time to make sure that you understand it, and also that this is absolutely the same as this.
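The equivalence between the matrix form and the multiply-and-add form can be sketched for a single pair. The function names are made up for this example, and the values of `a`, `b`, and the angle are arbitrary.

```python
import math

def rotate_with_matrix(a, b, angle):
    """Rotate the pair (a, b) with an explicit 2x2 rotation matrix."""
    c, s = math.cos(angle), math.sin(angle)
    matrix = [[c, -s], [s, c]]
    return [
        matrix[0][0] * a + matrix[0][1] * b,
        matrix[1][0] * a + matrix[1][1] * b,
    ]

def rotate_with_multiply_add(a, b, angle):
    """Same rotation as elementwise work: pair * cos + swapped_pair * sin."""
    c, s = math.cos(angle), math.sin(angle)
    pair = [a, b]
    swapped = [-b, a]            # swap the two values and negate the first
    return [p * c + q * s for p, q in zip(pair, swapped)]

a, b, angle = 0.5, -1.25, 0.3
via_matrix = rotate_with_matrix(a, b, angle)
via_multiply_add = rotate_with_multiply_add(a, b, angle)
print(via_matrix)
print(via_multiply_add)
```

Both functions compute (a cos - b sin, a sin + b cos), so the second form avoids building a rotation matrix for every position and pair.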
And with that I end this. But if you found yourself going from zero to maybe five in terms of RoPE, and you understand a little bit but need more in-depth knowledge, please check out the blog post from Christopher Fleetwood.
He's a really good writer, and this blog post was my entry into RoPE, and also the outlier. I recommend his channel, and I could not recommend him more to everybody out there. He is great. His video on RoPE really made sense to me, and the animations are really good.
So, a lot of the things that I've put in the slides are just brain dumps of the blogs and the videos. I'm citing everybody. Please leave a comment on this video if you don't understand something, and I'll be looking forward to interacting with you in the comment section. Or, if this goes out on Twitter, LinkedIn, or wherever, just engage with it so that I know where I need to work more. And if you want these deep dive topics, please let me know.
Bye-bye.