This lecture provides a remarkably clear and rigorous roadmap for transitioning from traditional diffusion to the more efficient paradigm of flow matching. It is an essential deep dive for those who value mathematical elegance as much as computational performance.
Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 3 - Flow matching
Hello everyone and welcome to lecture 3 of CME296.
So today is a very special day because we're going to talk about flow matching which is a generation paradigm that is quite trendy these days.
So if you remember, lectures one and two were also about generation paradigms: we talked about diffusion with DDPM in lecture one and score matching in lecture two. So flow matching is yet another generation paradigm, and we're going to see in this lecture what the mindset behind that separate paradigm is and how it relates to diffusion and score matching.
So as usual we're going to start the class with a recap of the last episodes.
So if you remember lecture one, we saw how we could generate new samples. And the way we thought about doing that was taking images from our training set, which we named clean images. And what we did is that we gradually corrupted them with a process that we called $q$, which, given an image at a given time step, noised it by adding some Gaussian noise weighted by some coefficient.
And the goal of diffusion with DDPM was to learn a reverse process, which we denoted $p_\theta$, that removes the noise gradually. And what we saw was that in order to learn $p_\theta$, we tried to maximize the likelihood of our model seeing the training data. And if you remember, we did some clever math with an estimate of the lower bound, so the ELBO, and we derived some terms and we ended up with a pretty good-looking loss, which was just an L2 regression on the noise that was added to the image at time step $t$.
And so this was lecture one. Lecture two was taking another perspective and just taking a step back, thinking of images as points in the n-dimensional space of images, and the fact that what we wanted to do was sample from an easy distribution and then walk our way up to the true data distribution.
So how did we do that? The whole point of the lecture was to learn what we call the score, which is the gradient of the log of $p$, so $\nabla_x \log p(x)$.
And why did we do that? Because it acted as a sort of compass pointing from where we were toward regions of high density.
And if we learned such a score, what we could do was use sampling techniques such as Langevin sampling, which allowed us to move towards high density regions but also allowed for some diversity through a noise term.
And we saw that in order for us to estimate that score, we cleverly noised the initial data distribution in order to learn the score of the noised distribution.
And then we had a progressive noise schedule that allowed us to progressively work our way up until the data distribution. So we ended up with a score function that was a function of the position in space but also the noise level.
And again after doing some clever math we also ended up with an L2 regression loss which was quite easy to compute.
But we also saw that lecture two was actually quite related to lecture one, because both involved adding noise, and in particular we saw that there was a relationship between the score of a noisy distribution and the noise that we added to go from the clean image to the noisy image.
So given that, what we did was think of a unified way of looking at both these perspectives, and we moved from a discrete representation to a continuous one, and we obtained this SDE, which was of the form $dx = f(x, t)\,dt + g(t)\,dw$, where $w$ is our Wiener process, which introduces the stochasticity. In particular, the first term is called the drift term, and it quantifies the deterministic movement in space across time, and the second term is called the diffusion term, which quantifies the amount of stochasticity that you have in your process.
So this expression is the expression of the forward process, meaning you're actually noising your images, and we saw that if we were to represent that in space, it meant again having a drift component and a diffusion component.
So now if we wanted to denoise our image, we were actually interested in the reverse process, which we saw, thanks to an SDE result from the 80s, involved knowing the score. So this expression is the reverse process that allows us to denoise images, which is why we need to learn the score, and this is where the score comes in in the continuous view.
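For reference, the two expressions recalled in this recap can be written out as below. This is a sketch in the standard form of the reverse-time result (Anderson, 1982). The symbol $\bar{w}$ for the Wiener process running backward in time is notation assumed here, since the slide itself is not part of the transcript.

```latex
% Forward (noising) SDE: drift term plus diffusion term
dx = f(x, t)\,dt + g(t)\,dw

% Reverse (denoising) SDE: the score \nabla_x \log p_t(x) enters the drift
dx = \left[ f(x, t) - g(t)^2\,\nabla_x \log p_t(x) \right] dt + g(t)\,d\bar{w}
```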
So this was two weeks ago and a week ago. But today we're going to see this third generation paradigm which is called flow matching.
And similar to the past lectures, what we're going to try to do is to have a nice balance between intuition and math because you will see that these papers are full of math, full of equations. And we will try to approach it from an intuitive perspective, but not shy away from the math either.
And of course, please interrupt me at any point if you feel there's something that's not clear.
Cool.
Okay, let's start. So, the whole point of these first three lectures is to find a way to go from somewhere that is easy to sample from up until our data distribution of interest.
So the way that flow matching works is it has two distributions. The first one is called the initial distribution, which is the distribution that you're starting from, and the second one is called the target distribution, and it is the distribution that you want to end up with.
And the whole point of flow matching is finding a way to transport our initial data distribution up until the target distribution.
And just note the terminology I'm using: transport. So what we want is to transport the whole density that we have, and remember that probability densities have this very nice property of integrating to one. So we want to transport the whole density from one place to another.
So this is the idea behind flow matching and we're going to see how we can do that.
So I'm going to just give you a warning in terms of conventions.
In the past we've seen that zero was clean data and capital T was the noisy data. But here it's going to be different.
So we will denote by $p_0$ our initial data distribution, which is typically something that is easy to sample from such as Gaussian noise, and the target data distribution is going to be denoted $p_1$, at $t = 1$, and this is where we want to end up. So this slide is just to say that in lectures one and two, zero was clean data and capital T was noise, but here zero will be noise and one will be what we want, which is the clean images, the data distribution.
So the reason why we chose to proceed that way is just that the field chose these conventions. And if we were to keep the conventions of the diffusion and score matching fields, it would just completely confuse you if you were to read papers from the flow matching field, which is the reason why we're keeping things consistent with the paper.
But just keep those things in mind because sometimes these conventions can also be reversed depending on which paper you're reading. But this one is by far the most common.
Cool. So with that warning aside, what we want to do is to sample an observation from our initial data distribution and then map it to the target data distribution.
So in order to do that we're now going to go through some common notations and the first one is called the trajectory.
So the trajectory is denoted $x_t$ and it is the path that is taken by an observation between times zero and one. By the way, the choice of times zero and one is arbitrary. It's just a way for us to say, okay, when are we starting and when are we ending?
And so here at time $t = 0$, we have $x_0$. And what we want is for our trajectory to map a point from $p_0$ up until $p_1$.
So that's our goal.
So now let's see another notation. So this one is called the flow.
The flow is denoted $\psi_t(x_0)$.
So I'm just going to write that down here. So it is a function that takes a time and an initial condition and it gives you where that point is at time $t$.
So this is the time, so it belongs to $[0, 1]$.
This one is in your space, which is, I don't know, $d$-dimensional, and this one is also in the $d$-dimensional space, and you have $\psi_t(x_0)$, which is equal to your point at time $t$, which is $x_t$.
So the flow, you can interpret it in different ways. One way is that it's a collection of trajectories, because if you change the initial condition $x_0$, then $\psi_t(x_0)$ gives you all the possible trajectories that go from an initial condition $x_0$ up until time $t$.
Yeah.
Um the flow can also be seen as a function that maps your initial condition to where it should be at time t.
And this is why it's called the flow.
So this function is quite important because at the end of the day what you want is for $\psi_1(x_0)$ to be as if it was sampled from $p_1$, which is your target data distribution.
Cool. So now we have our next terminology, which is called the probability path, and it's denoted $p_t(x)$.
So $p_t(x)$ is the probability distribution of your observations at time $t$.
So $p_0(x)$ is your initial data distribution and $p_1(x)$ is your target data distribution.
So if $t$ is between zero and one, $p_t$ is basically the intermediary data distribution.
And as you can imagine, in order to go from $p_0$ to $p_1$ you have an infinite number of ways of going from one to the other.
So $p_t$ is by no means unique.
There are many ways for for you to go from zero to one.
Cool. And now we're going to talk about a central object of this whole field, and it's called the vector field. So the vector field is denoted $u_t(x)$ and it is a function that takes in a place in space, so $x$, and a time $t$, and it gives you a vector, and that vector you can interpret as the direction and the speed at which particles should move at that location at that time.
So again, the vector field takes in a vector, which is your position in space, and gives you also a vector, which is where you need to move and how fast.
And we're going to see that the vector field is quite central to flow matching.
But before we do that, I know last lecture we talked about the score and the score was also a vector and if you remember the score was also something we used in order to go from noise to data.
So one natural question you may be asking yourself right now is well how does this vector field or velocity relate to the score.
So I'm going to give you the following analogy.
So let's suppose you want to go from your initial distribution to your target distribution. And let's suppose that you're thinking of this in terms of self-driving cars that are distributed, let's say, in a Gaussian fashion in the middle of the desert.
And what you want to do is to position your self-driving cars in a certain way so that it follows let's say a target data distribution.
So the way you would think of the velocity would be instructions that you would give to your self-driving cars at a given location and at a given time t.
So if the vector field was not time dependent, you could even think of it as highways, you know, routes with speed limits. But here the vector field is actually time dependent, which means that at a given location $x$ and at two different time steps $t_1$ and $t_2$, that vector can actually be different.
And in that sense, just following with that self-driving car example, the score would be a bit like a compass that you could see in your car of where high density regions are.
So I hope this is useful to just picture the difference between what a velocity is and what a score is. So these two are vectors, but they have different interpretations. Any questions so far?
Yeah, we've barely started. So we're going to see a few things here, but I just want to make sure that the notations so far are clear.
Yeah. All good.
Cool. So, we're going to see our first identity here.
So, we introduced the vector field, aka the velocity, because this is the way we're going to have our particles move from one place to another.
And in particular, the definition we're giving to the velocity, aka the vector field, is as follows. So what we're going to say is that a change in position $dx$ is given by the velocity at that location at that time $t$, times $dt$, so $dx = u_t(x)\,dt$.
So in other words a given particle in a given location will be moving according to that velocity.
And one thing that I want to tell you is about the trajectory, which as a reminder is the whole path that the particle is taking from time $t = 0$ to time $t = 1$.
That trajectory is actually going to be unique for a given initial condition if there's a special condition on the vector field, which is that the vector field is Lipschitz continuous.
So the reason why I'm bringing this up is because it's going to be important.
And so who knows what Lipschitz continuity is? Have you ever heard of this term?
Yeah.
No. Okay. So maybe I can just recap the mathematical definition and what it means.
So Lipschitz continuity of a function $f$ is when you have some constant, let's call it $M$, such that for all $x$ and $y$ in your space of interest, your function $f$ satisfies this relationship: the norm of $f(x) - f(y)$ is less than or equal to $M$ times the norm of $x - y$, so $\|f(x) - f(y)\| \le M \|x - y\|$.
So what this means is that f needs to be continuous but also not vary in a very dramatic way.
So this is the mathematical definition of Lipschitz continuity, and what I'm saying here is that if the vector field is Lipschitz continuous, then the trajectories $x_t$ will be unique for a given initial condition. So here we have our ODE, which is here, like so. And I'm going to give you an example of how, if we have a vector field that is not Lipschitz continuous, we can have trajectories that start from the same point but that are actually different.
So here the example is as follows. Let's suppose your vector field is given by, let's say, $\sqrt{x}$, and by the way you only consider $x$ positive or zero.
So your vector field is time independent, it's just a function of $x$ via $\sqrt{x}$, and if you say you want your initial condition to be $x_0 = 0$, then $x_t = 0$ is going to be a solution.
But $x_t = t^2 / 4$ is also going to be a solution.
So here, $t^2 / 4$: the derivative is $t / 2$, and when you put this in the square root, it's also $t / 2$. The square root is not Lipschitz continuous at zero, because its slope blows up there. So all of that to say that your vector field needs to behave in a certain way if you want to guarantee that your trajectories are unique, and this is Lipschitz continuity, and you're going to see why that matters.
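The non-uniqueness example above can be checked numerically. The sketch below is only an illustration, assuming NumPy; the variable names are made up here. It plugs both candidate trajectories into $dx/dt = \sqrt{x}$ and confirms that both satisfy it while starting from the same point.

```python
import numpy as np

def u(x):
    # Time-independent vector field from the example, defined for x >= 0.
    return np.sqrt(x)

t = np.linspace(0.0, 2.0, 201)

# Candidate 1: the particle never moves.
x_a = np.zeros_like(t)
dx_a = np.zeros_like(t)          # exact derivative of x(t) = 0

# Candidate 2: the particle leaves the origin.
x_b = t**2 / 4.0
dx_b = t / 2.0                   # exact derivative of x(t) = t^2 / 4

# Residual of dx/dt = u(x) for both candidates; both should be ~0.
res_a = np.max(np.abs(dx_a - u(x_a)))
res_b = np.max(np.abs(dx_b - u(x_b)))
print(res_a, res_b)
```

Both residuals vanish up to floating-point error, even though the two trajectories share the initial condition $x_0 = 0$ and differ for every $t > 0$.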
Cool. So that's the first equation that's going to be important. So the ODE here, which is $dx/dt$ equal to the velocity at location $x$ and time $t$, so $dx/dt = u_t(x)$. It's the first one.
The second one that we're going to see is the following.
So I told you that what we want to do is transport our distribution from one place to another.
So when you transport something, one thing that you want to guarantee is that you're not going to lose or gain anything of what you're transporting.
You have what we call conservation of mass.
And this idea is also something we want to put in writing.
And in particular, we're going to focus on the following case. So let's assume you're in a region of space and you consider a very small area and here you have your data distribution that is you know around it.
So the equation that we want to put in writing is around making sure that the evolution of your density at time $t$ is equal to the density that is coming in minus the density that is going out, or that is leaving.
Do you all agree with this idea? So the evolution of density at time $t$ is a function of how much density comes in minus how much goes out.
So the left hand side is kind of easy.
It's just partial derivative of your probability path at that region with respect to time.
But then the right side, I don't know about you, but very intuitively, there's not an obvious operator that does that.
So, what we're going to see now is how we can quantify the idea on the right side of the equal sign.
But before we do that, just want to take a pause. Does this idea make sense?
Conservation of mass by the way. Have you seen it in another topic?
Yeah.
So in physics you have the same thing and it actually has a name.
It's called the continuity equation.
Anyways, so instead of just giving you the formula, we're going to get to it intuitively. And in particular, I want us to think about the right-hand side of this equation and think for a bit about how we want to quantify the inflow minus outflow bit.
So let's suppose you have an operator that's called the divergence.
But right now I'm not telling you anything about what that operator is.
I'm just telling you that the operator is positive when things are coming out more than coming in.
And we're going to see what happens then.
So here when we're talking about coming in and coming out, we're of course talking about vectors.
So here we're assuming, for instance, that there's more coming out than coming in. Let's assume we are in one-dimensional space, and you have a vector that is such that at first it's small in magnitude and then it's much bigger in magnitude.
So here do you agree with me that whatever this vector is representing there is more coming out than coming in?
You all agree? Okay. So that's one case where coming out can be more than coming in. So let's see the second case. So the second case is here, and let's say your vector is coming out here and coming out there.
Yeah, I'm sure you can have more cases than that. But in this 1D example, if you were to represent your vector, let's say F over the dimension X here, what is the magnitude of F doing?
It is increasing.
Okay. So here we just see that the partial derivative of $F$ with respect to $x$ is positive, because $F$ is increasing. How about here? So here it's negative and then it becomes positive, so also increasing, right?
Okay. So based on this very back-of-the-envelope intuition, we see that when things are leaving, the partial derivative of $F$ over this dimension $x$ is positive, which seems to match the sign that we want from our divergence operation.
So now let's see the other case. So let's assume that we have now the case where we want our operator to be negative when there is quote unquote more coming than leaving.
So here again, this example. So you have, let's say, a bigger vector here coming in and then a smaller one leaving, and then let's say you have another case, which is given by this, which is just vectors pointing towards that area.
So here you see that $F$ is decreasing when you project it on $x$. So here you have the partial derivative of $F$ with respect to $x$, which is negative, and here you see that it was first positive and then it becomes negative, which again boils down to this.
So this is by no means a formal proof, but I hope I gave you some intuition that if you take the partial derivative of your function that is outputting a vector over its coordinates, at least in 1D, it would give you this intuition that if it's positive then there's more coming out, and when it's negative then there's more coming in.
So based on this intuition, what I'm now telling you is that there is an operator that is indeed called the divergence. It takes a vector $F$ that is a function of $x$, which is also multi-dimensional, and it is just the sum of the partial derivatives of each component of $F$ over its own dimension, so $\nabla \cdot F = \sum_i \partial F_i / \partial x_i$.
So the divergence is a scalar.
It's a number.
And if it's positive, it just means that there are more things coming out. And if it's negative, it means there are more things coming in.
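To make the sign convention concrete, here is a small sketch that estimates the divergence by central differences. It assumes NumPy, and the helper `divergence` and the two toy fields are hypothetical names introduced only for this illustration. A field pointing away from the origin should give a positive value, and a field pointing toward it a negative one.

```python
import numpy as np

def divergence(field, x, eps=1e-5):
    """Central-difference estimate of sum_i dF_i/dx_i at the point x."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        # Only the i-th output component contributes to the i-th term.
        total += (field(x + step)[i] - field(x - step)[i]) / (2.0 * eps)
    return total

outward = lambda x: x        # arrows point away from the origin
inward = lambda x: -x        # arrows point toward the origin

point = np.array([0.3, -0.7])
div_out = divergence(outward, point)   # more coming out than coming in
div_in = divergence(inward, point)     # more coming in than coming out
print(div_out, div_in)
```

In two dimensions the outward field has divergence close to 2 and the inward field close to -2, which matches the "positive means things are leaving" reading.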
So based on that, we want to see how we can use this divergence operator back in our continuity equation example, where we want to quantify the net inflow, which is how much density is coming in versus out.
So now a second question that we ask ourselves is, okay, what should we use as $F$? What should $F$ be? So we saw the vector field, which is a vector, so that could be a candidate. So do you think the vector field could be a good candidate?
So let's see with an example. So let's suppose you have two points in space let's say x1 and x2 and then you have their respective vector fields.
If you happen to have a higher density in one of the two points then our intuition tells us that you would want the point in the higher density region to transport more things than the one in the lower density region.
So in that case the vector field alone would actually not be a good candidate.
Instead and maybe you have guessed it in order to quantify the movement of density we of course need to take into consideration the probability density.
And so here the actual quantity that we will use is called the probability flux, and it is defined by $p_t$, which is a scalar, times $u_t$, which is the vector field.
So going back to the equation that we wanted to put in writing, what we're saying is that the temporal evolution of the density at time $t$, which we said is simply the partial derivative of $p_t$ with respect to $t$, is equal to minus the divergence of the probability flux.
And this quantity, so minus the divergence of $p_t(x)\,u_t(x)$, just means that we're interested in the net inflow of density within our small volume.
So all in all that equation which I'm going to write is nothing else than the continuity equation which you may have seen in physics.
So I'm going to write it down. Okay, this is here.
So what we're saying is that the evolution of density at a given location, the temporal evolution of the density at a given location is given by how much density is coming in minus how much density is coming out.
And this just conveys the fact that we're not creating new density out of thin air.
We're just conserving it when time goes from $t = 0$ to $t = 1$.
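Written out, the equation being described here reads as follows. The $d$ in the sum is the dimension of the space, a symbol assumed here for the illustration.

```latex
% Continuity equation: change of density = net inflow of probability flux
\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot \big( p_t(x)\, u_t(x) \big),
\qquad
\nabla \cdot F(x) = \sum_{i=1}^{d} \frac{\partial F_i(x)}{\partial x_i}
```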
And compared to the ODE that we saw before, you can think of this equation as being a macro perspective of how your probability path evolves.
And the ODE itself is actually centered on how one particle evolves, which is more of a micro perspective, which we're going to see here. So for a single sample, you take your sample $x$ at time $t$, and what you're saying is that how fast that particle is going is actually given by the vector field at that location at that time.
So the displacement $dx$ is given by $u_t(x)\,dt$.
So this is the micro perspective.
But we also saw that this vector field that was coming into the picture in this micro perspective is also in the macro perspective. And what we're saying is that, given that we're transporting the density and we're not losing it or gaining it, we also have this nice identity that we can use, which is called the continuity equation.
And so there is one expression that you will see a lot in papers.
I'm going to read it out loud: "the vector field $u_t(x)$ generates the probability path $p_t$". So when people say that, what they mean is that $u_t$ and $p_t$ satisfy the continuity equation.
So now you may wonder, well, why should we care? So just remember that our goal is to sample an observation from the initial distribution $p_0$ and to map it to the target data distribution.
And the way we want to map it is we want that particle to go from one place to another using the vector field.
And so what we want is to find a probability path that will allow us to go from $p_0$ to $p_1$.
And in order to use a probability path, we also need to find the associated vector field that will allow us to actually do the mapping, because that vector field is what appears in the single-sample perspective.
So that's the idea. We're going to see it in other words in a little bit. But this is the intuition behind flow models.
Okay. Any questions on this?
Everything is clear so far.
Yep.
So the question is whether the solution of this equation would be the target. So on the left-hand side, if we know $u_t$, and if we know that $u_t$ is allowing $p_t$ to go from $p_0$, something that we decide that is easy to sample from, to $p_1$, then what we're saying is that if you sample $x$ from $p_0$, you have your initial condition.
And if you know $u_t$, what that means is you can solve this equation in order to find $x_1$.
And $x_1$ by definition will be as if it was sampled from $p_1$.
So that's the idea.
I know it was a lot. I know it may not seem clear but don't worry we have a lot of examples down the line.
So let's move on. Oh, this is exactly my next slide, actually. So here what we're saying is the relationship between these two perspectives. I'm going to just repeat what I said but in mathematical terms: if we sample $x_0$ from $p_0$, and our particle $x$ is subject to the velocity $u_t$, and if that velocity is actually generating the probability path from the initial to the target one, then what that means is that $x_t$ will be sampled from $p_t$, and in particular at $t = 1$, $x_1$ will be sampled from $p_1$.
Okay, cool. So don't worry, we're going to go into more details on this.
So the whole goal of flow models is to map points, to map $x_0$ to $x_1$.
So what is our strategy here? We saw that the vector field was a central piece of this whole problem. So step number one is for us to estimate this vector field via some model that we're going to denote $u_t^\theta(x)$. So $\theta$ is the parameters that we're learning.
So if we learn such a vector field, then what we're going to do in order to map $x_0$ to $x_1$ is use the ODE and solve it numerically. So what does that mean? That means that we're going to sample $x_0$ and we're just going to follow the vector field up until time equal to one.
So this integral here in practice is going to be just a numerical solver that maybe you know, like Euler for instance, and it's going to give you $x_1$, which is what you want, because at the end of the day you want to be able to sample from your target data distribution.
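The sampling procedure just described, sample $x_0$ and then follow the vector field with an Euler solver, can be sketched as below. This is a minimal illustration assuming NumPy. `euler_sample` and `toy_field` are hypothetical names, and `toy_field` is only a stand-in for a learned $u_t^\theta$, chosen because its exact solution $x_0 e^{-1}$ is known.

```python
import numpy as np

def euler_sample(u, x0, n_steps=1000):
    """Follow dx/dt = u(t, x) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * u(t, x)      # one Euler step: x <- x + u_t(x) * dt
    return x

def toy_field(t, x):
    # Hypothetical stand-in for a learned vector field (assumption).
    return -x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(size=(4, 2))   # samples from p_0 = N(0, I)
x1 = euler_sample(toy_field, x0)

# For this toy field the exact answer is x0 * exp(-1); print the Euler error.
print(np.max(np.abs(x1 - x0 * np.exp(-1.0))))
```

More steps shrink the discretization error, at the cost of more evaluations of the vector field per sample.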
So starting from now our full focus will be on finding the vector field $u_t^\theta(x)$, because right now I just told you what we care about but I didn't tell you how we're going to do that. So now we're going to focus on that.
So in the past there have been many papers published on how you can estimate your vector field, and there is one method that I'm going to mention here which is not going to be the focus of this class, because it is a method that was used a number of years ago, so eight years ago I believe, but it is not something that could be used in practice, so people are not using it. I'm just telling you for historical purposes.
So one way of learning the vector field is by maximizing the likelihood of your model seeing the data, right, the training set. So how can you do that? Well, take the continuity equation that we saw, so this one.
So if you take this equation, then what happens is that if you do some math, you can actually make $\log p_t(x)$ appear, which is what you want to maximize, and you can have the following strategy. So you want to determine $\log p_1(x_1)$.
So $x_1$ is the data from the training set. So you want to compute the probability of your model seeing $x_1$.
The problem is you don't know how to compute it.
So the way you're going to do this is you're first going to realize that you can compute the probability of $x_0$ under the initial distribution, because typically the initial distribution is an easy one, something that you can analytically compute, such as a Gaussian. So you know how to compute $\log p_0(x_0)$.
And what you're going to do is link $x_1$ and $x_0$ by integrating the divergence of the vector field, because you have the formula above that is true. So if you know $\log p_0(x_0)$ and if you know the vector field, then you can compute this quantity. And so what that approach, which is called continuous normalizing flows, was doing was that it was trying to learn the parameters $\theta$ by maximizing the likelihood of your training set under $\theta$, and it was computing the likelihood using this formula. So what is the problem? The problem is that at training time, you need to constantly solve this integral, so it's a numerical solver that runs all the time.
So, it's very expensive and very slow, which is why we have a snail emoji.
So, it's just not practical. People just couldn't use that.
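The formula pointed to in this passage is on the slide rather than in the transcript. In its standard form, which is assumed here to be what the slide shows, it reads:

```latex
% Log-density along a trajectory of the ODE dx_t/dt = u_t^\theta(x_t)
\frac{d}{dt} \log p_t(x_t) = -\nabla \cdot u_t^\theta(x_t)

% Integrated from t = 0 to t = 1: links the data point x_1 to its origin x_0
\log p_1(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot u_t^\theta(x_t)\,dt
```

Evaluating the right-hand side requires running an ODE solver for every training example at every update, which is the cost the lecture is referring to.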
And for that reason, I hope I did a good job of motivating the reason why we needed another approach. The approach we're going to see now is actually what is called flow matching because we're going to see we're not doing maximum likelihood estimation. We're instead trying to directly learn the vector field.
So if you remember for score matching you were directly trying to learn the score. So here we're directly trying to learn the vector field but the problem is that you don't know what it is.
So we're going to see how we can compute this loss and indeed optimize on theta in order to find our vector field.
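The loss mentioned here is shown on the slide and not in the transcript. Assuming it is the usual flow matching regression objective, with $t$ drawn uniformly from $[0, 1]$ (an assumption of this sketch), it would read:

```latex
% Regress the model u_t^\theta onto the (unknown) target vector field u_t
\mathcal{L}_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0, 1],\; x \sim p_t}
    \left\| u_t^\theta(x) - u_t(x) \right\|^2
```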
So just remember that the problem that we have is we have an initial data distribution and a target data distribution.
And what we care about is finding this $u_t(x)$ which will allow us to go from $p_0$ to $p_1$. That is our goal.
Well, $p_0$ is easy. So it's a Gaussian typically. But then $p_1$ can be super complicated.
So one idea we can have is what if we just remember that the target data distribution is just composed of points from your training data. So it's like individual points.
So what if instead we simplify our problem and we ask ourselves how we can transport the initial data distribution not to the target one but to a single point.
How about we just move our whole distribution from the initial one to a single point? And here I denoted the single point with the distribution $\delta_{x_1}$.
So $\delta_{x_1}$ is the notation for a Dirac delta distribution. So who has heard of the Dirac distribution? Yeah.
So the Dirac distribution is actually probably the simplest distribution out there because it's a deterministic distribution. It will always give you one quantity. So here, delta, I'm going to write it down.
So $\delta_{x_1}(x)$ is going to be zero if $x$ is different from $x_1$, and otherwise it will be actually a very high number.
So typically people say, let's say, infinity, but the integral of this distribution, like of the density, needs to be one.
So long story short, we simplified our problem. We changed from the formulation of going from the initial data distribution to your target one, to going from the initial data distribution to a single point.
So one idea could be for us to think of the following way of going from one to the other. So let's suppose you have a probability path that is super simple. It's just a Gaussian distribution whose mean is being interpolated between zero and $x_1$.
So here what I'm saying is, how about we go from zero to the point by taking our normal distribution of mean zero and variance identity, and we just scale it down, scale it down, and move it towards $x_1$ until it becomes just that point. So here $p_t$, we say of $x$ given $x_1$, because it's a conditional probability path. It's like, given we're arriving at $x_1$ at time $t = 1$.
It will be a Gaussian with a mean of $t x_1$ and a variance of $(1 - t)^2 I$, so $p_t(x \mid x_1) = \mathcal{N}(t x_1, (1 - t)^2 I)$.
So let me tell you why that is the case.
So what I'm saying here is that I'm proposing a way for us to go from the normal distribution, this one, of mean zero and identity variance. In order to go from here to this point, which is x1, we're going to have a path p_t(x | x1), which is a normal distribution as a function of t, because it's a probability path: p_t(x | x1) = N(t x1, (1 - t)^2 I). So when t equals zero, the mean is zero and the variance is the identity. So p_0 is just the normal distribution N(0, I). And at t equal 1, the mean is just x1 and the variance is zero. So you can think of the variance just tending to zero.
So p_1 is just a Dirac distribution.
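As a quick sanity check of this path, here is a minimal NumPy sketch that samples from p_t(x | x1) = N(t x1, (1 - t)^2 I) at a few times and prints the empirical mean and standard deviation. The target point `x1`, the sample count, and the function name are illustrative assumptions, not something from the lecture.

```python
import numpy as np

def sample_conditional_path(x1, t, n_samples, rng):
    """Draw samples from p_t(x | x1) = N(t * x1, (1 - t)^2 * I)."""
    x1 = np.asarray(x1, dtype=float)
    # Mean moves linearly toward x1, standard deviation shrinks linearly to 0.
    return rng.normal(loc=t * x1, scale=1.0 - t, size=(n_samples, x1.shape[0]))

rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0])  # hypothetical target point (assumption)
for t in (0.0, 0.5, 1.0):
    xs = sample_conditional_path(x1, t, 20_000, rng)
    print(t, xs.mean(axis=0).round(2), xs.std(axis=0).round(2))
```

At t = 0 the samples should look like a standard Gaussian, and at t = 1 they all collapse onto `x1`, which is the Dirac endpoint described above.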
So this probability path that I'm telling you about is something that is letting us go from our initial data distribution to this point. So it's a valid one. So why am I choosing this?
Well, it can be shown that one vector field that is generating this probability path has the following expression: u_t(x | x1) = (x1 - x) / (1 - t). So I'm just giving you the formula. You can derive it, but looking at the time, we will not have time. The reason why I proposed this conditional probability path is because it's much easier for us to compute a vector field for it, and it's something that we can actually write down, and it is this one. And what I'm going to tell you is that this vector field is generating the conditional probability path.
So in other words, if you sample from p_0 of this conditional probability path and you follow this conditional vector field, what I am telling you is that at time t you will end up with x_t following your conditional probability path.
Well, that's a huge step forward. No.
So how do you prove that, by the way? You use the vector field, which is just this formula, and you use the conditional probability path, whose density you can write analytically because, if you remember, for the Gaussian distribution we just know everything about it. We also know the density formula.
So if you take both formulas and plug them into the continuity equation, which is here, you can show that it is satisfied, and this is how you prove that.
Now, we're not going to do it because this is just some math.
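For readers who want some evidence without the algebra, the sketch below checks the continuity equation numerically in one dimension. It evaluates d/dt p_t + d/dx (p_t u_t) with central finite differences for the Gaussian path and the field (x1 - x) / (1 - t). The target `X1`, the grid, and the step size `h` are assumptions chosen for illustration; this is a numerical check, not a proof.

```python
import numpy as np

X1 = 1.5  # hypothetical 1D target point (assumption)

def density(x, t):
    """p_t(x | x1): Gaussian with mean t * x1 and standard deviation 1 - t."""
    sigma = 1.0 - t
    return np.exp(-0.5 * ((x - t * X1) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def velocity(x, t):
    """Conditional vector field u_t(x | x1) = (x1 - x) / (1 - t)."""
    return (X1 - x) / (1.0 - t)

def continuity_residual(x, t, h=1e-5):
    """Central finite-difference estimate of d/dt p + d/dx (p * u)."""
    dp_dt = (density(x, t + h) - density(x, t - h)) / (2 * h)
    flux_plus = density(x + h, t) * velocity(x + h, t)
    flux_minus = density(x - h, t) * velocity(x - h, t)
    return dp_dt + (flux_plus - flux_minus) / (2 * h)

xs = np.linspace(-3.0, 3.0, 61)
for t in (0.1, 0.5, 0.8):
    print(t, np.max(np.abs(continuity_residual(xs, t))))
```

The residual should be tiny compared with the individual terms, which are of order one on this grid.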
But maybe one thing I will tell you is why why that vector field makes sense.
So we said that the conditional vector field, which is noted u_t(x | x1), was equal to (x1 - x) / (1 - t).
Well, when t is approaching one and your x is not near x1, what you're seeing is that this quantity goes crazy. It gets very close to infinity, like it goes very, very high. Well, why does that make sense? It is because at time t equal 1, your distribution is literally a deterministic distribution that is x1.
So you want all your particles to be exactly at X1.
So if you're not at X1, then you better hurry. Your velocity will be very very high.
So that's one thing I want to say. And one other thing I want to say is there is a very nice expression of the conditional vector field.
Suppose x_t is drawn from that conditional probability path, which as a reminder is a normal distribution of mean t x1 and variance (1 - t)^2 I.
If x_t is drawn from such a conditional probability path, that means that you can express x_t as follows: x_t = t x1 + (1 - t) x0, with x0 drawn from a standard normal distribution.
So what does that mean for u_t(x_t | x1)?
If I were to just replace x_t in this formula, what I would obtain is u_t(x_t | x1) = (x1 - (t x1 + (1 - t) x0)) / (1 - t).
So x1 - t x1 is (1 - t) x1, and the numerator is (1 - t) x1 - (1 - t) x0, all over 1 - t. So what that means is this is equal to x1 - x0.
The conditional vector field for x_t, which is drawn from the conditional probability path, is nothing else than x1 - x0, which is very nice.
So why is it very nice? Well, it's tractable and it's going to be useful for us.
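The simplification can also be checked numerically. The sketch below builds x_t = t x1 + (1 - t) x0 for random points and compares the general formula (x1 - x_t) / (1 - t) with x1 - x0. The shapes and the cap on t (to avoid dividing by zero at t = 1) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(5, 3))               # hypothetical "clean" points (assumption)
x0 = rng.normal(size=(5, 3))               # noise drawn from N(0, I)
t = rng.uniform(0.0, 0.99, size=(5, 1))    # stay away from t = 1 to avoid dividing by zero

x_t = t * x1 + (1.0 - t) * x0              # sample on the conditional path
u_general = (x1 - x_t) / (1.0 - t)         # conditional vector field evaluated at x_t
u_simple = x1 - x0                         # simplified expression

print(np.max(np.abs(u_general - u_simple)))
```

The two agree up to floating-point error, so the regression target never needs the division by 1 - t.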
Okay.
Any questions?
Yeah.
So the question is where does this equation come from? So this one, the one above, is just an equation that I'm giving you, and you can verify that it actually is generating the conditional probability path. You can prove it, but we're just not doing it. Given that, this part is me telling you that this expression can be even further simplified if the x that you're putting here is actually drawn from the conditional probability path. So as a reminder, the conditional probability path is given by a normal distribution of mean t x1 and variance (1 - t)^2 I.
So if you were to draw from such a distribution, because you know x1, it would be the same thing as sampling x0 from a standard normal Gaussian and writing x_t as follows, because 1 - t is your standard deviation and t x1 is the mean. So this is the mean plus the standard deviation times something that is drawn from the standard normal distribution. And so what I'm telling you is if you were to plug that expression in here, then a lot of simplifications occur and you end up with a very simple expression of the conditional vector field at that point, being equal to just x1 - x0, where x0 is basically the noise component of x_t.
And the follow-up question is, well, why are you telling me this? It's a very fair question. I'm going to tell you in 15 minutes, but does everyone agree with me so far?
Going to take that as a yes.
Okay.
So now we have looked at a very simple case of transporting our distribution from this initial one to a single point. Now our goal is going to be to figure out how we can map that back to our initial problem. So our initial problem is to map from the initial data distribution to the target data distribution.
So an idea to obtain the target data distribution is to replicate this conditional path over all points of your target data distribution.
So if we do that we consider let's say one point of your target data distribution we have this conditional probability path.
We consider a second point, and let's assume here that our target data distribution, our training set, only has two points. Then what that means is that at time t we can consider what we call a marginal probability path. Marginal because we marginalize across the x1s, and we end up with a mixture of Gaussians where each Gaussian is interpolating between the standard normal and its single point.
So let me tell you why I'm telling you all of this.
So just remember that what we care about is to map our initial distribution to the target one.
So there are many ways to go from one to the other. And what we say is that we can just propose this marginal probability path which I'm going to explain in a second.
So what I'm telling you is let's fix x1 and just consider the conditional probability path. So pt of x given x1 which is what we saw right before.
Now let's consider all such conditional probability paths across all the x1s that make up our target data distribution, and let's consider the following expression, which is a natural way of defining the marginal path.
So we notice that p_0(x | x1) is just the standard normal distribution, whatever x1 is, so the marginal p_0(x) is the standard normal as well.
So this is indeed the initial data distribution, the standard Gaussian. And p_1, what is p_1?
So we said that when t = 1, p_t(x | x1) is the Dirac.
So when t = 1, this one is equal to the Dirac at x1, evaluated at x.
So this whole thing simplifies to p_data(x).
So this marginal probability path verifies that at t equals 0 it is equal to your initial data distribution, and p_1 is your target data distribution. So it is a valid path to go from 0 to 1.
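Written out, the marginal path and its two endpoints read as follows. This is a reconstruction from the spoken description, using the conditional path N(t x1, (1 - t)^2 I) defined earlier; the notation is an assumption about what was on the slide.

```latex
p_t(x) = \int p_t(x \mid x_1)\, p_{\text{data}}(x_1)\, dx_1,
\qquad
p_t(x \mid x_1) = \mathcal{N}\big(x;\ t\,x_1,\ (1-t)^2 I\big)

% t = 0: every conditional path starts at the standard Gaussian
p_0(x) = \int \mathcal{N}(x;\ 0,\ I)\, p_{\text{data}}(x_1)\, dx_1 = \mathcal{N}(x;\ 0,\ I)

% t = 1: every conditional path ends at a Dirac located at x_1
p_1(x) = \int \delta_{x_1}(x)\, p_{\text{data}}(x_1)\, dx_1 = p_{\text{data}}(x)
```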
So now our goal because just remember what we want is to go from 0 to one and we want to learn the vector field that is getting us from 0 to one. So the good thing here is we have a probability path that is getting us from 0 to one.
So now let's look at the vector field.
Let's see if there is a way for us to aggregate the conditional vector fields such that it is going to enable us to go from 0 to one. So here we consider the conditional vector field for each data point like conditional on each data point. So this one and this one. And what we want is to also have a marginal vector field. Something that is able to generate the marginal probability path.
And what I'm going to tell you is that the way we want to aggregate the vector field is going to actually be quite intuitive.
So I'm going to just write it down.
So up until now what we're doing is we're constructing things.
We're constructing things and then we're verifying that it's actually what we want. So what I'm saying is we're going to aggregate the conditional vector fields by considering the following aggregation. I'm going to take my conditional vector field u_t(x | x1) and integrate it over x1 with the following weight.
So the weight is the conditional probability path p_t(x | x1), times p_data(x1), over the marginal probability path p_t(x), which we have derived a short while ago. And I'm going to say that this is my definition. This is my proposed definition of the marginal vector field u_t(x).
So this expression may seem scary to you, but if you remember Bayes' rule, you recognize Bayes' rule here, right? Like a little bit. So it's as if, and I'm going to put this in quotes:
So it's as if it was the probability of x1 being where you're headed given x which is where you are.
So the way you're aggregating your conditional vector fields is: okay, I'm at point x, and I'm going to weight each conditional vector field with a weight that is telling me the probability of this x1 being my destination given where I am.
So you can think of this as a posterior mean, and this coefficient can be interpreted as: given where I am, how likely is it that this point is where I want to go? And now I'm going to tell you the biggest result: this marginal vector field which we defined is actually generating the marginal probability path.
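To make the "how likely is this x1 my destination" weights concrete, here is a small sketch for a finite training set with uniform p_data. The two-point dataset, the query point `x`, and the function names are hypothetical; the Gaussian normalizing constants cancel because every conditional path has the same standard deviation 1 - t.

```python
import numpy as np

def destination_posterior(x, t, data):
    """Weights p(x1 | x) proportional to p_t(x | x1) * p_data(x1), with uniform p_data."""
    sigma = 1.0 - t
    sq_dist = np.sum((x[None, :] - t * data) ** 2, axis=1)
    log_w = -0.5 * sq_dist / sigma**2
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()

def marginal_velocity(x, t, data):
    """Posterior-weighted average of the conditional fields (x1 - x) / (1 - t)."""
    w = destination_posterior(x, t, data)
    conditional = (data - x[None, :]) / (1.0 - t)
    return w @ conditional

data = np.array([[2.0, 0.0], [-2.0, 0.0]])   # hypothetical two-point training set
x = np.array([0.5, 0.3])                     # hypothetical current position
for t in (0.0, 0.5, 0.9):
    print(t, destination_posterior(x, t, data).round(3), marginal_velocity(x, t, data).round(3))
```

At t = 0 both destinations are equally likely, so the field points toward the dataset mean. As t grows, the weight concentrates on the destination whose path passes closest to `x`.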
So this is a big deal. So what that means is there is a vector field that is generating the probability path that is actually something that satisfies what we want which is it's a path that goes from the initial to the target.
So I'm looking at the time. I'm not going to derive it, but I'm going to derive something else instead. The way to show that this vector field is generating the probability path is by just plugging these quantities into the continuity equation. So I'm not going to do it. I'm just going to tell you how you would do it.
So you take your marginal probability path, you differentiate it with respect to time, and you recognize that it is indeed minus the divergence of the product of the marginal probability path and the marginal vector field.
So what this is telling us is: if we sample x0 from p_0, which is the initial one, and we follow the marginal vector field, then x_t is going to be following the probability path p_t.
This is huge because if you take x0 and you take it all the way to t equal one with the vector field, then x1 will actually be following p_1, which is what you care about.
So this is huge.
This can be proven mathematically. We're not going to do it given how much time we have.
And so this leads us to our initial problem which is us wanting to learn a vector field that will allow us to go from P 0 to P1.
So here what I'm going to tell you is that optimizing for the flow matching loss that I told you about at the very beginning is equivalent to optimizing for what we call the conditional flow matching loss, which is given here, and I'm going to tell you more about this for the next five minutes.
So let me take this.
So here what we want is to have a tractable loss.
And if you look at the flow matching loss, it is still a loss that is quite complex. So our goal is to really make our loss as tractable as possible.
So as a result, what we're saying is this. The loss of flow matching is the expectation of the squared distance between your learned vector field and the marginal vector field that you constructed.
And what we're saying is that optimizing for the loss above is the same as optimizing for the conditional flow matching loss, which is this one.
So in the flow matching loss, the expectation of this expression is over time steps t and x drawn from the marginal probability path p_t. In the conditional flow matching loss, the expectation is over t, x1, and then x drawn from the conditional probability path.
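For reference, the two losses being compared can be written as below. This is a reconstruction from the spoken description, assuming t is drawn uniformly on [0, 1] and using u_t^theta for the learned field; the last line uses the interpolation x_t = t x1 + (1 - t) x0 from earlier.

```latex
\mathcal{L}_{\text{FM}}(\theta)
  = \mathbb{E}_{t,\; x \sim p_t}
    \big\| u_t^\theta(x) - u_t(x) \big\|^2

\mathcal{L}_{\text{CFM}}(\theta)
  = \mathbb{E}_{t,\; x_1 \sim p_{\text{data}},\; x \sim p_t(\cdot \mid x_1)}
    \big\| u_t^\theta(x) - u_t(x \mid x_1) \big\|^2

% with x_t = t x_1 + (1 - t) x_0 and x_0 ~ N(0, I):
\mathcal{L}_{\text{CFM}}(\theta)
  = \mathbb{E}_{t,\; x_1 \sim p_{\text{data}},\; x_0 \sim \mathcal{N}(0, I)}
    \big\| u_t^\theta(x_t) - (x_1 - x_0) \big\|^2
```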
And back to your question, this whole expression that we had here was u_t(x_t | x1), which was x1 - x0.
So this thing is super simple.
It is just the thing that you want to learn minus (x1 - x0).
Super simple.
Okay. So now why is it equivalent to optimize on the flow matching and the conditional flow matching loss? That's another question. So if you remember for score matching we had a very similar question. So I'm going to give you the same trick.
So let's assume you have the squared distance ||a - b||^2, which is basically what we have here.
And what you want is to optimize these losses with respect to the parameters of the model, theta.
So everything that you care about is only terms that are a function of theta.
So ||a - b||^2 is equal to ||a||^2 - 2 <a, b> + ||b||^2.
So the a term is ||u_t^theta(x)||^2: this and this are the same. Perfect.
So how about this term? The b term is actually not dependent on theta.
So we actually don't care.
So a is the same. b is not a function of theta, so we don't care.
The only thing that is different between these two is the dot product of a and b. So this one and this one.
So what we want to show is that the expectation over t and x of this dot product is the same as the expectation over t, x1 and x drawn from the conditional probability path of that one. We want to show that these two are the same.
So I'm going to do this in two minutes.
So okay, we're going to go from one side and end up with the other. What I'm going to do is start from the dot product of the conditional flow matching loss and end up with the one from the flow matching loss. So what I'm going to compute is the expectation over t, x1 and x of the dot product of u_t^theta(x) and your conditional vector field u_t(x | x1).
So what I'm telling you is, by definition of the expectation,
this is just a bunch of integrals.
So this is the integral over t, x1 and x of the dot product of u_t^theta(x) with the conditional vector field u_t(x | x1), weighted by the density of x1 and x. So that's p_data(x1) times the conditional probability path p_t(x | x1), dx1 dx dt.
So far so good.
So here what we notice is that if I group this quantity with this quantity, and if I divide by p_t(x) and multiply by p_t(x), then what I'm going to recognize is the marginal vector field that we have constructed.
So the integral over x1 is going to move inside the dot product, since u_t^theta(x) does not depend on x1.
So this is the integral over t and over x of the dot product of u_t^theta(x) and u_t(x), which is the marginal vector field, times the marginal probability path p_t(x), dx dt. And this is nothing else than the expectation over t and x drawn from the marginal probability path of the dot product of u_t^theta(x) and the marginal vector field, which is exactly what we want.
So what I'm showing you here is that the marginal vector field the way we constructed it allows us to optimize on a loss the conditional flow matching loss which is much more tractable and leads us to the same gradients with respect to theta.
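The chain of equalities for the cross term can be laid out in one block. This is a reconstruction of the board derivation under the assumption that t is uniform on [0, 1]; the middle line is where the definition of the marginal vector field is recognized.

```latex
\begin{aligned}
\mathbb{E}_{t,\; x_1 \sim p_{\text{data}},\; x \sim p_t(\cdot \mid x_1)}
  \big\langle u_t^\theta(x),\, u_t(x \mid x_1) \big\rangle
&= \int_0^1 \!\! \int \!\! \int
   \big\langle u_t^\theta(x),\, u_t(x \mid x_1) \big\rangle\,
   p_t(x \mid x_1)\, p_{\text{data}}(x_1)\; dx_1\, dx\, dt \\
&= \int_0^1 \!\! \int
   \Big\langle u_t^\theta(x),\,
     \int u_t(x \mid x_1)\,
     \frac{p_t(x \mid x_1)\, p_{\text{data}}(x_1)}{p_t(x)}\; dx_1
   \Big\rangle\, p_t(x)\; dx\, dt \\
&= \int_0^1 \!\! \int
   \big\langle u_t^\theta(x),\, u_t(x) \big\rangle\, p_t(x)\; dx\, dt
 \;=\; \mathbb{E}_{t,\; x \sim p_t}
   \big\langle u_t^\theta(x),\, u_t(x) \big\rangle
\end{aligned}
```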
And so long story short, we have this loss which is called the conditional flow matching loss, which is the expectation over t, over x1 drawn from p_data, and then over x drawn from the conditional probability path, of a quantity that is extremely simple.
So it's just the squared distance between the learned velocity, which is what you want to learn, and the conditional vector field, which we saw was nothing else than just x1 - x0.
So it's a very tractable loss and it does exactly what you want.
Any questions? So it's a lot. Are you all clear with this?
Okay, I'm going to say that it's okay if you feel like it's a lot. It is indeed a lot. So, I'm going to recap the strategy.
So what we said at the very beginning was that we had two different views of how we want to go from one distribution to another distribution.
What we said was that there was this quantity called the vector field, aka the velocity, that was linked to how we could move our particles. So with the ODE, dx/dt equals the vector field, that one, but also the continuity equation that linked the vector field with the probability path.
So what we want is to find the vector field that will allow us to go from p_0 to p_1.
So what you need to do is to first find a probability path that goes from p_0 to p_1.
First you need to find that, and then you need to construct a vector field that indeed allows you to go from p_0 to p_1.
So how do you do that?
Well, you propose some quantities and you try to see if the continuity equation is indeed satisfied.
So that's step number one.
Once you know your vector field of interest, your goal now is to learn it.
Because if you learn it, then what you can do is integrate the ODE.
Why do you want to integrate the ODE?
because you want to go from x0 to x1.
Therefore, you need to learn your vector field.
And here what we did is in order to learn the vector field, we first thought of a very simple case of us wanting to go from the initial to a single point via the conditional probability path and the conditional vector field.
So this is the simple case with the Dirac distribution being the target distribution.
And then what we said was: okay, given this very simple case that is super tractable, let's try to match our initial distribution with the actual target one of interest, which is the whole data distribution, not just a single point. And we proposed the marginal probability path, which is equal to p_0 at t = 0 and to p_data at t = 1. So it was a valid probability path, and we also saw that the constructed vector field was actually generating that proposed marginal probability path.
It was generating it. Why? Because it satisfied the continuity equation, which I said I don't have time to derive, but you just need to plug in the quantities and see that the continuity equation is indeed satisfied.
So given that, we finally had a vector field that was adapted to our use case, but now what we wanted to do was to learn it with a tractable loss.
So in this last part, what we did is show that the loss that we wanted to optimize was actually equivalent to another loss that was extremely tractable, which is the conditional flow matching loss.
And the way we showed that was by showing that the gradients of these losses with respect to theta were equal, which is what I did here.
And then finally we have our final loss function, which is the expectation of ||u_t^theta(x_t) - (x1 - x0)||^2.
So it's incredibly simple. So a lot of what I told you is coming from this paper that's called "Flow Matching for Generative Modeling".
So, I highly recommend reading it. It may not be the simplest to read, but I hope what I just told you actually will allow you to understand the formulas in there a bit better.
But I didn't want to give it to Shervin before telling you a bit more about why we're calling this whole thing flow matching.
Because score matching it was called score matching because you try to match the score.
But here we're not matching the flow.
We're matching the velocity. So one natural question is why is it not called velocity matching? Why is it not called vector field matching? Why is it called flow matching?
So these points are a bit more informal.
It's just a personal take on why that is the case. So one potential reason is historical: these models, which are flow models, meaning they try to map a point to another via this psi_t(x0), had the term flow in their names. In particular, the method that I mentioned that tried to learn the vector field with maximum likelihood estimation was called continuous normalizing flows. So one reason is maybe branding, but actually I want to get back to my earlier Lipschitz continuity comment.
It turns out that if the vector field that you learn here is Lipschitz continuous, then what happens is that your trajectories are all unique.
So you actually have a mapping between vector fields that you learn and the flow.
So if you're matching the velocity, you're also matching the flow.
Now you may be wondering, okay, how do I know that the learned vector field is indeed Lipschitz continuous?
Well, we have not seen that yet. We're going to see that in lecture five.
But the model that is behind this learned vector field is something that is composed of matrix multiplications and some smooth activation functions.
All of that is Lipschitz continuous.
So what you will learn is actually a Lipschitz continuous vector field with respect to x, which is going to give you a one-to-one mapping between velocity and flow. And we have two minutes, so bear with me.
In two minutes I'm going to cover how we train and do inference but it's going to be very easy.
So let's see how we do training. As I mentioned, the loss is extremely simple. What we do is we sample the noise x0 from your standard Gaussian distribution, your clean image x1 from your training set, which is p_data, and you also sample your time step t. Then you construct x_t, your noised image, and thanks to the loss function that we derived,
what we do is we use x_t, the noised image, and t to predict x1 - x0, which is the conditional vector field, via this model that is your vector field model, and you compute the loss and backpropagate.
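The training recipe can be sketched end to end on a toy problem. Everything below is an illustrative assumption rather than the course implementation: the two-point training set, the learning rate, and in particular the model, which is a linear map on the features [x_t, t, 1] with a hand-written gradient so that the example runs with NumPy only. A real model would be a neural network trained with automatic differentiation.

```python
import numpy as np

def sample_cfm_batch(data, batch_size, rng):
    """Build (x_t, t, target) with target = x1 - x0, the conditional vector field."""
    x1 = data[rng.integers(0, len(data), size=batch_size)]   # clean samples from the training set
    x0 = rng.normal(size=x1.shape)                           # noise from N(0, I)
    t = rng.uniform(size=(batch_size, 1))                    # time steps in [0, 1)
    x_t = t * x1 + (1.0 - t) * x0                            # noised sample
    return x_t, t, x1 - x0

def features(x_t, t):
    """Inputs of the toy model: [x_t, t, 1]."""
    return np.concatenate([x_t, t, np.ones_like(t)], axis=1)

def cfm_loss(W, x_t, t, target):
    """Mean squared distance between the predicted velocity and x1 - x0."""
    residual = features(x_t, t) @ W - target
    return np.mean(np.sum(residual**2, axis=1))

def train(data, steps=2000, batch_size=256, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    W = np.zeros((dim + 2, dim))     # toy linear model, stands in for a neural network
    for _ in range(steps):
        x_t, t, target = sample_cfm_batch(data, batch_size, rng)
        phi = features(x_t, t)
        grad = 2.0 * phi.T @ (phi @ W - target) / batch_size  # analytic gradient of the batch loss
        W -= lr * grad                                        # plain gradient descent step
    return W

data = np.array([[2.0, 0.0], [-2.0, 0.0]])   # hypothetical two-point training set
W = train(data)
eval_batch = sample_cfm_batch(data, 20_000, np.random.default_rng(1))
print("loss before training:", cfm_loss(np.zeros_like(W), *eval_batch))
print("loss after training: ", cfm_loss(W, *eval_batch))
```

The loss cannot reach zero, because the target x1 - x0 is not a function of (x_t, t) alone. What the regression recovers is the average target at each (x_t, t), within the limits of whatever model class is used.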
That's how you do training. Now how do you do inference? Well, at inference you just sample from an easy distribution. And here it's again the standard normal distribution.
And what you're going to do is to numerically solve the ODE.
So what you do is you go time step to time step and you use your learned vector field and you end up with your final estimated image X1.
And that's how you do inference. And with that, I'm going to give it to Shervin back.
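A minimal sketch of that inference loop with explicit Euler steps is shown below. The `velocity` argument would be the trained model; here, as an assumption to keep the example self-contained, it is replaced by the exact conditional field toward a single hypothetical point `x1`.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                      # velocity is evaluated at the start of each step
        x = x + dt * velocity(x, t)
    return x

# Stand-in for a trained model (assumption): the exact conditional field toward one point.
x1 = np.array([2.0, -1.0])

def conditional_field(x, t):
    return (x1 - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 2))            # samples from the standard normal
for n_steps in (1, 10, 100):
    error = np.max(np.abs(euler_sample(conditional_field, x0, n_steps) - x1))
    print(n_steps, error)
```

Because these conditional trajectories are straight lines with constant velocity x1 - x0, Euler lands on `x1` for any number of steps, including one. The last evaluation happens at t = 1 - 1/n_steps, so the division by 1 - t never hits zero.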
So now we're going to see in a more visual manner the training process and discuss some edge cases and how to solve them in practice. So you remember the training process that Afin just mentioned.
You sample X0 from the normal distribution. You sample x1 from your training set and what you do is that you draw a straight line between the two and you try to fit the velocity to this x1 minus x0.
Now I want you to imagine that we do that for a given pair and now for another pair, and suppose that these pairings intersect at a given point, and let's suppose that this intersection is at the same t. So in other words, we're trying to fit the velocity at the same x and the same t to two different vectors.
So who knows what the resulting learned velocity will be.
So it cannot be two velocities, right?
We cannot have two vectors at a single point. It has to be one.
So we have this MSE loss that will make it such that it will be an average of all the vectors that we try to fit at that given point x and at that given time t. Do you see that as being a problem, or what does that mean?
So here in practice, let's say you're at inference time and you want to go from let's say x0 to x1 and let's say you have learned a vector field of velocity that doesn't go to x1 that goes in an average direction. Are you going to be able to go to x1?
No. Right?
So the thing that will occur is that you're going to have something averaged that might push the trajectories that you try to learn into different directions.
But be reassured: we showed that this learned vector field is such that if you sample a point from the normal distribution and follow the learned vector field, you land in a distribution that will be your distribution p_data.
So that part is fine. But still you have this property that the learned trajectories are not going to be the same as the ones that you teach your model. And that is a source of what I would call complexity, where you teach a model that doesn't follow what you said at test time. So your intuition could make you feel that this is not efficient.
And the second thing that I want to say is that you could have this phenomenon happening even though the points of intersection are not at the exact same t, because in practice you're going to learn this vector field with a neural network, and neural networks are a class of functions that smooth things in space. So you won't have a drastic change of velocity for neighboring t's. Because one thing that you could have told me is: but Shervin, that is never going to happen, [0, 1] is an interval with infinitely many numbers, you're never going to draw the same t. And you would be right, and I'm just saying that it could be valid for neighboring t's.
And more importantly, let's say you sample your point x0 out of your normal distribution. What is going to be your issue here? I've mentioned that you want to integrate the ODE at inference time, and we had seen last time that you want to find methods that minimize the number of function evaluations, the NFEs, in such a way that you can find your target sample in a minimum amount of steps. When you have a path that is curved, so you don't have these straight lines at inference time, you're not going to be able to solve that in a single step. Let's say you use Euler. Euler is estimating the velocity at the beginning point, and if you try to do it in a single step, it will end up somewhere else. It will follow a straight path, and the curve isn't going to be taken into account.
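The averaging effect and the cost of curved paths can both be seen on a toy case. The sketch below uses the exact marginal vector field for a hypothetical two-point training set (uniform p_data) as a stand-in for a perfectly trained model, then compares one Euler step with 200 Euler steps. The dataset, sample count, and step counts are assumptions for illustration.

```python
import numpy as np

def marginal_velocity(x, t, data):
    """Exact marginal field for a uniform empirical p_data: posterior mean minus x, over 1 - t."""
    sigma = 1.0 - t
    diff = x[:, None, :] - t * data[None, :, :]               # (batch, n_data, dim)
    log_w = -0.5 * np.sum(diff**2, axis=-1) / sigma**2
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))      # stable softmax over destinations
    w /= w.sum(axis=1, keepdims=True)
    return (w @ data - x) / sigma

def euler(x0, data, n_steps):
    """Explicit Euler integration of the marginal field from t = 0 to t = 1."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * marginal_velocity(x, k * dt, data)
    return x

def distance_to_nearest(x, data):
    return np.min(np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1), axis=1)

data = np.array([[2.0, 0.0], [-2.0, 0.0]])    # hypothetical two-point training set
x0 = np.random.default_rng(0).normal(size=(512, 2))
one_step = euler(x0, data, 1)
many_steps = euler(x0, data, 200)
print("1 step, max distance to nearest data point:   ", distance_to_nearest(one_step, data).max())
print("200 steps, max distance to nearest data point:", distance_to_nearest(many_steps, data).max())
```

At t = 0 every destination is equally likely, so the velocity is the average direction and a single Euler step sends every sample to the dataset mean, which is not a training point. With many small steps the trajectories bend toward one of the two points, which is why curved paths need more function evaluations.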
Um, and I know you're going to tell me, but Shervin, you have tricks in your pocket. Last time you talked about a DPM solver in a different setting that was able to reframe things in an exact way.
Well, no. This time the ODE that we want to integrate has no linear component.
It's all nonlinear because here we're directly learning that velocity field.
So no DPM solver here. Uh you need to use traditional um sampler techniques.
Okay. So, not quite happy. We're happy, but not quite. So now I'm going to walk you through the intuition of a retraining or fine-tuning procedure that is called reflow.
And so first I'm going to recap the training procedure that we have already.
So you sample a point x0, you sample a point x1, you draw a path and then you fit a velocity on it. In that paper, this is called a 1-rectified flow model.
And you obtain trajectories that we saw were going to be curved.
And one idea that we have here, one temptation, is to locate these new pairings. So you started from a point x0, you integrated the ODE, and you ended up somewhere in the distribution p_data. You're tempted to use these new pairs as your new pairings because, by construction, the velocity field will carry the trajectory directly there. Right?
So the reflow procedure says okay let's take these new pairs as those that we use to retrain the model.
So these would be the new lines that we fit our new model on, and this is the reflow step. You do that process again and again until your lines become straighter, and once that is the case, at inference time you're able to use a limited amount of inference steps in order to get your sample in the p_data space.
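One reflow round can be sketched as follows in one dimension. As a loudly stated assumption, the trained first model is replaced by the exact marginal field for a hypothetical training set {-A, +A}, which has the closed form used below. The sketch only builds the new pairs and the new regression targets; retraining itself would reuse the same loss as before on these pairs.

```python
import numpy as np

A = 2.0  # hypothetical 1D training set {-A, +A} (assumption)

def first_flow_velocity(x, t):
    """Stand-in for the trained first model: exact marginal field for data {-A, +A}."""
    sigma = 1.0 - t
    posterior_mean = A * np.tanh(A * t * x / sigma**2)   # expected destination given x at time t
    return (posterior_mean - x) / sigma

def generate_pairs(x0, n_steps=200):
    """Integrate the ODE with Euler to pair each noise sample with its generated endpoint."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * first_flow_velocity(x, k * dt)
    return x0, x

rng = np.random.default_rng(0)
x0, x1_hat = generate_pairs(rng.normal(size=1000))

# Reflow training data: same interpolation and target as before, but on the new pairs.
t = rng.uniform(size=x0.shape)
x_t = t * x1_hat + (1.0 - t) * x0
target = x1_hat - x0

print("endpoints land on the training set:", np.allclose(np.abs(x1_hat), A, atol=1e-3))
print("each x0 is paired with the point on its own side:", np.all(np.sign(x1_hat) == np.sign(x0)))
```

In the original training, a given x0 could be paired with either data point, so the straight lines crossed. After this round each x0 is paired with the endpoint its own trajectory reached, so in this toy case the new lines no longer cross.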
So, does this procedure make sense? Yep.
So, the comment is: if you retrain, then the two points are going to be from the new distribution. So here is the magic in all of this. First, you're going to keep that Gaussian distribution at the beginning; that will not change. And at the end, we're going to see that there are some properties that make it such that the resulting points that you get still preserve the distribution p_data.
So there is some theoretical justification that it does follow p_data, but this is a great point. We're going to discuss that in more detail in two slides. Yeah.
Any other questions or comments?
Okay. So, you're right to be skeptical.
So, this works because of two things.
So first, actually one slide. We need to show that this follows the distribution p_data. I'm not going to give you the whole proof, just the gist of it. You start by defining this random variable that starts at x0, you integrate the ODE to end up in the new space, and you define this velocity for the x random variable. The proof that the law you get for the end distribution after a reflow procedure is still p_data goes through showing that these two probability flows respect the same continuity equation. With some uniqueness theorem and the shared initial condition, the fact that p_t at t equals zero is still the Gaussian distribution, you can show that it's valid for all t and in particular t equals 1. So yeah, just handwaving the proof here. So it's theoretically the case, but you are right to be skeptical, because once you integrate the ODE, you have errors that come in from two sources. So the first error is the discretization error.
So as we said, we're sampling these new points. So maybe you have some errors in the way you do these steps, so you don't necessarily end up in the theoretical location you should be at. And the second point is that you're operating off of a learned vector field that is approximated by a neural network, and it is possible that you haven't learned it perfectly. I mean, of course, it's certain that you haven't learned it perfectly.
And the quality of your results will be a function of how well you learned that vector field and of your discretization errors. And we're going to see very soon that this is one of the reasons why we don't perform this reflow procedure too many times, because the degradation is going to be a trade-off to balance here.
Okay, so this is the first point I wanted to justify. The second point I wanted to justify is that we handwaved the fact that the flows, like the trajectories, were getting straighter. There is a mathematical way of writing it down and proving it, and this is also something that is proven in the paper, in case you're interested. In the interest of time I'm not going to cover it, but that S quantity quantifies straightness by comparing the actual velocity with the theoretical one, and we show that it decays at a good speed.
And this is why we're saying it's getting straighter.
And exactly as we discussed, the reflow procedure is one that you wouldn't want to do too many times, but usually one procedure is already great. And the goal is for us to use very simple sampling techniques such as Euler, optimally finding the right solution in just a few steps.
And of course it isn't a free lunch.
It comes at a trade-off. So you're trading off simplicity and potentially lower quality for faster inference time speeds.
Okay, great. So in these past three lectures, including this one, we've come across these concepts of diffusion, score matching and now flow matching, and I want to spend five minutes together to tie all of these frameworks together in a way that will hopefully show links between them and the fact that we are solving the same category of problems but through different lenses.
And in order to do that, I'm going to look at each of the processes that we defined and then look at the difference with respect to each method.
So the first one is the forward process.
So it's a long time ago already, two weeks ago, that we talked about DDPM and this whole class of discrete-time diffusion.
And then we saw in the score-based diffusion world that if you reframed these equations and converted them to continuous time, taking delta t to zero, you could get an SDE formulation that was equivalent.
We also saw in the DPM-Solver section that in the diffusion setting you could always assume that f(x, t) was linear in x. Because of that, and this is something we haven't seen so far but I'm telling you, you have a nice closed-form expression for x_t with respect to f and g. And we saw today that in flow matching you can write down X_t as an interpolation between X_0 and X_1. And now we're going to see that in all of these methods, you start from a clean sample.
So in diffusion, the clean sample is at t equals 0. But in flow matching, we saw that the notation was reversed: the clean sample is at t equals 1.
And on the other side of the spectrum you have noise, which in the diffusion world was at t equals big T, and here it's at t equals zero. So in each of these cases you're drawing an interpolation between the two.
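The two time conventions can be put side by side in a short sketch. The function names are made up for illustration, and `alpha_bar_t` stands for the usual cumulative DDPM coefficient (a notation assumed here, since the recap only mentions "some coefficient"). Scalars are used to keep the example dependency-free.

```python
import math


def ddpm_forward(x_clean, eps, alpha_bar_t):
    """Diffusion convention: clean at t=0 (alpha_bar=1), noise at t=T (alpha_bar near 0)."""
    return math.sqrt(alpha_bar_t) * x_clean + math.sqrt(1.0 - alpha_bar_t) * eps


def flow_matching_forward(x_noise, x_clean, t):
    """Flow matching convention: noise at t=0, clean at t=1, linear in between."""
    return (1.0 - t) * x_noise + t * x_clean


x_clean, noise = 3.0, -1.0  # hypothetical scalar "image" and noise draw

# Endpoints of each process: same two objects, opposite ends of the time axis.
ddpm_start = ddpm_forward(x_clean, noise, alpha_bar_t=1.0)  # clean sample
ddpm_end = ddpm_forward(x_clean, noise, alpha_bar_t=0.0)    # pure noise
flow_start = flow_matching_forward(noise, x_clean, t=0.0)   # pure noise
flow_end = flow_matching_forward(noise, x_clean, t=1.0)     # clean sample
```

Both functions interpolate between the same clean sample and the same noise draw; only the direction of time and the shape of the coefficients differ.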
Now when we look at the loss functions, we saw that the diffusion view of writing things down was focused on noise. The score matching view was focused on scores, and you had a nice relationship between noise and score. And in the flow matching world you were directly fitting the velocity field, such that you have the simplest ODE to integrate in the end.
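Written side by side, the three objectives can be sketched as follows. The notation is an assumption chosen for this summary, not taken from the slides: in the first two lines x_t = \alpha_t x_{\text{clean}} + \sigma_t \epsilon with \epsilon Gaussian, \lambda(t) is an unspecified positive weighting, and in the last line x_t = (1 - t) x_0 + t x_1 with x_0 noise and x_1 data.

```latex
\begin{aligned}
\mathcal{L}_{\text{noise}}    &= \mathbb{E}\,\bigl\| \epsilon_\theta(x_t, t) - \epsilon \bigr\|^2 \\
\mathcal{L}_{\text{score}}    &= \mathbb{E}\,\lambda(t)\,\bigl\| s_\theta(x_t, t) + \epsilon / \sigma_t \bigr\|^2 \\
\mathcal{L}_{\text{velocity}} &= \mathbb{E}\,\bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2
\end{aligned}
```

The second line uses the fact that the conditional score of the noised sample is -\epsilon / \sigma_t, which is the relationship between noise and score mentioned above.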
Now when we look at the generation process, we saw in the diffusion world that you had all these steps to go through with some stochasticity, which you also had in the score-based diffusion world.
And for the first two you had a deterministic counterpart. And you will notice that the flow matching one doesn't have such a pairing because it's deterministic to start with.
And I'm just showing you some relationships between all these methods.
But actually there is a whole theory linking these concepts in a very nice way. It's out of scope for this class, but if you're interested, I recommend taking a look at the paper called Stochastic Interpolants, which writes down equations that link all of these together. It turns out that out of noise, score and velocity, if you know two of them, you can deduce the third one. And that paper designs a problem space that is bigger than what each of these frameworks covers, and all of these frameworks can live in the same unified framework.
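As a small sanity check of how noise, score and velocity connect, here is a sketch for one special case only: the linear interpolation x_t = (1 - t) * noise + t * data from this lecture, with 1-D Gaussian noise N(0, 1) and a hypothetical Gaussian data distribution N(M, S^2), so that all three quantities have closed forms. The conversion formulas below are worked out for this case and are not the general statement of the Stochastic Interpolants paper.

```python
# Assumed toy setup: x_t = (1 - t) * eps + t * x_data, eps ~ N(0, 1), x_data ~ N(M, S^2).
M, S = 2.0, 0.5  # hypothetical data mean and standard deviation


def variance(t):
    """Variance of x_t under the assumed Gaussians."""
    return (1.0 - t) ** 2 + (t * S) ** 2


def expected_noise(x, t):
    """Closed-form E[eps | x_t = x]: what an ideal noise predictor would output."""
    return (1.0 - t) * (x - t * M) / variance(t)


def score(x, t):
    """Closed-form score, the gradient of log p_t(x) for p_t = N(t * M, variance(t))."""
    return -(x - t * M) / variance(t)


def velocity(x, t):
    """Closed-form marginal velocity E[x_data - eps | x_t = x]."""
    return M + (t * S ** 2 - (1.0 - t)) / variance(t) * (x - t * M)


def score_from_noise(eps_hat, t):
    """Convert a noise prediction to a score; valid for 0 <= t < 1."""
    return -eps_hat / (1.0 - t)


def velocity_from_noise(x, eps_hat, t):
    """Convert a noise prediction to a velocity; valid for 0 < t <= 1."""
    return (x - eps_hat) / t


x, t = 0.7, 0.3  # arbitrary query point
eps_hat = expected_noise(x, t)
score_gap = abs(score_from_noise(eps_hat, t) - score(x, t))
velocity_gap = abs(velocity_from_noise(x, eps_hat, t) - velocity(x, t))
```

Both gaps come out at floating-point precision, so in this toy case the converted quantities agree with the directly computed ones.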
And with that, I hope you have a great weekend.