Starmer masterfully deconstructs complex statistical frameworks into intuitive visual narratives that prioritize conceptual clarity over rote memorization. It is a definitive example of how true mastery lies in making the sophisticated seem remarkably simple.
The Essence of Linear Regression
Fit a line to your data, then figure out if it's any good at making predictions.
Hooray!
StatQuest.
Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about the essence of linear regression.
This StatQuest is brought to you by the letters A, B, and C: Always Be Curious.
Always be curious.
Oh no, I hate to do this, but this video is going to start with a shameless self-promotion.
If you want to master statistics, check out the StatQuest Illustrated Guide to Statistics. It's over 300 pages of awesomeness, and it includes hands-on tutorials in Python and R. Bam.
Now imagine we worked in the corporate offices of Spend and Save food stores and our boss asked us if we should build three new stores. To make matters worse, our boss also asked us to quantify our confidence in the decision we make. In other words, if we decide to build three new stores, how much confidence do we have that we made a good decision? If we're very confident, then we can start building right now. Otherwise, we might want to wait until we are more confident that it's a good idea to build three new stores. Hey Josh, how can we make this decision and quantify our confidence in it? Great question, Squatch. One way we can make this decision and quantify our confidence in it is to use linear regression.
Now, in order to learn about linear regression and how it can help us make decisions and quantify our confidence in them, let's imagine we counted the number of stores for three companies and calculated their revenue. To make it easier to talk about the individual data points, this point represents a company owned by Squatch.
This point represents a company owned by Norm.
And this point represents a company owned by our new friend, the Gamma Monster, who goes by Gamma. Roughly speaking, it seems like the more stores a company has, the more revenue it makes. This means that we might be able to use the number of stores a company has to predict their revenue. However, the trend is not perfect.
For example, Norm's company, which has the most revenue, has fewer stores than Gamma's.
However, even with an imperfect trend, one way we can predict revenue with the number of stores is to fit a line to the data points. Unfortunately, it's not super obvious how we should fit a line to these data points. For example, things start out well if we draw a line that goes through the points for Norm and Squatch.
That's because, given Norm's number of stores, 12, the y-axis coordinate on the line, which is the revenue value that the line predicts, corresponds exactly with Norm's revenue, 12.5.
Likewise, given Squatch's number of stores, two, the y-axis coordinate on the line, the predicted value, corresponds exactly with Squatch's revenue, three. However, the predicted value for Gamma's revenue is very different from the known value. Gamma's actual revenue is seven, but the y-axis coordinate on the line, the predicted value, is 15.3.
In other words, the line makes a pretty terrible revenue prediction for Gamma's company. Similarly, drawing a line through the points for Squatch and Gamma results in revenue predictions that are identical to the observed values for Squatch and Gamma. But for Norm, the y-axis value on the line is much, much lower than the observed value.
And a line that goes through the points for Norm and Gamma results in a predicted value for Squatch that is so high it goes off the screen.
In other words, compared to the observed revenue value for Squatch, the prediction is absolutely terrible.
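The three two-point lines described above can be checked with a little arithmetic. This is a minimal sketch, assuming the store counts and revenues quoted in the transcript (Squatch: 2 stores, revenue 3; Norm: 12 stores, revenue 12.5; Gamma: 15 stores, revenue 7). The helper names `line_through` and `left_out_prediction` are made up for this illustration.

```python
# (number of stores, revenue) as quoted in the transcript.
companies = {
    "Squatch": (2, 3.0),
    "Norm": (12, 12.5),
    "Gamma": (15, 7.0),
}

def line_through(p, q):
    """Return (slope, intercept) of the straight line through points p and q."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    intercept = p[1] - slope * p[0]
    return slope, intercept

def left_out_prediction(a, b, c):
    """Draw a line through companies a and b, then predict revenue for company c."""
    slope, intercept = line_through(companies[a], companies[b])
    stores_c = companies[c][0]
    return intercept + slope * stores_c

for a, b, c in [("Squatch", "Norm", "Gamma"),
                ("Squatch", "Gamma", "Norm"),
                ("Norm", "Gamma", "Squatch")]:
    predicted = left_out_prediction(a, b, c)
    observed = companies[c][1]
    print(f"Line through {a} and {b}: predicts {predicted:.2f} for {c}, observed {observed}")
```

Each line is perfect for the two points it passes through and far off for the third, which is why a way to score lines is needed.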
Alternatively, we could try drawing lines that go between the points, but it's not obvious which line would make the best predictions.
To decide which line makes the best predictions, we need a way to quantify the quality of the predictions.
One popular way to quantify the quality of the predictions made by a line is to calculate something called the sum of the squared residuals.
As the name implies, we start by calculating residuals, which are the differences between the observed values and the values predicted by a line.
For example, given this line, the observed revenue for Gamma's company is seven units and the revenue predicted by the line is 11.5.
So the residual, the difference between the observed and predicted revenue values is -4.5.
And we can illustrate the residual for Gamma's company by drawing a vertical line between the data point and the line.
Note, a lot of people ask why the residual is parallel to the y-axis instead of being perpendicular to the line like this. The reason the residual is parallel to the y-axis is that it allows both the observed and predicted revenue values to correspond to the same value on the x-axis.
In other words, both the observed and predicted revenue values correspond to the same number of stores, 15.
Anyway, we can also calculate the residual for Norm's company as the observed revenue value, 12.5, minus the predicted revenue value, 10.5, and that gives us two.
Lastly, the residual for Squatch's company is -3.
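As a small sketch of the definition, a residual is the observed value minus the predicted value. The numbers below are the observed and predicted revenues stated in the transcript for Gamma and Norm. Squatch is left out because the transcript gives only the residual (-3), not the predicted value. The function name `residual` is just for illustration.

```python
def residual(observed, predicted):
    """Residual = observed value minus the value predicted by the line."""
    return observed - predicted

gamma_residual = residual(observed=7.0, predicted=11.5)
norm_residual = residual(observed=12.5, predicted=10.5)

print("Gamma residual:", gamma_residual)
print("Norm residual:", norm_residual)
```

A negative residual means the line predicted too much revenue; a positive one means it predicted too little.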
Now remember, the whole point of calculating the residuals was to give us a sense of the overall quality or accuracy of the revenue predictions made with the line.
If we just add the residuals together, we get -5.5, but we also run into a problem.
To illustrate this problem, let's compare the residuals associated with this blue line to the residuals associated with this black line.
Note, to be clear, both graphs show the exact same data points. The only differences are the blue and black lines and their corresponding residuals.
Specifically, every single residual for the blue line is smaller than the corresponding residual for the black line. So, intuitively, since every single residual associated with the black line is larger than the corresponding residual for the blue line, we would think that the black line should have a worse score than the blue line. However, if we just add up the residuals for the black line, we get the exact same sum of residuals we got for the blue line, -5.5.
In other words, even though the black line fits the data worse than the blue line, both have the same sum of residuals.
Josh, how is it possible that both lines have the same sum of residuals?
The reason that both lines can have the same sum of residuals is that negative residuals can be cancelled out by positive residuals.
Because adding negative and positive residuals together can cause them to cancel each other out, we need to make all of the residuals either positive or negative. And since making them all positive is easier, that's what we'll do. Now, usually when we want to convert a negative number to a positive number, we just use the absolute value function.
However, later on, using the absolute value function will make the math a lot harder. So, instead of taking the absolute values of the residuals to make them positive, we square them. Squaring the residuals ensures that all of them will be positive and later on the math will be easier.
Anyway, now when we sum the squared residuals for the blue line, we get 33.25.
And when we sum the squared residuals for the black line, we get 355.125.
And since the blue line has a smaller sum of squared residuals, we now know that the black line has larger residuals and thus the blue line makes better predictions than the black line. And that means we can use the sum of the squared residuals, the SSR, to quantify and compare how well different lines fit the data. Bam.
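The cancellation problem and the squared fix can be sketched in a few lines. The blue-line residuals are the ones stated in the transcript (-4.5, 2, and -3). The transcript does not list the black line's individual residuals, so the second set below is purely hypothetical, chosen only so that it has the same plain sum of -5.5.

```python
def sum_of_residuals(residuals):
    return sum(residuals)

def sum_of_squared_residuals(residuals):
    """SSR: square each residual so nothing cancels, then add them up."""
    return sum(r ** 2 for r in residuals)

blue_residuals = [-4.5, 2.0, -3.0]           # from the transcript
hypothetical_worse = [-10.5, 15.0, -10.0]    # made up: larger residuals, same plain sum

for name, res in [("blue", blue_residuals), ("hypothetical worse line", hypothetical_worse)]:
    print(name, "sum:", sum_of_residuals(res), "SSR:", sum_of_squared_residuals(res))
```

Both sets sum to -5.5, but only the SSR separates the better fit from the worse one.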
Now that we have a way to compare the predictions from two different lines, let's learn how we can use the sum of the squared residuals to find the best fitting line for the original data set.
We'll start by calculating the sum of the squared residuals for this blue line.
First, we calculate the residuals.
Then, we square them. Then, we add them together to get 133.2.
Now we can plot this sum of squared residuals on a graph that has the sum of squared residuals on the y-axis and different values for the y-axis intercept for the blue line on the x-axis.
Now let's increase the y-axis intercept for the blue line and plot the sum of the squared residuals on our new graph.
Now we can see that the new blue line with a larger value for the y-axis intercept has a lower sum of squared residuals than the old one. This result is encouraging because the lower sum of squared residuals means the new blue line fits the data points better than before.
Now let's increase the y-axis intercept for the blue line a little bit more and plot the sum of squared residuals.
This new y-axis intercept value results in an even lower sum of squared residuals than before. And thus, this y-axis intercept value results in the best fitting blue line yet. Now, let's increase the y-axis intercept for the blue line a little bit more and plot the sum of squared residuals.
Now we see that the new value for the y-axis intercept results in an increase in the sum of squared residuals compared to the last value. In other words, this y-axis intercept results in a blue line that fits the data a little worse than before.
Now let's increase the y-axis intercept for the blue line one more time and plot the sum of squared residuals.
Now, the sum of squared residuals is even larger than before.
And that means this y-axis intercept results in a blue line that fits the data even worse than before.
Now, if we don't change the slope and calculate the sum of squared residuals for all possible values for the y-axis intercept, then we would end up with a curve that looks like this. In this case, one goal of linear regression would be to find the y-axis intercept that corresponds to the lowest sum of squared residuals at the bottom of this curve.
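The intercept sweep can be imitated numerically. This sketch assumes the three data points quoted earlier in the transcript and a fixed slope of 0.5. The video does not state the slope used while sliding the intercept, so the SSR values here will not necessarily match the 133.2 mentioned above. The grid of intercepts from 0.0 to 6.0 is also an arbitrary choice for illustration.

```python
# (number of stores, revenue) as quoted in the transcript.
data = [(2, 3.0), (12, 12.5), (15, 7.0)]

FIXED_SLOPE = 0.5  # assumption: the transcript does not give the slope used here

def ssr(intercept, slope, points):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

# Try intercepts 0.0, 0.1, ..., 6.0 and record the SSR for each one.
intercepts = [i / 10 for i in range(61)]
curve = [(b, ssr(b, FIXED_SLOPE, data)) for b in intercepts]

best_intercept, best_ssr = min(curve, key=lambda pair: pair[1])
print(f"lowest SSR on the grid: {best_ssr:.2f} at intercept {best_intercept}")
```

Plotting `curve` would give the bowl shape described above: the SSR falls as the intercept approaches the best value and rises again after it.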
One way to find the lowest point on this curve is to calculate its derivative and solve for where the derivative is equal to zero at the bottom of the curve.
Note, one reason we square the residuals instead of taking the absolute value is that squaring makes it easier to calculate the derivative.
Also note, if you're not familiar with the concept of a derivative, don't sweat it. All you really need to know is that we can find the y-axis intercept that minimizes the sum of the squared residuals.
Anyway, solving for this derivative results in an analytical solution, meaning we end up with a relatively complicated looking formula that we can plug our data into, and the output is the optimal y-axis intercept.
However, unless you're taking some sort of insane statistics class, don't bother trying to remember this formula because you will most likely never ever have to plug numbers into it and do the math.
Instead, what's important is to remember that this formula exists and that when you do linear regression, a computer will plug in the data and do the math for you. Bam.
Note, so far we have only illustrated how to find the optimal y-axis intercept for the blue line.
However, the same process can be used to find the optimal slope for the blue line as well. We solve for this derivative and that gives us the slope that minimizes the sum of the squared residuals.
And just like before, solving this derivative results in an analytical solution. In this case, we end up with a formula that's even more complicated looking than the last one. But again, don't bother memorizing this formula.
What is important is knowing that it exists and that a computer can deal with it for you. Oh no, it's the dreaded terminology alert. The method for finding the line that minimizes the sum of the squared residuals is called least squares. Bam.
Anyway, when we use least squares to find the line that minimizes the sum of the squared residuals, we end up with this blue line fit to the data. In this example, the y-axis intercept that minimizes the sum of squared residuals is three. And the slope that minimizes the sum of squared residuals is 0.5.
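For readers who want to see the computer's side of this, here is a minimal sketch using the standard closed-form expressions for simple linear regression (slope equals the sum of cross-deviations divided by the sum of squared x-deviations, and intercept equals mean of y minus slope times mean of x). These may not be written the same way as the formulas shown in the video, and the data are the three points quoted in the transcript.

```python
# (number of stores, revenue) as quoted in the transcript.
data = [(2, 3.0), (12, 12.5), (15, 7.0)]

def least_squares(points):
    """Return (intercept, slope) of the line that minimizes the SSR."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

intercept, slope = least_squares(data)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The unrounded results are close to, but not exactly, the 3 and 0.5 quoted above, which suggests the video's values are rounded.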
The blue line corresponds to an equation that uses number of stores to predict revenue.
Now, remember way back at the start of this quest, our boss asked us if we should build three new Spend and Save stores.
Well, now that we have this blue line and its corresponding equation, we can start to answer that question. For example, if we already have four stores and we built three more, then we'll have seven stores.
And we can predict how much revenue seven stores will generate by plugging seven into our equation to get the predicted revenue 6.5.
In other words, if we increase the number of stores to 7, then the corresponding y-axis coordinate for the blue line tells us that our revenue should increase to 6.5.
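The prediction step is just plugging a store count into the fitted equation. A minimal sketch, using the intercept (3) and slope (0.5) stated in the transcript; the function name `predict_revenue` is made up for this example.

```python
INTERCEPT = 3.0  # y-axis intercept from the transcript
SLOPE = 0.5      # slope from the transcript

def predict_revenue(stores):
    """Predicted revenue = intercept + slope * number of stores."""
    return INTERCEPT + SLOPE * stores

current_stores = 4
new_stores = 3
print(predict_revenue(current_stores + new_stores))
```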
And knowing the predicted revenue can help us decide if it's a good idea to build three more stores.
Bam. Sort of. Yes, it's cool that we can predict revenue, but how much confidence can we have in these predictions?
In other words, if we don't have a lot of confidence in the predicted revenue, then we might not want to make any major decisions based on it. So now let's talk about how we can quantify our confidence in the predictions.
Quantifying our confidence in our predictions requires us to quantify two things. One, the accuracy of the predictions and two, the probability that random chance could give us similar or even better predictions.
For example, if we know that a prediction is not very accurate, then we can't have a lot of confidence in it.
Likewise, if random chance can give us similar predictions or even better predictions, then again, we can't have a lot of confidence in the predictions we make. So, let's learn about how to quantify these things. And we'll start by learning how to quantify the accuracy of the predictions.
One way we can get a sense of the accuracy of the predictions is to compare the sum of the squared residuals around the blue line to the sum of the squared residuals around the mean revenue value. If the sum of the squared residuals around the blue line is a lot smaller than the sum of the squared residuals around the mean, then we have reason to believe that predictions made with the blue line, which takes the number of stores into account, will be more accurate than predictions made with the mean revenue, which ignores the number of stores.
In other words, we can quantify the quality of the predictions made with the number of stores by comparing them to predictions made without the number of stores. For linear regression and a lot of other things, we can quantify the difference in predictions with something called R squared. In this example, R squared tells us by what percentage the sum of the squared residuals around the mean decreases when we use the blue line to make predictions.
Josh, why are we comparing the blue line to the mean and not some other line? The reason we compare the blue line to the mean is that the mean minimizes the sum of squared residuals when we make predictions without the number of stores. In other words, if we used any other horizontal line to predict revenue like this one, then the sum of squared residuals around that other horizontal line would be greater than the sum of squared residuals around the mean.
Lastly, we're using a horizontal line to make these predictions because we're making them without taking the number of stores into consideration.
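The claim that the mean beats every other horizontal line can be spot-checked. This sketch uses the three revenues quoted in the transcript and compares the mean against a handful of arbitrarily chosen alternative heights. It checks a few cases and is not a proof.

```python
revenues = [3.0, 12.5, 7.0]  # Squatch, Norm, Gamma (from the transcript)

def ssr_around(height, values):
    """SSR when every prediction is the same horizontal line at `height`."""
    return sum((v - height) ** 2 for v in values)

mean_revenue = sum(revenues) / len(revenues)
ssr_mean = ssr_around(mean_revenue, revenues)
print("mean:", mean_revenue, "SSR around mean:", ssr_mean)

# Arbitrary alternative horizontal lines, chosen only for illustration.
for height in [5.0, 6.5, 7.0, 8.5, 10.0]:
    print("height:", height, "SSR:", ssr_around(height, revenues))
```

Every alternative height gives a larger SSR than the mean does, which is why the mean is the fair baseline for predictions that ignore the number of stores.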
Okay, calculating R squared is surprisingly straightforward.
We start with the sum of squared residuals around the mean and subtract the sum of squared residuals around the blue line. Then we divide that difference by the sum of squared residuals around the mean. In this example, we'll start by calculating the sum of squared residuals around the mean. And that ends up being 45.5.
Now we'll calculate the sum of squared residuals around the blue line and that ends up being 25.5.
Now we just plug in the values for the sum of squared residuals around the mean and the sum of squared residuals around the blue line and we get 0.44.
The result, 0.44, tells us that there was a 44% reduction in the sum of squared residuals between the mean and the blue line.
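The R squared arithmetic described above can be reproduced from the data. This sketch assumes the three data points quoted in the transcript and the rounded blue line (intercept 3, slope 0.5), which is what reproduces the 45.5 and 25.5 given in the video.

```python
# (number of stores, revenue) as quoted in the transcript.
data = [(2, 3.0), (12, 12.5), (15, 7.0)]
INTERCEPT, SLOPE = 3.0, 0.5  # rounded blue line from the transcript

mean_revenue = sum(y for _, y in data) / len(data)

# SSR around the mean: predictions ignore the number of stores.
ss_mean = sum((y - mean_revenue) ** 2 for _, y in data)

# SSR around the blue line: predictions use the number of stores.
ss_line = sum((y - (INTERCEPT + SLOPE * x)) ** 2 for x, y in data)

r_squared = (ss_mean - ss_line) / ss_mean
print(f"SS(mean) = {ss_mean}, SS(line) = {ss_line}, R squared = {r_squared:.2f}")
```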
Now to get a sense of what exactly a 44% reduction in the sum of squared residuals means, let's calculate R squared for a few other data sets.
For example, if these were the data points, then this is the best fitting line and this is the mean revenue value.
Now, if the best fitting line and the mean look similar to you, then that is because they are in fact the same. So, let's calculate R squared for when the best fitting line is the same as the mean.
And we get 0.0. That result tells us that there was a 0% reduction in the sum of the squared residuals between the blue line and the mean. In other words, the predictions made by the blue line aren't any better than predictions made by the mean. And this makes sense because the blue line and the mean both make the same predictions.
Note, when we do linear regression, the lowest value for R squared is zero, because if the blue line minimizes the sum of the squared residuals, it can't fit the data points worse than the mean.
Small bam.
Now let's look at what happens when we calculate R squared for these data points. In contrast to the last example, now the line fits the data points perfectly.
In other words, all of the residuals are zero.
And when we make predictions without using the number of stores, we end up with some residuals that are greater than zero.
So now let's calculate R squared, and we get 1.0.
The result tells us that there was a 100% reduction in the sum of the squared residuals between the blue line and the mean. And this makes sense because the sum of the squared residuals went from 23.2 to 0.
So when we have a line that fits the data perfectly and the sum of the squared residuals around the mean is greater than zero, then R squared is 1.0, and it tells us that there is a 100% reduction in the sum of the squared residuals.
Note, since we can't have a negative value for the sum of the squared residuals, 1.0 is the largest possible value for R squared.
In summary, R squared goes from zero, when the best fitting line is no better at making predictions than the mean value, to one, when the best fitting line fits the data perfectly and the mean does not.
Now, let's go back to the original data points and the original value for R squared that we calculated, 0.44.
Since R squared values go from 0 to 1, we now know that 0.44 is in the middle of possible values. And that means the predictions made with the number of stores aren't nearly as bad as predictions made without the number of stores.
But it also means that the predictions made with the number of stores are not that great either and that there is a lot of room for improvement.
So now we know how to use R squared to quantify the accuracy of the predictions we make with the blue line. Double bam.
Now remember our goal is to quantify our confidence in the predictions we make with the blue line. And that means we need to quantify the accuracy of the predictions.
And we just did that with R squared.
So now we need to quantify the probability that random chance could give us similar or even better predictions.
This is because even if our blue line made awesome predictions and had an R squared value close to one, if random chance can give us predictions that are at least as good, then we still can't have a lot of confidence in them. So now let's learn how we can quantify the probability that random chance could give us similar or even better predictions.
To illustrate how we do this, we'll start with a new data set that only has two points. When we only have two data points, then we can always get a line to fit them perfectly. And this means that for any two random points, the sum of the squared residuals around the blue line will equal zero.
So as long as the sum of the squared residuals around the mean is greater than zero, then for any two random points, R squared equals 1.
This means that any two random points will have R squared equal to 1 because the sum of the squared residuals around the blue line equals 0 and thus will always be 100% smaller than whatever we get for the sum of the squared residuals around the mean. In other words, R squared can tell us how well a line improves predictions over just using the mean.
But on its own, it's not super useful since any two random points can give us perfect predictions.
Thus, for linear regression, whenever we calculate an R squared value, we also calculate something called a p value.
Among other things, the p value will give us a sense of how much confidence we should have that the predictions made with this blue line will be better than predictions made with just the mean.
For linear regression, there are a few ways to calculate the p value. The most common way is to get a computer to plug numbers into a big fancy equation.
And this is what will happen when you do linear regression on your own.
Technically, plugging numbers into a big fancy equation isn't hard, but it won't help us understand what the p value means. So, in this StatQuest, instead of plugging numbers into a big fancy equation, we're going to calculate the p value in a way that makes it easy to understand its meaning.
First, imagine we picked two completely random points. Now, let's fit a line to them, calculate R squared, and add that value to a histogram.
Now, let's pick another two random points, fit a line to them, calculate R squared, and add that value to our histogram.
Now, just imagine repeating that process thousands of times to build our histogram.
Bam.
This histogram tells us what R squared values we should expect when we fit a line to two randomly selected points.
Now we go back to the original two data points that we observed and the R squared value that we calculated for them, 1, and figure out what percentage of values in the histogram are greater than or equal to 1 to get the p value. In this example, 100% of the values in the histogram are greater than or equal to 1. So the p value for the R squared associated with the original data is 1.0.
In other words, there's a 100% chance that random data could result in an R squared value at least as large as what we originally got. And that means we have very little confidence that predictions made with the blue line will be any better than predictions made with the mean. Small bam.
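The two-point thought experiment can be run as a small simulation. This is a sketch under assumptions: the video does not say how the random points are drawn, so the code draws both coordinates from a standard normal distribution, uses a fixed seed, and runs 5,000 trials. Because of floating-point rounding, an R squared within 1e-9 of 1 is counted as 1.

```python
import random

def r_squared(xs, ys):
    """R squared for the least-squares line fit to the points (xs, ys)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_mean = sum((y - y_mean) ** 2 for y in ys)
    return (ss_mean - ss_line) / ss_mean

rng = random.Random(0)   # fixed seed so the run is repeatable
N_POINTS, N_TRIALS = 2, 5000
observed_r2 = 1.0        # R squared for the original two points

histogram_values = []
for _ in range(N_TRIALS):
    xs = [rng.gauss(0, 1) for _ in range(N_POINTS)]
    ys = [rng.gauss(0, 1) for _ in range(N_POINTS)]
    histogram_values.append(r_squared(xs, ys))

# p value: fraction of random R squared values at least as large as the observed one.
p_value = sum(v >= observed_r2 - 1e-9 for v in histogram_values) / N_TRIALS
print("p value:", p_value)
```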
This example with just two points results in a pretty dull histogram and an obvious result.
However, things are a little more interesting when we return to our original data with three points and the corresponding R squared value, 0.44.
Now, just like we did before, let's compare the R squared value that we got from our data points to what we get from random data points. Now we start with three random points, and then we calculate R squared for them and keep track of that value in a histogram.
Now let's calculate R squared for another three random points and keep track of that value in our histogram.
If we repeat this process thousands of times, we end up with a histogram that looks like this.
This histogram tells us what R squared values we should expect when we fit a line to three randomly selected points.
Now going back to the original data and the original R squared value, 0.44, we can calculate the p value by determining the percentage of values in the histogram that are greater than or equal to 0.44.
In this example, 53% of the values in the histogram are greater than or equal to 0.44.
So the p value equals 0.53.
In other words, there's a 53% chance that random data could result in an R squared value at least as large as what we originally got. And that means we can't have a lot of confidence that predictions made with the blue line will be better than predictions made with just the mean.
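The three-point histogram can be approximated the same way. Again, this is a sketch under assumptions the video does not state: random points are drawn with standard normal coordinates, the seed is fixed, and 20,000 trials stand in for "thousands of times". The simulated fraction depends on those choices, so it should land in the neighborhood of the 53% quoted above rather than match it exactly.

```python
import random

def r_squared(xs, ys):
    """R squared for the least-squares line fit to the points (xs, ys)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_mean = sum((y - y_mean) ** 2 for y in ys)
    return (ss_mean - ss_line) / ss_mean

rng = random.Random(0)    # fixed seed so the run is repeatable
N_POINTS, N_TRIALS = 3, 20000
observed_r2 = 0.44        # R squared for the original three points

random_r2 = []
for _ in range(N_TRIALS):
    xs = [rng.gauss(0, 1) for _ in range(N_POINTS)]
    ys = [rng.gauss(0, 1) for _ in range(N_POINTS)]
    random_r2.append(r_squared(xs, ys))

# p value: fraction of random R squared values at least as large as the observed one.
p_value = sum(v >= observed_r2 for v in random_r2) / N_TRIALS
print(f"simulated p value: {p_value:.2f}")
```

With more data points, large R squared values become much rarer under random data, which is what makes small p values possible.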
Now that we know we can't have confidence in our predictions, we finally know how to answer the question: should we build three new Spend and Save stores?
The blue line and its equation tell us that building three new stores will increase our revenue to 6.5.
But the R squared value, 0.44, tells us that predictions from the blue line aren't super accurate. And the corresponding p value, 0.53, tells us that there is a 53% chance that random data points will give us predictions that are at least as accurate.
In other words, the p value tells us we can't have a lot of confidence in the predictions made with the blue line. So, we tell our boss that it is possible that building three new stores will increase our revenue, but it might not and we should get more data before we decide to build. Bam.
Yes, because we learned the essence of linear regression and that's kind of a big deal. Triple bam.
Now, it's time for some shameless self-promotion.
If you want to review statistics, machine learning, and AI offline, check out the StatQuest PDF study guides and my best-selling books on machine learning, neural networks and AI, and statistics at statquest.org.
There's something for everyone. Hooray!
We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate. The links are in the description below. All right, until next time. Quest on.