This lecture provides a clear and rigorous foundation for understanding uncertainty and likelihood in machine learning. It is a solid academic summary that effectively bridges the gap between empirical risk and probabilistic intuition.
Deep Dive
Probabilistic ML Lecture 1: From What is ML, to Empirical Risk and Maximum Likelihood intuition.
Hello folks, I am Aayush, and today I am starting a course on probabilistic machine learning. This is the introductory lecture. In this course we will first cover the foundations of machine learning. Then we will cover linear models.
Then we study deep neural networks, and after that some special topics.
So let's get started. The first question is: what is machine learning?
There is a famous definition by Professor Tom Mitchell. He said that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance on the tasks in T, as measured by P, improves with experience E. So there are three things: the experience, the task, and the performance measure. In this course, we look at machine learning from a probabilistic perspective.
In the probabilistic perspective, there is some unknown quantity of interest that we want to estimate. For instance, how likely is it that it will rain tomorrow? We cannot give a definite answer; there is some amount of uncertainty. Or say I have a model as a black box and I want to estimate its parameters. We treat these unknown quantities of interest as random variables.
And these random variables are endowed with probability distributions.
So what do we mean by this? First of all, what is a random variable? Suppose I have a coin and I flip it. When I flip it, will I get heads or tails? It is not certain, right?
But suppose I toss the coin and get heads. To do any mathematical analysis, we need to associate each outcome in the sample space, heads or tails, with a numerical value, a number on the real line. So we need a mapping, a deterministic mapping function. The randomness is not in the random variable; the randomness is in how I flip the coin. So a random variable is neither random nor a variable. It is not random because it is a deterministic mapping from the outcome to a number, and it is not a variable because it is a function. So that is random variables. Now, why the probabilistic perspective? Firstly, as we will cover in decision theory, the probabilistic approach is the optimal approach to making decisions under uncertainty, and that is mainly what we do in machine learning: there is an uncertain situation and we want to make decisions.
So the first point is that it is the optimal approach to decision making under uncertainty.
The second point is that many areas of science and engineering already use probabilities.
So if you want a unifying theme between machine learning and other areas of engineering, it is best to speak the language of probability.
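Going back to the coin example, the statement that a random variable is a deterministic mapping can be sketched in a few lines of Python. This is only an illustration: the encoding heads to 1 and tails to 0, the function names `X` and `flip`, and the fixed seed are all assumptions made for the sketch, not something stated in the lecture.

```python
import random

def X(outcome: str) -> int:
    """Random variable: a deterministic map from an outcome to a number.
    The encoding heads -> 1, tails -> 0 is an illustrative choice."""
    return {"heads": 1, "tails": 0}[outcome]

def flip(rng: random.Random) -> str:
    """The experiment. The randomness lives here, not in X."""
    return rng.choice(["heads", "tails"])

rng = random.Random(0)  # fixed seed so repeated runs give the same flips
outcomes = [flip(rng) for _ in range(1000)]
values = [X(o) for o in outcomes]

# Fraction of flips that came up heads
print(sum(values) / len(values))
```

Calling `X("heads")` always returns the same number; only the outcome passed in varies from flip to flip.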
All right. Now, there are various kinds of machine learning, and the first kind is called supervised machine learning.
So what happens in supervised learning?
I talked about three things: the experience, the task, and the performance measure. So what is the task in supervised learning? Supervised means we give the model some kind of supervision: for this input, this is the true output, or for this input, this is the correct label. The task T is to learn a function approximator. Remember I said a random variable is a deterministic mapping; here too we want to learn a mapping. So the task is to learn some function $f$ that maps inputs $x$ in the input space $\mathcal{X}$ to outputs $y$ in the label space $\mathcal{Y}$.
The inputs $x$ are called features or covariates, and $y$ is called the label or the response.
Here $x$ can be anything: it could be a real number, or it could be a $D$-dimensional vector, depending on the problem.
The next thing is the experience E, from which this function $f$ will be learned.
We need to provide a data set, as we always do in machine learning. It is a data set $\mathcal{D}$ which has, let's say, $N$ input-output pairs $(x_n, y_n)$.
This capital $N$ is called the sample size. The next thing is the performance measure P. What performance are we measuring? Are we trying to classify something, or are we trying to predict some real values, which is called regression? So P depends on the type of output we wish to predict.
So let us touch a bit upon classification.
What is going on in classification? I told you about $x$ and $y$. In classification, the output label $y$ that I am trying to predict is one of $C$ classes, so we write $y \in \{1, 2, \ldots, C\}$. The classes are mutually exclusive and unordered, and $y$ can take any value from this set.
Predicting a label given some input $x$ has a very common name in the machine learning literature: pattern recognition.
And if $y$ can take only two possible labels, say $\{0, 1\}$ or $\{+1, -1\}$, that is called binary classification. Now let us try to understand this with a very basic example: the Iris data set.
In the Iris data set we have four features measured on flowers: sepal length, sepal width, petal length, and petal width. There are three categories of flowers, so the label space $\mathcal{Y}$ has three possible classes: setosa, versicolor, and virginica. Given those features, we want to predict which type of flower it is. There is an important point here. Suppose I just gave an image of the flower to a machine learning system. What is the dimension of the input? An image has height times width pixels times the number of channels, so the dimension would be very large. If we know which features are relevant for telling these three flowers apart, that is a better option. A famous botanist identified these four features: sepal length, sepal width, petal length, and petal width. He found them by hand, but today in machine learning and deep learning such features are learned automatically. We will talk about that later; for now, we have four features. So the data set we get, $X$, is an $N \times D$ matrix, which we call the design matrix.
To visualize it, think of tabular data. One column is an index, then come the feature columns, say $f_1$, $f_2$, and so on, and the last column is the label. For index one you have certain feature values and then a label. Because there are three classes, we can encode them as integers: setosa is 0, versicolor is 1, virginica is 2. So different rows carry different labels.
This is tabular data, and every row is an example, a feature vector. We have $N$ such examples, and the feature dimension is $D$. Since the data is tabular, the length of the feature vector is fixed. But what happens if the number of features is not the same for every example? That is what I mean by non-tabular data. In that case we have to do something called featurization, which we will discuss later when we study document classification.
One approach used there is the bag-of-words approach, which we will also study later.
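The tabular layout described above, an $N \times D$ design matrix plus an integer-encoded label column, could be sketched as follows. The three rows hold made-up measurements chosen only to show the shape of the data; they are not quoted from the real Iris data set, and the variable names are assumptions for this sketch.

```python
# Column order of the feature vector (D = 4)
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Integer encoding of the three classes, as in the lecture
class_to_index = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Illustrative rows only: (feature vector, class name)
rows = [
    ([5.0, 3.4, 1.5, 0.2], "setosa"),
    ([6.1, 2.9, 4.5, 1.4], "versicolor"),
    ([6.7, 3.1, 5.6, 2.2], "virginica"),
]

X = [features for features, _ in rows]          # N x D design matrix
y = [class_to_index[name] for _, name in rows]  # length-N label vector

N, D = len(X), len(X[0])
print(N, D, y)
```

Every row has the same length `D`, which is exactly what makes the data tabular.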
Now, the next term you have commonly heard in machine learning is exploratory data analysis, or EDA.
Why do we need to do data analysis at all? Why not just take the input and apply the model? There are two obvious reasons. The first is that the features might have some patterns; for example, some features may be correlated, so we may not need to use all of them.
Redundant features are one instance. So exploratory analysis is done to find the obvious patterns in the data. How do we do that? If we have a small number of features, we can make a pair plot to see how each feature interacts with every other feature. In higher dimensions we do something called dimensionality reduction, which we will discuss later; for now, just know the term. We reduce the high-dimensional data to two or three dimensions, for example its leading principal components, because 2D or 3D is easier to visualize than a very high-dimensional space.
The other reason EDA is needed is to spot problems with the data. For example, we might see that some features are just noise, so we might discard them. Now the important question. I told you that we want to decide whether an example belongs to setosa, versicolor, or virginica. But how do we do the classification? We want to learn a classifier, because that is what we do in image classification or pattern recognition.
So learning a classifier is what we are interested in.
Let's try to develop the thought process for how we can do that. I told you that for the Iris data set we have four features: sepal length, sepal width, petal length, and petal width. Suppose we make a 2D plot, with petal length on one axis and petal width on the other.
Say these points are characteristic of setosa, these of versicolor, and these of virginica. What learning a classifier means is that we want to learn a decision boundary, or nested decision boundaries.
What does common sense tell me? If I want to separate setosa from the other two classes, a decision line could be something like this: points to the left of it are setosa and points to the right are not setosa. How can we write that mathematically? Let $f$ be the mapping that my model is learning. It takes the input $x$ and has its own set of parameters $\theta$. It should give me setosa if the petal length is less than some value $\alpha$; otherwise it is versicolor or virginica.
Observe that we need a further split. That is why I mentioned nested decision boundaries. So another decision line could be something like this.
Of course we will misclassify some examples, because the classes are not linearly separable, but we are making progress. So we can split further. The feature in the condition, petal length, together with the threshold $\alpha$, are called the parameters, and these parameters are what must be learned. I should not set these conditions by hand; the machine learning algorithm must be able to learn these parameters automatically.
Let us look at this as a decision tree diagram so that it is clearer.
Suppose I have a node, and at that node the test is petal length PL less than some $\alpha$. Petal length is the feature and $\alpha$ is the threshold.
Initially we have 150 samples, with a vector of class counts (50, 50, 50): 50 setosa, 50 versicolor, and 50 virginica. The class I want to split off first is setosa.
If the condition at this node is true, we are left with the vector of counts (50, 0, 0). It has 50 samples and makes a perfect division, so this leaf is class setosa. We will learn all of this in detail when we study decision trees.
But what if the condition is false?
In that case we are left with 100 samples, with counts (0, 50, 50), so we need to make another split. We split on petal width: petal width less than some $\beta$.
Say that if this is true we get counts (0, 45, 5): 45 versicolor and 5 virginica. Because the classes are not linearly separable, some examples end up on the wrong side, but it is still not bad. This leaf has 50 samples and its class is versicolor. The rest go to the other leaf, with counts (0, 5, 45) so that the totals add up, and its class is virginica.
What we intend is that these parameters, petal length less than $\alpha$ and petal width less than $\beta$, should be learned automatically. That is our goal. Now we will discuss how we actually learn these parameters. So let's move on to the next topic.
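The nested decision boundaries described above amount to two threshold tests applied in sequence. A minimal sketch follows, assuming the first split is on petal length and the second on petal width as in the lecture. The function name `classify` and the numeric values of `ALPHA` and `BETA` are hypothetical placeholders; in practice they are exactly the parameters a learning algorithm would have to find.

```python
def classify(petal_length: float, petal_width: float,
             alpha: float, beta: float) -> str:
    """Two nested threshold tests, mirroring the tree in the lecture."""
    if petal_length < alpha:   # root node: splits off setosa
        return "setosa"
    if petal_width < beta:     # second node: versicolor vs virginica
        return "versicolor"
    return "virginica"

# Hypothetical thresholds, chosen by hand only for this illustration
ALPHA, BETA = 2.5, 1.7

print(classify(1.4, 0.2, ALPHA, BETA))
print(classify(4.5, 1.4, ALPHA, BETA))
print(classify(5.6, 2.2, ALPHA, BETA))
```

Hard-coding `ALPHA` and `BETA` like this is the manual approach the lecture argues against; the point of model fitting is to choose them from data.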
Now we study something very important, which is called empirical risk minimization. I told you that we want to learn the parameters $\theta$ of the model. But first we need some performance measure: how well are those parameters doing? We want to estimate some kind of loss or error, and then we want to minimize it. So the first topic is empirical risk minimization.
We want to come up with classification models, and we want to measure how well the model is doing. If it is a classification problem, what could be one way? The misclassification rate. But first, what does the word empirical mean here? We do not know the true distribution from which our data comes; that is unknown to us. But we can draw samples from that distribution.
We observe the samples, and on the basis of them we compute the loss. That is why we use the term empirical.
We compute the misclassification rate on the training set, because that is what we are given. It is written as $\mathcal{L}(\theta)$, and to build it I consider all the examples that I have. How do I count the errors? With an indicator function.
What is an indicator function? The indicator function $\mathbb{I}(e)$ is 1 if the condition $e$ is true and 0 otherwise.
I have my data set of pairs $(x_n, y_n)$, where for the input $x_n$, $y_n$ is the gold-standard true output.
$f(x_n; \theta)$ is the mapping that my model has learned, so this is the model prediction. If it does not give the same label as the gold-standard output $y_n$, we count one, because this is a loss: it measures how many errors the model makes. Then we have to normalize. Why? Because getting 900 correct out of a thousand examples is better than getting, say, two correct out of 10, and raw counts cannot be compared across data set sizes. That is why we divide by $N$:
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}\big(y_n \neq f(x_n;\theta)\big)$$
This is called the misclassification rate on the training set. But notice that the indicator is just one if the labels differ and zero otherwise, so every kind of error gets the same importance. Say that of my three classes, setosa, versicolor, and virginica, virginica happens to be poisonous. If we misclassify virginica as setosa, that is, we eat a poisonous flower, that is much more harmful than other mistakes. So ideally this should not be an indicator function; we could give more weight to misclassifying virginica. That is why in general we use a loss function $\ell$, which may be asymmetric. It is nothing difficult: instead of the indicator we write $\ell$,
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell\big(y_n, f(x_n;\theta)\big)$$
where the true label $y_n$ is compared with the prediction.
We will discuss what these functions can be; there can be many choices of $\ell$. For now, think of it as some loss function that measures the deviation between the true and the predicted output.
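The difference between the misclassification rate and a more general, asymmetric loss can be made concrete with a small sketch. It assumes the label encoding setosa 0, versicolor 1, virginica 2 used earlier. The function names, the toy label lists, and the cost of 10 for predicting setosa when the truth is virginica are all invented for illustration.

```python
def empirical_risk(y_true, y_pred, loss):
    """Average per-example loss over the training set: (1/N) * sum of losses."""
    return sum(loss(y, p) for y, p in zip(y_true, y_pred)) / len(y_true)

def zero_one(y, p):
    """Indicator loss: 1 if the prediction differs from the label, else 0."""
    return 0.0 if y == p else 1.0

def asymmetric(y, p):
    """Hypothetical costs: calling a virginica (2) a setosa (0) costs 10,
    every other mistake costs 1."""
    if y == p:
        return 0.0
    if y == 2 and p == 0:
        return 10.0
    return 1.0

# Toy labels and predictions, made up for this sketch
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 0, 2]

print(empirical_risk(y_true, y_pred, zero_one))    # 2 errors out of 4
print(empirical_risk(y_true, y_pred, asymmetric))  # (1 + 10) / 4
```

Both risks are computed from the same predictions; only the weighting of the two mistakes changes.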
Now, I said that we want to learn those parameters.
That is what we call the problem of model fitting.
Model fitting, or data fitting, is an optimization problem, because I have to minimize this loss function. We want to find a set of parameters $\hat{\theta}$, where arg min stands for the argument that minimizes the loss function. If I get a small loss on the training set, it means the predictions follow the gold-standard outputs $y_n$ fairly well. Putting the loss into this equation gives what is called empirical risk minimization:
$$\hat{\theta} = \arg\min_{\theta} \mathcal{L}(\theta) = \arg\min_{\theta} \frac{1}{N}\sum_{n=1}^{N} \ell\big(y_n, f(x_n;\theta)\big)$$
So the loss is $\ell(y_n, f(x_n;\theta))$, with $f$ parameterized by $\theta$.
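As a rough illustration of the arg min, the sketch below fits a single threshold by brute force: it tries a grid of candidate values and keeps the one with the lowest misclassification rate on the training set. The one-dimensional data, the candidate grid, and the names `f`, `risk`, and `alpha_hat` are assumptions made for the sketch. Grid search is used here only because it is easy to read, not because the lecture prescribes it.

```python
# Made-up 1D training set: (petal length, label), label 1 = setosa, 0 = not setosa
data = [(1.3, 1), (1.5, 1), (1.7, 1), (4.0, 0), (4.7, 0), (5.1, 0)]

def f(x: float, alpha: float) -> int:
    """Model: predict setosa (1) when petal length is below the threshold."""
    return 1 if x < alpha else 0

def risk(alpha: float) -> float:
    """Empirical risk with the indicator loss (misclassification rate)."""
    return sum(f(x, alpha) != y for x, y in data) / len(data)

# Candidate thresholds 0.5, 1.0, ..., 6.0
candidates = [0.5 * k for k in range(1, 13)]

# arg min over the candidates of the empirical risk
alpha_hat = min(candidates, key=risk)
print(alpha_hat, risk(alpha_hat))
```

Several thresholds reach zero training error on this toy set, and `min` simply returns the first of them. Which one generalizes best to future data is a separate question.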
But just minimizing the loss on the training set is not the actual goal. What I actually want is to minimize the loss on the true distribution from which I sampled my data.
So the true goal in supervised learning is to minimize the expected loss on future data that we will sample from that distribution. Finding some $\hat{\theta}$ on the training set is fine, and when I talk about validation and test sets we will discuss this in more detail. Minimizing the expected loss on future data means that if I could sample infinitely many data points from my true distribution, I should have a $\theta$ that minimizes the loss on them. This is called generalization.
That is why we say that in machine learning we want models that generalize, not models that only fit the training set.
The hope is that if I minimize the loss on the training set, $\hat{\theta}$ will also be a good predictor for future samples drawn from the true distribution. That is what it means to generalize.
The next important topic is uncertainty, followed by maximum likelihood estimation.
I said that we approach machine learning from a probabilistic perspective. Machine learning is inherently stochastic.
If I toss a coin, I do not know exactly whether it will be heads or tails.
So there is some uncertainty. Even if it is cloudy today, I do not know whether it will rain tomorrow. The systems we model are stochastic in nature; they are not deterministic in the sense that for this input I get exactly this output. That is why we build our ML models with uncertainty in mind, considering that the real world is not truly deterministic. Uncertainty can be of two types. The first is called epistemic uncertainty, or model uncertainty.
It arises because we do not know the exact input-output mapping, that is, what the true function is, and in many cases we cannot estimate it perfectly. So we have uncertainty in the model: a lack of knowledge of the input-output mapping.
The other kind of uncertainty is called aleatoric uncertainty, or data uncertainty.
Say I know the model perfectly. If I toss a coin and I know the coin is unbiased, there is a 50% chance of heads and a 50% chance of tails, so I know the distribution. But even so, I do not know whether the next coin toss will be heads or not. This is data uncertainty, and it is an intrinsic, irreducible uncertainty.
It arises by virtue of the stochasticity, the randomness, in the input-output mapping itself.
So we want to build models that take this uncertainty into account.
How do we do that? To capture this uncertainty mathematically, we use conditional probability distributions: a probability distribution over the output, conditioned on the input. We write the probability that the output label $y$ takes on some specific class $c$, given the input $x$ and a set of parameters $\theta$:
$$p(y = c \mid x; \theta) = f_c(x; \theta)$$
Here $f$ is the predictor function I am learning, and $f_c$ is its output for class $c$. The left side is a conditional probability, and this is what the model is trying to predict, so $f_c$ must also be a probability. That puts constraints on it: it must be between 0 and 1 for every class $c$ that the label can take, $0 \le f_c(x;\theta) \le 1$, and the sum over all classes must be one, $\sum_{c=1}^{C} f_c(x;\theta) = 1$. These are the standard conditions for a probability distribution.
So I am learning some function $f$, but I am putting extra constraints on it, which is tedious. What if we could take an $f$ that has no such constraints and feed it to another function that returns a probability distribution? That is what we do, using something called the softmax function.
The unconstrained outputs of $f$ that softmax takes are called logits. They form a vector $a = (a_1, \ldots, a_C)$, with $a_c = f_c(x;\theta)$ for each of the $C$ classes. Softmax converts this entire vector of logits into a probability distribution. To be notationally correct, $a$ is a vector, and we write
$$\mathrm{softmax}(a) = \left[\frac{e^{a_1}}{\sum_{c'=1}^{C} e^{a_{c'}}}, \ldots, \frac{e^{a_C}}{\sum_{c'=1}^{C} e^{a_{c'}}}\right]$$
Each of these terms is between 0 and 1, and they sum to one.
So we have turned the prediction into a probability distribution.
We will discuss later why we want the output in the form of a probability distribution.
So the equation becomes
$$p(y = c \mid x; \theta) = \mathrm{softmax}_c\big(f(x;\theta)\big)$$
where $\mathrm{softmax}_c$ denotes the $c$-th component. Now we have a probability distribution on both sides, and $f$ is just my raw prediction.
We are not constraining $f$ to be between 0 and 1. The softmax takes care of that and converts $f$ into a probability distribution.
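A short sketch of the softmax step, using only the standard library. The logit values are arbitrary numbers picked for illustration. Subtracting the largest logit before exponentiating is a common numerical-stability trick that is not mentioned in the lecture; it leaves the result unchanged because the same factor cancels in the numerator and denominator.

```python
import math

def softmax(logits):
    """Map a list of unconstrained logits to a probability distribution."""
    m = max(logits)  # shift for numerical stability (an addition of this sketch)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary logits for C = 3 classes; note they are not between 0 and 1
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

print(probs)
print(sum(probs))
```

The largest logit receives the largest probability, and the ordering of the classes is preserved.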
Now, how can we write the function $f$ in terms of the parameters? There are many kinds of functions, and we will study $f$ in great detail, but a simple choice is a linear function, $f(x;\theta) = w^\top x + b$. Here $w$ and $x$ are vectors in $\mathbb{R}^D$ and the bias $b$ is a scalar. This $b$ term introduces a lot of mathematical clutter later on, so we absorb it. We bring $b$ inside the weight vector, $\tilde{w} = [b, w_1, \ldots, w_D]$, so that its first entry is $b$, and we prepend a one to the feature vector, $\tilde{x} = [1, x_1, \ldots, x_D]$. This is also convenient when we write code for it.
Then $w^\top x + b$ is equal to $\tilde{w}^\top \tilde{x}$. This is just to get rid of notational clutter. Now, the last topic for this part is maximum likelihood estimation.
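The bias-absorption trick described above can be checked numerically. The sketch below uses made-up weights, bias, and features, and a small hypothetical helper `dot`, to confirm that $w^\top x + b$ and $\tilde{w}^\top \tilde{x}$ give the same number.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Made-up parameters and one feature vector (D = 3)
w = [0.5, -1.0, 2.0]
b = 0.25
x = [1.0, 2.0, 3.0]

# Absorb the bias: b becomes the first weight, and a 1 is prepended to x
w_tilde = [b] + w
x_tilde = [1.0] + x

lhs = dot(w, x) + b           # w^T x + b
rhs = dot(w_tilde, x_tilde)   # w_tilde^T x_tilde

print(lhs, rhs)
```

After this change the model has a single parameter vector of length `D + 1`, which is what removes the separate bias term from later formulas.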
We said earlier that we want to minimize a loss function; that is what I wrote before. The negative log likelihood, which we get to now, is one such loss.
For maximum likelihood estimation, we will just give a very intuitive idea here. When I fit a probabilistic model, what do I want? If I give the input $x_n$ to my predictor $f$, it should predict $y_n$. So consider the conditional probability $p(y_n \mid f(x_n;\theta))$: given my input $x_n$ and my predictor, what is the probability that I observe the true label $y_n$?
Say my model has learned some function $f$. What I want to test is how likely the true label $y_n$ is, given $x_n$ as input. This should be highly likely, because $y_n$ is the label for $x_n$. So this value should be high, and by high I mean close to one. The model should maximize the likelihood of seeing my true data.
When we study KL divergence in decision theory, we will understand why we put a log here. So we take the log, and we do it for all the examples:
$$\sum_{n=1}^{N} \log p\big(y_n \mid f(x_n;\theta)\big)$$
This is the log of the likelihood function.
Now think about it this way: what should the loss $\ell(y_n, f(x_n;\theta))$ be? We take it to be the negative of the log likelihood of each example.
Hence this term is called the negative log likelihood. And of course we normalize by $N$, so
$$\mathrm{NLL}(\theta) = -\frac{1}{N}\sum_{n=1}^{N} \log p\big(y_n \mid f(x_n;\theta)\big)$$
Given that I take my input $x_n$ and make a prediction, how likely is my true output $y_n$? This probability should be high, and I take its log and put a negative sign in front. As an example, say that for some particular $n$ this probability is 0.9, and for another $n$ it is 0.01.
If I take the log of 0.9, it is close to zero, a small negative value, and the negative of it is a small positive value. But if the probability is very low, its log is a large negative value, and the negative of that is a large positive value. So the negative log likelihood is the loss, and a higher value means the model is doing poorly. That is why minimizing the negative log likelihood to find $\hat{\theta}$ is the same as taking the arg max of its negative, and its negative is the normalized log likelihood. So we are maximizing the likelihood:
$$\hat{\theta} = \arg\min_{\theta} \mathrm{NLL}(\theta) = \arg\max_{\theta} \frac{1}{N}\sum_{n=1}^{N} \log p\big(y_n \mid f(x_n;\theta)\big)$$
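The 0.9 versus 0.01 example can be reproduced directly. The sketch below assumes we already have, for each training example, the probability the model assigns to its true label; the function name `nll` and the variable names are chosen only for this illustration.

```python
import math

def nll(probs_of_true_label):
    """Negative log likelihood: -(1/N) * sum of log p(y_n | f(x_n; theta)).
    Input: for each example, the probability assigned to its true label."""
    n = len(probs_of_true_label)
    return -sum(math.log(p) for p in probs_of_true_label) / n

confident = nll([0.9])   # true label judged likely: small loss
poor = nll([0.01])       # true label judged unlikely: large loss

print(confident, poor)
print(nll([0.9, 0.01]))  # average over both examples
```

Raising the probability of the true labels lowers this number, which is why minimizing it is the same as maximizing the likelihood.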
So the goal is to make the true output as likely as possible. That is all for the first lecture. Thank you.