This video provides a clear and structured synthesis of machine learning fundamentals, making it an efficient primer for building a solid conceptual foundation. It effectively distills complex topics like feature engineering and model evaluation into digestible, actionable insights.
Deep Dive
02 All about Machine Learning (ML) | Basics of Machine Learning | Classification & Regression Models
Hey guys, hello and welcome back to the second video of our series, Gen AI for Data Engineers. Today we will discuss machine learning. By the end of this video, you should be able to understand the terms involved in machine learning. The core focus of this video is to help you understand how machine learning models work. We are not going to build a model from scratch; rather, we are going to understand how simple machine learning models work. So without any further delay, let's begin. Everything in machine learning starts with data.
You can think of data as fuel: if machine learning is an engine, then data is the fuel for it. You can't train a model without data.
So data is the primary thing that you need for machine learning. Consider this data set. We will use it to train a model that can detect spam emails.
This data set has four columns. The first one is contains free.
The second one is number of links. The third one is all caps percentage, and the fourth one is spam. The spam column is what we want to predict.
The other three columns are what we use to predict it. They act as the input for the model and are known as features, and the output column that the model has to predict is known as the label.
In the training phase, we provide all four columns to the machine learning model and tell it: these are the features, and these are the labels that go with them. The model trains itself on these features and this output. Once the model is trained, we provide only the feature columns, and the model returns the prediction as its output. This is how a simple supervised learning model works.
In supervised learning, we first provide both the input and the output to the model for training.
Once training is complete, we use only the inputs, which are the features, to predict the labels.
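To make the features/labels distinction concrete, here is a minimal sketch that separates a small spam data set into inputs and outputs. The column names and all rows except the first (yes, 12 links, 45% caps) are made up for illustration.

```python
# Hypothetical rows mirroring the spam data set described in the video.
emails = [
    {"contains_free": 1, "num_links": 12, "all_caps_pct": 45, "spam": 1},
    {"contains_free": 0, "num_links": 1, "all_caps_pct": 2, "spam": 0},
    {"contains_free": 1, "num_links": 7, "all_caps_pct": 30, "spam": 1},
]

FEATURE_COLUMNS = ["contains_free", "num_links", "all_caps_pct"]
LABEL_COLUMN = "spam"


def split_features_and_labels(rows):
    """Return (features, labels): the inputs the model sees and the outputs it must predict."""
    features = [[row[col] for col in FEATURE_COLUMNS] for row in rows]
    labels = [row[LABEL_COLUMN] for row in rows]
    return features, labels


X, y = split_features_and_labels(emails)
print(X)  # one list of feature values per email
print(y)  # one label per email
```

During training a supervised model would receive both `X` and `y`; at prediction time it would receive only rows shaped like those in `X`.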
So these are the two important terms that you'll always find in machine learning: features and labels. Features are the input columns that the model uses for prediction. Labels are the values it predicts. While training a supervised learning model, you provide both features and labels. Once the model is trained, to test it you provide only the features and get back the predictions, which are the labels.
Every machine learning model follows a machine learning workflow, a basic cycle of steps. On my screen you can see the eight steps that I have defined. The very first step is to define the problem: what are you actually trying to achieve or predict? In our case, the problem is predicting spam emails. The next step is to collect and prepare data. We have already seen the data set that I have been using, but there is one important point to note.
Machine learning models work on mathematical formulas or functions.
And mathematical functions do not understand text data. So there are multiple steps that you have to perform in order to convert text data into a numerical format.
There are also cases where you have to clean the data, handle missing values, encode categories, and split the data into train and test sets. For example, if you have 200 records, you might use 150 records to train and 50 records to test. That is the second step: collect and prepare data.
The third step is to choose a model. Various kinds of models are available, and various factors are involved in choosing one. I am not going to cover all of them today, but I'm going to show you some of the basic models and how they work.
The fourth step is to train the model. Once you have split and prepared your data, you use your training data set to train the chosen model. Again, if you are using supervised learning, you provide both features and labels. And we already know that the features are the columns based on which you predict the output, which are the labels.
The next step is to evaluate the model and check whether its accuracy is good enough. To do so, you use the test set that you split off initially, and you provide only the features.
You then see what the model predicts as labels.
And you compare these predicted labels with the actual labels of your test data to determine the accuracy of your model.
Once you have determined the accuracy, you decide whether you have to tune or improve your model. Once that is done, you do a final test on your test data and see the final score or accuracy of your model. And once you are satisfied, you deploy your model to production for actual users. In our case, you deploy the spam email model so that users can detect spam in their mailboxes. This is the basic ML workflow that each machine learning model follows.
Before I dive into other topics, it is important to understand the train and test split. Consider that you have around 200 records of data to create your machine learning model.
It is important that you split these 200 records into two parts: training data and testing data. Your training data should be at least 70 to 80% of the data, and it should be a diverse set of records. It should not be repetitive, because with a repetitive data set there is a high chance that your model will overfit.
That means it fits only a certain type of data.
Okay. So it is important that you have a diverse set of around 70 to 80% of your data for training, and the remaining 20 to 30% goes into the test set, which your model has never seen. The training data is used while training, and the test data is used only to determine the accuracy of your model. It simply means that while training a supervised model, you provide both the features and the labels of your training data set.
While testing, you provide only the features of your test data set and ask the model to predict the labels.
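The split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name, the 75% fraction (150 of 200 records), and the fixed seed are all assumptions chosen for the example.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows, then cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(200))  # stand-ins for 200 labelled emails
train, test = train_test_split(records)
print(len(train), len(test))  # 150 50
```

Shuffling before cutting is one simple way to keep the training part diverse; the test part is set aside and never shown to the model during training.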
Then you check these predicted labels against the actual labels
and determine the accuracy of your model. If the accuracy is low, you have to train your model again on a different, more diverse set of data, and again use a different set of testing data to compare the predicted and actual labels and check the accuracy. That is the importance of the train and test split.
The next and most important topic of today's session is supervised learning. We have already understood that in supervised learning we provide both features and labels to train the model, and to test it we provide only the features and compare the predicted labels with the actual labels to determine the accuracy. Based on the label, supervised learning can be classified into two types. The first one is classification and the second one is regression.
Classification is for when you want to categorize your output labels. In our case the output is either spam or not spam, so the label has only two categories. Models of this kind are known as classification models. Similarly, you could train a model that predicts cancer or not cancer.
You can have more than two categories as well. For example, a predicted sentiment can be positive, negative, or neutral.
In regression, you are trying to predict a value that is not a category, for example the price of a house, say INR 1.2 crore. Regression models are often used for questions such as: what will the price of a stock be on a given date? What is the price of a house in a particular area, or based on its size?
So if your output labels are categorical, the problem comes under classification. If you are trying to predict a value or a number as output, it comes under regression. That is the simplest difference between classification models and regression models, and we are going to see some popular models of both kinds.
If you remember the ML workflow that we discussed a few minutes back, the first step was to define the problem; in our case it was spam email classification. The second step was to collect and prepare the data, and the third step was to select a model. Since we have to classify whether our emails are spam or not, our model should be a classification model. There are many common classification models that you can choose for this problem. The first and most common one is logistic regression. The name says regression.
But remember, this is a classification model. It works by computing a probability between 0 and 1.
In our case, if the score is less than 0.5, the model classifies the email as zero, which means not spam, and if the score is greater than or equal to 0.5, it classifies the email as one, which means spam.
This threshold is something we have defined. Why zero and one? Because, remember, models do not understand text data. So we simply say that zero means not spam and one means spam. If you get a score of 0.8, we classify the email as one, which means spam.
And if you get a score of 0.3, we classify it as zero, which means not spam.
And this is how simple logistic regression works. Let's take an example from the data that we have. We have feature columns such as contains free, number of links, and all caps, and one label column that the model has to predict. The model knows which input columns it takes and what it has to provide as output. Based on the input columns, it learns which ones are important and assigns weights accordingly.
In this case you can think of contains free as having a heavy weight, because if the email contains the word free there's a high probability that it is spam. Number of links might also have a higher weight, because an email with too many links can also be spam. But an email in all caps is not necessarily spam, so that column might have less weight. Consider the very first row. Contains free is yes, so the model assigns a score of, say, 2.3. The number of links is 12, which is also high, so it assigns, for example, 1.8. For all caps, since this column has less weight and the value is around 45%, it assigns, for example, 0.9. If I add all of these numbers, I get 5.
Now remember, this is just an example. These are made-up numbers based on weights: because contains free is much more important than all caps, it has a higher weight, so the score the model assigns for this yes is higher. So here we get a value of 5. This value is then passed into a function called sigmoid.
This is the logistic function, which maps any value between minus infinity and plus infinity to a value between 0 and 1. It means you can have any value, and this function maps it into the range 0 to 1. The sigmoid function maps 5 to about 0.99.
In that case the score is greater than 0.5, so the model classifies the email as one, which means it is spam.
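The walkthrough above can be reproduced with the standard sigmoid formula, 1 / (1 + e^(-z)). This is only a sketch: the three contributions are the illustrative numbers from the example, not learned weights, and the dictionary keys are hypothetical names. Note that the exact sigmoid of 5 is roughly 0.993.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Return 1 (spam) if the probability reaches the threshold, else 0 (not spam)."""
    return 1 if probability >= threshold else 0


# Illustrative per-feature contributions from the walkthrough (not learned values).
contributions = {"contains_free": 2.3, "num_links": 1.8, "all_caps_pct": 0.9}
z = sum(contributions.values())  # about 5.0
probability = sigmoid(z)         # about 0.993
print(round(z, 1), round(probability, 3), classify(probability))
```

A real logistic regression model would learn one weight per feature from the training data and multiply it by the feature value; here the products are simply given.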
Okay. In our case again this is spam. So now you understand how logistic regression works. The important function here is the sigmoid function. All of the numbers that you see here are not actual numbers; I've just used them as a simple example to help you understand how sigmoid works. The baseline that you have to understand is this: once the features are provided, the model trains itself on them and identifies the weight for each feature column, because each column has its own importance. Based on the weights, it calculates a score, which is passed into the sigmoid function to produce a value between 0 and 1. You can then decide that if the value is less than 0.5 you classify it as zero, and if it is greater than or equal to 0.5 you classify it as one. That is how we predict whether a given email is spam or not spam.
Decision tree is another type of classification model, in which a series of yes or no questions is asked in order to reach the prediction. In our case, consider that the first question is: does the email contain free?
If the answer is yes, the flow goes down one side, and if it is no, the flow goes down the other side.
The next question is whether the email contains more than five links. Again, yes sends the flow down one side and no down the other.
If it is yes, the tree marks the email as spam. If it is no, the tree marks it as not spam.
On the other side, the tree checks, for example, whether the email comes from a known contact,
someone you know. If yes, there is a high probability that it is not spam.
If it comes from an unknown contact, there's a high chance that it is spam. As you can see, a series of yes/no questions has been asked in order to reach the prediction, and this is how a decision tree works.
The third and last classification model that we will discuss today is random forest.
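Before moving on to random forests, the decision tree walked through above can be written as nested conditionals. This is a hand-written sketch of the example tree, not a trained model; the function name is hypothetical, and placing the known-contact question on the "no" branch of the first question is an assumption about the diagram.

```python
def predict_spam(contains_free, num_links, known_contact):
    """Walk the yes/no questions of the example tree and return a label."""
    if contains_free:
        # Branch for emails that contain the word "free".
        return "spam" if num_links > 5 else "not spam"
    # Assumed: the known-contact question sits on the "no" branch.
    return "not spam" if known_contact else "spam"


print(predict_spam(contains_free=True, num_links=12, known_contact=False))  # spam
print(predict_spam(contains_free=False, num_links=0, known_contact=True))   # not spam
```

A real decision tree learns which questions to ask, and in which order, from the training data.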
A random forest is a collection of decision trees, in which multiple decision trees are trained on different sets of data.
Consider random set one, on which the first decision tree is trained.
Similarly, decision tree two is trained on random set two.
Decision tree three is trained on random set three, and so on, until we have trained, say, five decision trees on different random sets of data.
Once this training is complete, you use an email for prediction.
Once you pass this email to the random forest, each decision tree provides its own output. For example, the first decision tree outputs spam, the second not spam, the third spam, the fourth spam, and the fifth not spam. The forest then looks at which output has the most votes. In this case there are three spam votes against two, so the email is classified as spam. You can think of it as asking multiple experts instead of a single expert and deciding the outcome based on the majority vote. This is how a simple random forest classification model works.
If you have noticed, I have only explained the concepts behind each classification model. I have not shown you how to create these models, because that is not the core focus of this video. The main focus is to help you understand how these classification models work, so that the next time you hear a term like random forest, you know how it works.
If you go back to the ML workflow, once we have trained our model we have to test it. We then compare the actual labels of our test data with the predicted labels and determine the accuracy of our model. For classification models, we use something called a confusion matrix to determine the accuracy. Here you can see the actuals on the y-axis and the predictions on the x-axis. P stands for positive and N stands for negative. On the y-axis, P and N mean that the actual label was positive or negative; on the x-axis, they mean that the predicted label was positive or negative.
In our spam classification case, positive means spam and negative means not spam.
Consider a case where the actual label in the test data was positive, meaning the email was spam, and the prediction also said spam. That is a true positive.
Now consider a case where the actual label was spam but the prediction said not spam. That is a false negative. If the actual label was not spam but the model predicted spam, that is a false positive.
And if both the actual label and the prediction said not spam, that is a true negative.
To determine the accuracy, you add the true positives and the true negatives and divide by the total number of observations.
So once testing is complete, you count the true positives, where the model detected spam and the email actually was spam, and the true negatives, where the model detected not spam and the email actually was not spam. You add both counts and divide by the total, and that is your accuracy. Consider a case where we had 20 records in the test data set for our spam email classification model. Say we got 12 true positives, meaning both the predicted and the actual values were spam, and 5 true negatives, meaning both the actual and the predicted values were not spam. The rest did not match; they were either false positives or false negatives. In this case, our accuracy would be (12 + 5) / 20 = 17/20, which is 85%.
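The accuracy calculation above can be checked with a short sketch that counts the four confusion-matrix cells from actual and predicted labels. The video gives only 12 true positives and 5 true negatives out of 20; the split of the remaining 3 mismatches into 2 false positives and 1 false negative is an assumption made so the example has concrete data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1
        elif a != positive and p != positive:
            counts["TN"] += 1
        elif a != positive and p == positive:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def accuracy(counts):
    """(TP + TN) divided by the total number of observations."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# 20 test emails: 12 TP, 5 TN, plus an assumed 2 FP and 1 FN (1 = spam, 0 = not spam).
actual = [1] * 12 + [0] * 5 + [0] * 2 + [1] * 1
predicted = [1] * 12 + [0] * 5 + [1] * 2 + [0] * 1
counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))  # accuracy is 0.85
```

Accuracy depends only on the sum TP + TN, so any other split of the 3 mismatches would give the same 85%.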
So the accuracy that we have achieved for our classification model is 85%. And this is how you determine the accuracy of your model using a confusion matrix.
Now that we have learned a lot about classification models, let's learn about regression models. As I explained in the beginning, regression models allow you to predict continuous numerical values. For example, you want to predict the price of a house. You put the features into the model and the model predicts a number, for example INR 1.2 crore.
In this case, you get a number as output, which is why these are known as regression models. The labels are continuous numbers, not a fixed set of categories. We'll learn about two regression models. The first one is linear regression.
And the second one is polynomial regression.
The simple formula for linear regression is y = mx + c. This is the formula of a straight line. Polynomial regression looks something like this: y = m1x + m2x² + m3x³ + c. This is a polynomial formula, which can describe a curve. These are two different regression models that are used to predict numerical values. Consider that we want to predict the price of a house based on its size. In this case you would have something like this.
On the x-axis you have size, and on the y-axis you have price. You provide data points for the model to train on. Consider these to be the data points for size and price.
Based on the data that you provide, the model trains and determines a line through the points. This is your y = mx + c. If you now provide a size for prediction, the model determines the price from this line. That is the predicted price.
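One common way to determine the line y = mx + c from data is ordinary least squares; the video does not say which fitting method is used, so treat this as an assumed technique. The sizes and prices are made up and deliberately lie exactly on a line, so the fitted slope and intercept are easy to verify.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + c for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Made-up training data: size in square feet, price in lakhs (price = 0.06 * size + 4).
sizes = [600, 800, 1000, 1200, 1400]
prices = [40, 52, 64, 76, 88]

m, c = fit_line(sizes, prices)
predicted_price = m * 1100 + c  # read the price for a 1100 sq ft house off the line
print(round(m, 4), round(c, 2), round(predicted_price, 2))
```

With noisy real data the points would not sit exactly on the line, and the fit would be the line that minimises the squared vertical distances to them.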
The predicted price depends on the size that you provide. And this is how linear regression works. Polynomial regression instead tries to build a curve like this.
If you provide a size for prediction, you get a price based on the model's curve. That is your predicted price. And this is how simple linear regression and polynomial regression work.
To determine the accuracy of a regression model, there are four different metrics. All of them check how far your prediction is from the actual value. For example, for a 1200 square foot house the actual price is 80 lakhs and the model predicted 1.2 crore, so the difference is 40 lakhs. The four formulas here summarize such differences in different ways. I'll not go deep into all of them.
You can find the formulas for all of them on the internet, but I'll explain what each of them is. The first one is mean absolute error (MAE): the average of the absolute value of the predicted minus the actual. The second one is mean squared error (MSE): the average of the square of the predicted minus the actual. The third one is root mean squared error (RMSE).
It is the square root of MSE.
And the final one is the R squared score: 1 minus MSE divided by the variance of the actual values. These are the four metrics used to see how far off your predictions were from the actual values, and based on them you can judge the accuracy of your model.
The next kind of models, other than supervised models, are unsupervised models. With these models, we provide only features to train the model. We do not provide the labels.
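Before turning to unsupervised models, the four regression metrics above can be computed directly from their definitions. This is a minimal sketch with made-up prices in lakhs; only the first house (actual 80, predicted 120, a 40-lakh miss) mirrors the example from the video.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R squared from paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    variance = sum((a - mean_actual) ** 2 for a in actual) / n
    r2 = 1 - mse / variance
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


actual = [80, 60, 100, 40]      # prices in lakhs; all but the first are made up
predicted = [120, 60, 100, 40]  # the model misses only the first house, by 40
metrics = regression_metrics(actual, predicted)
print(metrics)
```

For this data the single 40-lakh miss gives an MAE of 10, an MSE of 400, an RMSE of 20, and an R squared of about 0.2. MSE and RMSE punish one large miss more heavily than MAE does.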
The model is able to identify hidden patterns in the features, and it learns from those patterns. The most popular models used for unsupervised learning are clustering models, which group similar data together. You can also call them grouping models, because they identify and group different kinds of data based on the features and the hidden patterns.
There is a lot of data in this world that has no labels, meaning no one has identified what the label for that data should be. In those cases we use unsupervised learning techniques to train models, and the popular models under unsupervised learning are clustering models. I'm not going to explain all of the clustering models, but one of the most popular is K-means clustering.
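The video does not go into how K-means works internally, so the following is only a rough sketch of the usual idea: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. It uses one-dimensional points, and the data, starting centroids, and function name are all made up for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points by repeatedly assigning them to the nearest centroid."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]  # unlabelled data with two obvious groups
centroids, clusters = kmeans_1d(points, centroids=[1, 12])
print(centroids, clusters)  # [2.0, 11.0] [[1, 2, 3], [10, 11, 12]]
```

No labels are supplied anywhere: the two groups emerge purely from how close the points are to each other.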
You just need to know that K-means clustering is one of the popular clustering models used for unsupervised learning. What happens in the background of these clustering models is that, given unlabelled raw data, they find patterns and try to group the data. So they create different groups or clusters out of the raw data that you provide. K-means is also popular because it is used with embeddings in LLMs, and we will discuss embeddings more when we learn about LLMs.
The final kind of learning that can be used to train models is reinforcement learning. In reinforcement learning, again, we provide only features, and the model learns by trial and error. The model provides an output. If the output is correct, we reward the model.
If it is wrong, we do not reward it, or we penalize it. Based on these rewards and penalties, it understands what the correct pattern and the correct output are, and it trains itself accordingly. This is called reinforcement learning. Now you might be thinking that reinforcement learning looks just like supervised learning. In supervised learning, the right answer is provided every time during training. In reinforcement learning, no right answer is provided. Sometimes you get a reward, sometimes you do not, and sometimes you get penalized, and based on that the model gets trained. For example, say you have an agent and you ask it a question.
If the agent's answer is correct, you give it positive feedback, and if the answer is not correct, you give it negative feedback. Based on this feedback, the agent understands what the correct answer for a given question can be, and it adjusts and trains itself automatically. This kind of learning is called reinforcement learning.
Let's talk about another important topic, which is dimensions and the curse of dimensionality.
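Before moving on to dimensions, the reward-and-penalty loop described above can be illustrated with a toy agent. This is a deliberately simplified, purely greedy sketch with hypothetical names; practical reinforcement learning methods also explore and deal with delayed rewards.

```python
def train_agent(actions, correct_action, rounds=10):
    """Learn by trial and error: pick the best-scoring action, then apply the feedback."""
    scores = {action: 0 for action in actions}
    history = []
    for _ in range(rounds):
        # Greedy choice; ties go to the earliest action in the list.
        choice = max(actions, key=lambda a: scores[a])
        reward = 1 if choice == correct_action else -1  # positive or negative feedback
        scores[choice] += reward
        history.append(choice)
    return scores, history


scores, history = train_agent(["A", "B", "C"], correct_action="C", rounds=5)
print(history)  # ['A', 'B', 'C', 'C', 'C']
print(scores)   # {'A': -1, 'B': -1, 'C': 3}
```

The agent is never told that "C" is correct; it only sees the feedback for what it tried, and that is enough for it to settle on the rewarded answer.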
In machine learning, features are also known as dimensions.
It simply means that if your data set contains two features, your data has two dimensions for prediction, and if it has 10 features, it has 10 dimensions.
Having more features or dimensions does not guarantee that your machine learning model will perform well.
This is called the curse of dimensionality.
It means that if you provide too many features or dimensions for prediction, your model can perform badly. Let me give you an example. Say you have lost your gold ring on a road that is 1 kilometer long. Finding it is pretty easy: you start at one end and finish at the other, moving in just one direction. But consider that you have lost your gold ring in a field that is 1 kilometer by 1 kilometer. Searching now takes more time, because the ring can be anywhere within this area. Similarly, if you lose your ring in a three-dimensional space, say a cube or a room, you have to go through the whole 3D space to find it. So as we increase the number of dimensions, the chance of finding the ring decreases, and it takes more time to find it.
So that is the curse of dimensionality. If you add too many dimensions that are of no use for your prediction, there's a very high chance that your model will not perform well. This is why in machine learning it is important to add only those features or dimensions that are important for prediction. Consider an e-commerce data set with around 100 dimensions or features for predicting an output. In those cases, we use something called PCA, which is principal component analysis.
This is a technique that reduces your dimensions to a meaningful number. It might reduce your 100 dimensions to 10, and those would be the meaningful dimensions. You can use them to train your model and avoid the curse of dimensionality. You just need to know that if you have too many dimensions, you can use principal component analysis to reduce them to a smaller number of meaningful dimensions.
The final topic that we will discuss today is feature engineering. Feature engineering is very important for preparing data to train your models because, as I already said, models are basically mathematical formulas and do not understand text. In addition, if you have numbers on very different scales, the models will not handle them properly. So we have to apply certain techniques to make sure that your data is prepared well and your model can understand the features well enough to make predictions properly. The first technique that we will discuss is normalization or standardization.
Consider that your data set has two feature columns: one is salary and the other is the age of a person. The salary values are, for example, 25,000 or 80,000, but the ages might be 17, 18, or 30. The salaries are much bigger numbers. If you provide these columns as they are to your model, the salary numbers would simply dominate the age. So we have to normalize both of these feature columns and make sure that both are scaled within the same boundary. We convert both salary and age to a scale of 0 to 1, so that both are meaningful to the model and neither dominates the other. This type of rescaling and normalization is done for numerical data types.
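One common way to bring columns onto a 0 to 1 scale is min-max scaling; the video does not name a specific formula, so this is an assumed choice. The salary of 52,500 is made up to give a clean midpoint.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


salaries = [25000, 80000, 52500]
ages = [17, 18, 30]

scaled_salaries = min_max_scale(salaries)
scaled_ages = min_max_scale(ages)
print(scaled_salaries)  # [0.0, 1.0, 0.5]
print(scaled_ages)
```

After scaling, both columns live in the same 0 to 1 range, so neither dominates purely because of its units.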
The next technique, which is used to encode categorical data, is one-hot encoding. Consider that you have color as a feature in your data set and it has three values.
For example, red, blue, and green.
Models do not understand text data, so we have to encode these values as numbers. We can convert this feature into three columns: the first is color red, the second is color blue, and the third is color green.
We place one where the value matches and zero where it does not. For a red row, we place one under red and zero under blue and green.
Similarly, for blue, we place zero for red, one for blue, and zero for green. And for green, we place zero for red and blue and one for green.
So now we have encoded this categorical column as three columns. This technique is known as one-hot encoding, and it is very popular for encoding categorical text data.
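The color example above translates into a few lines of code. This is a minimal sketch with a hypothetical function name; the column order (red, blue, green) follows the example.

```python
def one_hot(values, categories):
    """Turn each categorical value into a row of 0/1 flags, one flag per category."""
    return [[1 if value == cat else 0 for cat in categories] for value in values]


colors = ["red", "blue", "green"]
# Columns in order: color_red, color_blue, color_green
encoded = one_hot(colors, categories=["red", "blue", "green"])
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.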
The third kind of fix that we apply to our data is handling missing values. You can have a data set where some feature values are missing for certain rows. In those cases, you either drop those rows or fill in the values using the mean, the median, or the mode of that particular column. But you cannot leave those values missing.
You have to either drop those rows or populate the values using the mean, median, or mode of that column.
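Both options, dropping rows or filling with the mean, median, or mode, can be sketched with Python's standard `statistics` module. Representing a missing value as `None` and the function names are assumptions made for this example.

```python
from statistics import mean, median, mode


def fill_missing(column, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in column if v is not None]
    fill_value = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill_value if v is None else v for v in column]


def drop_missing(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if None not in row]


ages = [20, None, 30, 20, 50]
print(fill_missing(ages, "mean"))    # missing value becomes 30
print(fill_missing(ages, "median"))  # missing value becomes 25.0
print(fill_missing(ages, "mode"))    # missing value becomes 20
print(drop_missing([[20, 1], [None, 0], [30, 1]]))  # [[20, 1], [30, 1]]
```

Which strategy fits best depends on the column: the median is less affected by outliers than the mean, and the mode is the natural choice for categorical values.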
You cannot leave them missing because the accuracy of your model would be degraded. The final technique that we'll talk about is feature extraction. It simply means that you create more meaningful features from an existing feature. For example, if you have a timestamp value in your features, you create new feature columns for hour, minute, or date.
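The timestamp example can be sketched with the standard `datetime` module. The input format (an ISO 8601 string), the sample value, and the output keys are assumptions chosen for illustration.

```python
from datetime import datetime


def extract_time_features(timestamp):
    """Split an ISO-format timestamp string into separate date, hour and minute features."""
    dt = datetime.fromisoformat(timestamp)
    return {"date": dt.date().isoformat(), "hour": dt.hour, "minute": dt.minute}


features = extract_time_features("2024-03-15T14:30:00")
print(features)  # {'date': '2024-03-15', 'hour': 14, 'minute': 30}
```

A single opaque timestamp becomes several numeric columns that a model can use, for example to learn that emails sent at a certain hour are more likely to be spam.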
Similarly, if you have a text value, for example a comment, you create new features such as the sentiment score of that comment or its top keywords. And if you have a feature called address, you extract new feature columns for city, pin code,
or state. So you are creating new meaningful features out of your existing features. This very important technique is known as feature extraction. Other than these, there are many more feature engineering techniques in machine learning. For now, these four are sufficient to move forward in our course.
In this video, I've tried my best to capture as much information as possible about machine learning, but there are a few more terms that are missing from this video. I would request you to look those terms up yourself. Now that you are aware of some of the important terms, such as features, dimensions, the curse of dimensionality, how to measure accuracy, and what weights are, we are ready to move to our next chapter, which is neural networks.
In our next video we will discuss more about neural networks. Till then keep learning, keep growing and keep sharing.