K-means clustering is an unsupervised learning algorithm that groups data points into K clusters by iteratively assigning each point to its nearest centroid and updating each centroid to the mean of its assigned points. The Davies-Bouldin Index evaluates clustering quality as the ratio of intra-cluster distance (the spread of points within a cluster) to inter-cluster distance (the distance between cluster centroids). A lower value indicates better clustering: points are tightly grouped within clusters and the clusters are well separated from each other.
Deep Dive
AI Lecture Room: KMeans Numerical Example, M Aamir Gulzar
Okay, so hopefully you can see my screen with the K-means clustering numerical example.
Okay, good.
I'm foreign student, otherwise continue.
I don't know. I think... are you there?
Okay.
So, this is an example data set.
You can ignore this column because it is just a point ID: 1, 2, 3, 4, 5.
It does not carry any importance.
This data set comes from a brand or store that has recorded each customer's spending in dollars and the frequency of their visits to the store. For example, the customer named Elise spends $5 and visits 12 times, and Bob spends $6 with 10 visits. So, this is a data set of customer visits for a brand, and the company wants you to segment the customers by creating groups from this data set: the students or commuters, the occasional customers, and the premium or loyal customers.
Generally, a premium or loyal customer who visits will of course be making purchases. So, the number of visits and the amount spent should go hand in hand for premium and loyal customers.
In the case of students or commuters, there will be many visits, but the spending is very low.
The occasional segment will have a very low frequency of visits, because it is occasional, and of course the spending will be low as well. So, these are the three groups that we have to create from the data set the company gave us. I have used only a small number of entries for easy calculation.
A company can have around 100, maybe 1,000, records of such customers.
Okay.
So, because the requirement is to create three groups, we know that the number of clusters K in our K-means algorithm will be three.
Because we need to create three segments of customers:
the student segment, the occasional segment, and the premium segment.
Okay.
The second step is to initialize the random centroids. All right.
So, what we did is take point one, on the basis of the point ID,
as the centroid for the first segment, the students or commuters.
And point five as the centroid for the second segment.
And point 10 as the centroid for the premium or loyal segment. It is not necessary to use only these points.
It's just a random selection. Okay?
And we are using these two features, spending and frequency.
Okay? We are using these two features.
So, because of these two features, we treat spending as X and frequency as Y.
Just as an example, if there is another feature, each sample will have three values.
Okay? Currently, we are taking two values per point: this is spending, and this is number of visits.
So, this is just to show that K-means clustering is not limited to tabular data sets of X-Y points that we can visualize. We can apply K-means clustering to a data set with any number of features. Even if your data set has 10 features and all 10 are important, each point will simply contain 10 values per sample.
And then we will see how we manage this from here onwards. We calculate the distances and everything else on the basis of all those values.
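To make the point about feature count concrete, here is a minimal sketch of a Euclidean distance that works for any number of features. The function name `euclidean` and the third feature (an age of 30) are assumptions made up for illustration; only the two-feature values (5, 12) and (15, 3) come from the lecture.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of features."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two features (spending, visits), as in the lecture's data set.
d2 = euclidean((5, 12), (15, 3))

# Three features: the third value (age = 30) is a hypothetical illustration.
# Because both points share the same age, the distance does not change.
d3 = euclidean((5, 12, 30), (15, 3, 30))

print(round(d2, 2), round(d3, 2))
```

The same function is used unchanged whether a point has 2 values or 10, which is the only thing the formula needs.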
Okay, so once we initialize the random centroids, this is your point one centroid, and the value of point one was (5, 12).
Okay?
So the distance of (5, 12) from its centroid (5, 12) will of course be zero.
Okay? Now these three columns show the distance of every point from each cluster centroid. Just like the example we studied in the previous class, we calculate each point's distance from every cluster. So for point P1, the distance from cluster one is 0, because the centroid of cluster one is the same point (5, 12).
The centroid of cluster two is (15, 3), so we calculate this value directly: (5 - 15) and (12 - 3). Put these into the formula, square them, add them, and take the square root. This is the Euclidean distance, and you get the distance value, about 13.45. In the same way we calculate this point's distance from cluster three as well, and it is 23.9.
Now compare the distances: the distance from cluster one to this point is zero, from cluster two it is about 13, and from cluster three it is about 23.
So, which cluster do we assign to this point? Of course, the one with the minimum distance, and that is cluster one.
So, point one will be assigned to cluster one.
In the same way, you calculate for the second point (6, 10): the distance from cluster one, from cluster two, and from cluster three. You get the values 2.24, 11.40, and 22.36.
Which cluster has the minimum distance?
It is cluster one. So, of course, we assign cluster one to this point. In the same way, going forward, you calculate every point's distance from all the clusters and then assign each point to its nearest cluster. And here you can see which points belong to cluster one, which belong to cluster two, and which belong to cluster three.
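The assignment rule can be sketched in a few lines: given one row of distances, pick the cluster with the smallest one. The helper name `nearest_cluster` is an assumption for illustration. The distance rows are the ones read out in the lecture, except that 13.45 is the computed distance from (5, 12) to (15, 3).

```python
def nearest_cluster(distances):
    """Return the 1-based index of the cluster with the smallest distance."""
    return min(range(len(distances)), key=lambda i: distances[i]) + 1

# One row per point: distance to cluster 1, cluster 2, cluster 3.
rows = {
    "P1": [0.0, 13.45, 23.9],
    "P2": [2.24, 11.40, 22.36],
}

assignments = {name: nearest_cluster(d) for name, d in rows.items()}
print(assignments)
```

Both points land in cluster one, matching the lecture's table.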
The next step is the centroid calculation.
In this centroid update step, we calculate the new centroid values.
For the new centroid values, as you studied in the previous class, we use the current assignments and take the mean of the points in each cluster.
So, these are the points that we assigned to cluster one.
So, of course, we calculate the mean of these points.
How do we calculate the mean of these points?
We sum the X values of all the points in this cluster, 5 + 6 + 4 + 5, and divide by four, and we do the same separately for the Y values. So, (5 + 6 + 4 + 5) / 4 gives 5, and the sum of all the Y values divided by four gives 11.5.
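As a rough sketch of the update step, the centroid is just the per-feature mean of the cluster's points. The X values 5, 6, 4, 5 and the first two points (5, 12) and (6, 10) come from the lecture. The Y values of the third and fourth points (11 and 13) are not read out, so they are assumptions chosen to be consistent with the stated mean of 11.5. The function name `centroid` is also illustrative.

```python
def centroid(points):
    """Mean of each feature across the points of one cluster."""
    n = len(points)
    return tuple(sum(values) / n for values in zip(*points))

# Y values 11 and 13 are assumed (see the note above); the rest is from the lecture.
cluster_1 = [(5, 12), (6, 10), (4, 11), (5, 13)]

new_centroid = centroid(cluster_1)
print(new_centroid)
```

Because the function iterates over `zip(*points)`, it handles three, four, or ten features without any change.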
This is what I was saying earlier: suppose there are three features, for example a new feature for the age of the customer. Then each point has three values. We initialize the centroids with three values, and here we calculate the new centroid with three values as well. Okay? If there are four or five features, this follows the total number of features.
So, you get these new centroids. Your new centroid for cluster one is (5, 11.5), for cluster two it is (16, 3.5), and for cluster three it is (27, 14).
Now, of course, the next step is to use these new centroids to reassign the clusters, and so on and so forth, but I am going forward to the quality evaluation of these clusters, because what we would actually do is repeat all these previous calculations in the second iteration, the third iteration, and the fourth iteration. Let's assume this iteration is a good one and move forward to the next step, the Davies-Bouldin index quality evaluation.
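The repeated iterations that are skipped here can be sketched as a loop that alternates assignment and update until no point changes cluster. This is a minimal illustration, not the lecture's worked table: the function name `kmeans`, the `max_iter` parameter, and the empty-cluster handling are assumptions, and only the first two points below come from the lecture; the other four are invented toy data.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Repeat assignment and centroid update until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Toy data: (5, 12) and (6, 10) are from the lecture, the rest is hypothetical.
points = [(5, 12), (6, 10), (15, 3), (17, 4), (28, 14), (26, 14)]
initial = [points[0], points[2], points[4]]

labels, centroids = kmeans(points, initial)
print(labels)
print(centroids)
```

On data this well separated the loop stops after the second pass, because the second assignment step reproduces the first.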
Okay. So, for cluster one, using the final (updated) centroid, you calculate the distance of every point from this new centroid. Remember the formula we studied in our previous class.
I think you cannot see the other screen now. Let me share my full screen with you.
And hopefully you can see the entire screen.
Yes. So, remember this formula from our previous class: for the delta calculation of each cluster, we subtract the final centroid from every point, square the difference, sum the results, divide by the number of points in that cluster, and then take the square root. This gives you the delta value of that particular cluster.
So, these values S1, S2, and S3 are delta one, delta two, and delta three, one for each cluster.
So, this 0.5 comes from calculating this point's difference from the final centroid. Okay? And you get this particular value. There are four points in this cluster, so we divide by four and get the final value 1.20. In the same way, for the second cluster we get 1.81, and for the third we get 1.91.
What we have calculated is what we call the intra-cluster similarity. Once we have the intra-cluster similarity, we need to calculate the inter-cluster similarity and then divide. So, these steps are missing. We will calculate them now. Yes.
Okay. So, yes, this is what I am explaining: these values come out once you use the final updated centroids, okay?
And put them in this formula.
So, you take every point minus the updated centroid, squared, and you get this value.
Let me show you. I think if I ask it for the calculation, I can show you again.
Yes. So, these are your updated centroids, okay? (5, 11.5), (16, 3.5), and so on. Now we have to calculate the dispersion, and for the Davies-Bouldin index dispersion we need to calculate two things. One is the intra-cluster similarity using this formula, and the other is the inter-cluster similarity using the direct differences between centroids, okay? Then we divide the intra-cluster similarity by the inter-cluster similarity to calculate the dispersion.
So, using the updated centroid, we put the values into this formula. It is the same formula: you can see the square root, the 1 over N_k for the total number of points in that cluster, and the mean (centroid) of that cluster.
So, the total number of points in the first cluster is four.
You can see here.
The number of points assigned to the first cluster is four.
And the updated centroid of this cluster is (5, 11.5).
To use these numbers, you can directly put in 1 over N_k, which is 1 over 4.
Then use the updated centroid to calculate these values. For the first point, start with (5 - 5) squared.
The first point's X value is five, as you can see here, and the X value of the centroid is five. So, (5 - 5) squared, and then the Y value is 12, so (12 - 11.5) squared, and you get 0.25 for this point.
Then do the same for the second point, the third point, and the fourth point. For these four points, you get the values 2.25, 1.25, 3.25, and 0.25. These are the squared distances from the centroid, the part of the formula inside the sum.
Now we need to divide by the total number of points to get the final value. So, we sum all these values, which gives seven, and divide by the total number of points in this cluster, which gives 7 / 4. Then we take the square root of this value, and you get 1.32.
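The arithmetic for cluster one can be checked with a short sketch. It takes the four squared distances read out above as input; the function name `dispersion` is an assumption for illustration.

```python
import math

def dispersion(squared_distances):
    """Square root of the mean squared distance from the points to their centroid."""
    return math.sqrt(sum(squared_distances) / len(squared_distances))

# Squared distances of the four cluster-one points from the centroid (5, 11.5).
s1 = dispersion([2.25, 1.25, 3.25, 0.25])
print(round(s1, 2))
```

The sum is 7, the mean is 1.75, and the square root rounds to 1.32, as in the lecture.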
So, in the same way... Dispersion, okay. So, what is dispersion actually? Let me show you with an example.
Okay.
Let me find a suitable example. Let me just edit.
Okay.
Here, you can see in this example, we did K-means clustering on this data set. These are the original data points, and this is the visualization of the data points as clusters after K-means clustering.
So, you can see this is your first cluster.
These are your second cluster's values.
And these are your third cluster's values.
Dispersion actually combines two values, okay? To evaluate the quality of these clusters, we calculate two things. One is the intra-cluster distance: how far apart are the points within a cluster?
The second is the inter-cluster distance: how far away are the clusters from each other? If you look at this grouping, the intra-cluster distance, meaning the distance of the points in a group from each other, is very low. And the inter-cluster distance, the distance of the clusters from each other, is high.
This is what we actually want. This is good clustering, and this is bad clustering.
Why is this bad clustering? Because the intra-cluster distance is high: the points are not close to each other.
And the inter-cluster distance is very low, when it should be high.
So, to measure this, we calculate the dispersion.
Dispersion shows how far the points are from their mean within the cluster, relative to how far the clusters are from each other.
So, dispersion is basically the ratio of intra-cluster distance to inter-cluster distance. This is what we call dispersion: intra-cluster distance divided by inter-cluster distance, calculated once you are done with the clustering or K-means grouping.
This is the formula for the dispersion calculation that we are using.
The upper part, the delta terms, calculates the intra-cluster distance, and the denominator, the lower part, calculates the inter-cluster distance.
We divide the intra-cluster distance by the inter-cluster distance and we get the dispersion.
The dispersion is calculated for every pair of clusters. If there are three clusters, it is done for three clusters, and likewise for four or five clusters. And yes, if there are more than two clusters, we form pairs and calculate the dispersion pair by pair. You can see this from this example, just give me a minute.
From this example, you can see that if we want to calculate the distance between clusters, we select any two clusters, this one and this one, and calculate the distance between those two. Then we select another two clusters and calculate the distance between those. So, if there are more than two clusters, you make pairs of two and calculate the distance within each pair, and then we will see next in the example how we finally evaluate the quality of these clusters. Yes.
Yes. Yes, it will try all the combinations.
But before trying all the combinations, you first calculate the intra-cluster distances. That is what we are doing here, because you need these values for the different combinations.
So, first you need to complete the calculation of this formula. You calculate the delta for each cluster: delta C1 for cluster one, delta C2 for cluster two, and delta C3 for cluster three. In the same way, you take the centroid of C1, then C2, for the inter-cluster distance, and then you can use these values for the combinations.
This is what we are doing here.
First of all we calculate the delta values. These are the delta values we calculated. Then we calculate the distances pair by pair: C1 minus C2 is the distance between cluster one and cluster two.
We directly use their centroids. Centroid one means the final updated centroid of cluster one, which was (5, 11.5), as you can see. And the updated centroid of the second cluster was (16, 3.5). So, we calculated the distance between these two and got this value. Then we do the same between C1 and C3, and then for the third combination, C2 and C3, and you get these values. Now, once you have all these values, the inter-cluster distance for each pair and the intra-cluster distance of each cluster, you finally calculate these values, combination by combination, as you can see here.
For example, for C1 and C2, we sum the intra-cluster distances of C1 and C2 that we calculated here, 1.32 and 1.94, and divide by the inter-cluster distance between C1 and C2.
And then you get the final value of 0.239.
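The pairwise value for clusters one and two can be reproduced with a small sketch, assuming the updated centroids (5, 11.5) and (16, 3.5) and the intra-cluster values 1.32 and 1.94 quoted above. The function name `similarity` is illustrative.

```python
import math

def similarity(s_i, s_j, c_i, c_j):
    """(intra-cluster spread of i + spread of j) / distance between centroids."""
    return (s_i + s_j) / math.dist(c_i, c_j)

# Intra-cluster values and centroids as quoted in the lecture.
r12 = similarity(1.32, 1.94, (5, 11.5), (16, 3.5))
print(round(r12, 4))
```

The centroid distance is the square root of 185, about 13.60, so the ratio comes out at roughly 0.24. That agrees with the 0.239 on the slide up to rounding.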
Then for the second combination, clusters one and three, you use the sum of their intra-cluster distances divided by their inter-cluster distance. And for the combination of clusters two and three, you do the same, and you get 0.253.
Once you have calculated the values for all the combinations, we can find the final value of the Davies-Bouldin index. For each cluster we find the maximum similarity value, using a simple max function. First we compare the two combinations (1, 2) and (1, 3), then (1, 2) and (2, 3), and then (1, 3) and (2, 3), to get the max similarity value for each cluster.
Once we have the max similarity values, we sum the three of them and divide by the number of clusters, which gives the average, or mean, similarity. This is the dispersion value for all three clusters.
If this dispersion value is low, that is good. If the dispersion value is high, that is not good.
So, a low dispersion value is good.
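The final step, taking the maximum per cluster and then averaging, can be sketched as follows. The three pairwise values are the ones quoted in the lecture (0.239, 0.152, 0.253); the function name `davies_bouldin` and the dictionary layout are assumptions for illustration.

```python
def davies_bouldin(pairwise):
    """Average, over clusters, of each cluster's largest pairwise similarity.

    `pairwise` maps a cluster pair (i, j) to its similarity value.
    """
    clusters = sorted({c for pair in pairwise for c in pair})
    # For each cluster, keep the largest value among the pairs it takes part in.
    worst = [
        max(r for pair, r in pairwise.items() if c in pair)
        for c in clusters
    ]
    return sum(worst) / len(worst)

# Pairwise similarity values as read out in the lecture.
r = {(1, 2): 0.239, (1, 3): 0.152, (2, 3): 0.253}

db = davies_bouldin(r)
print(round(db, 4))
```

Here the per-cluster maxima are 0.239, 0.253, and 0.253, so the index is their mean, about 0.248.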
Why? Because if you look at this formula, it says intra-cluster distance divided by inter-cluster distance.
The intra-cluster distance should be very low, which means the numerator will be small. And the inter-cluster distance, the distance between the cluster groups, should be high, which means the denominator will be large. Only when your numerator is small and your denominator is large do you get a low dispersion value, and this is what we actually want. A low dispersion value is evidence that your clusters are good, and a high dispersion value is evidence that your clusters are not good: either their intra-cluster distances are high or their inter-cluster distances are very low.
Hope it makes sense now.
Yes, sir.
Good point. These are your dispersion values, calculated pair by pair. We take the maximum similarity because, at this point, the similarity shows how similar the clusters are to each other, so the maximum is the worst case for each cluster.
So, that is why we are trying to minimize it.
This is just because with more than two clusters we tried different combinations.
So, out of these combinations we take the max value for each cluster. For example, when we were calculating the similarity value for clusters one and two, we got 0.239.
Then we compare these pairwise values with each other. So, we compare the value of S12, meaning clusters one and two, with S13, meaning clusters one and three, because no direct comparison was made before this calculation. So, now we are finding the maximum similarity for cluster one across these two pairs. S12 is 0.239 and S13 is 0.152.
So, out of these two values, the maximum is 0.239.
Yes. Yes, it does.
Yes, any other question?
Okay, so I will share this example with you people on the JCR.
I generated it with Gemini, step by step, for your better understanding. I will try to compile it into a single document if I can, and then share it with you after this lecture.
Okay, so if there are no other questions on this, we move on to the next topic, which is hierarchical clustering.
The purpose of this example, and of spending more time on K-means clustering, is to give you a good overall picture of clustering so that we can now start another clustering algorithm. Hopefully, you will be able to grasp it easily. Let me restart our recording session before starting this new lecture. If there is any question on the previous example, you can still ask. Otherwise, I'm going to start again.