Hierarchical clustering is an unsupervised machine learning algorithm that addresses the key limitation of K-Means clustering by not requiring the predetermined number of clusters (K). Unlike K-Means, which needs K specified beforehand and can produce suboptimal results if K is incorrectly chosen, hierarchical clustering automatically discovers the appropriate number of clusters through recursive partitioning (divisive/top-down) or merging (agglomerative/bottom-up) approaches. The algorithm can be visualized using dendrograms, which show the hierarchical relationships between data points and allow users to select the desired number of clusters by cutting the dendrogram at different heights. This makes hierarchical clustering particularly valuable when the true number of clusters in the dataset is unknown.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
AI Lecture Room Hierarchical ClusteringAdded:
Okay. All right. So, let's start this one now and let me show you the visual perspective. So, this is our second algorithm of listing. This is of course different from the previous cam in listing.
This is the so these are just a quick recaps about the came in clusting algorithm. CIN clustering is the partition unlabelled examples in distant subsets of clusters.
It is actually describing that we are using kind of unsupervised learning data unlabelled data for the clustering.
Samples within a cluster are very similar. That is why we want them to be grouped with each other. And samples in different clusters should be very different. That is why we want them in the different cluster. Discover the new categories in an unsupervised manner.
This is what we call the clustering. So gaming clustering is a very famous simple and powerful unsupervised machine learning algorithm. We have seen that it has been applied to many complex unsupervised machine learning problems.
And next is just the theory things.
Okay. So the K mean clusting algorithm tries to group the similar items in the form of clusters as you have seen in multiple examples. The number of groups is represented by K, it should be given beforehand. Yeah, this is what it is in C mean clustering. It is a kind of limitation of Kin clustering that if you cannot predetermine that how many different clusters or groups you want in your data set, you cannot apply KN clustering.
But this is what you will see the good point of hierarchical testing in today's lecture because of course in most of the cases we don't know how many different groups we have in our data set.
>> Okay. So in today's >> Yes. Yes. Go ahead.
itself learning algorithm.
>> So the learning of this algorithm is actually the final updated centride values. Okay, good question. So uh the question arises to everyone's mind is basically why we are calling it a machine learning algorithm? What is actually the learning in this algorithm?
Yes, good point. So uh as I I stated you for example you given it a data set of 2,000 points means 2,000 samples or in other words um as we discussed in our example today means the data set of 2,000 customers visits in any store. So what it will do is it will do all the calculations of course what we have studied so far. Uh it will do the cried calculations and try update updations distances calculations cluster assignment cluster evaluation but at the end once your came algorithm stopped it gives you the final centr values. Those actually final centride values helps you to to kind of calculate that whenever you have a new customer or whenever you have a new point that you don't know belongs to which segment or which cluster you can directly calculate only the distance of that particular point from all the finally given updated centride points and you can directly see or assign that particular new point to any nearest cluster. This is what the learning of amin is because it gives you the final centrides. Now you don't need to run the complete data set and the complete calculations again and again.
You just have to calculate the distance from the final centrides of every new incoming sample or you can say the data point or whatever you can say this and the new incoming sample will be assigned to the best possible group or the best possible cluster on the basis of the distance formula's value. Whatever the cluster will have the minimum distance from that point. This is what?
No. No. uh even in that case if you are getting more data points more than one more than two more than five more than 20 whatever the number you are getting you will be calculating the distance of each point or the each sample or the each number separately using the final rights that was already given to you by the CN clusting algorithm because the new incoming points are now basically um the points that needs to be assigned to any cluster. So the running uh again came in clusting algorithm is optional.
So you are kind of doing retraining of came in clustering for the new centroid calculations if you are doing it this way but actually we don't do this way at the end supervised learning.
Uh so the new data sets in most of the machine learning algorithms are basically used for the for the inference. U what inference means is basically whatever the knowledge you have given previously to the algorithm either it's unsupervised algorithm or supervised learning algorithm. Um this is what actually machine learning is because you have given it a handsome amount of data set or the examples already. Now you are actually trying to evaluate the models on the new data sets. So for this evaluation of course you will give the new data point as a test set and you will calculate the distances on the basis of the final centrids that was given by the came in clustering and yes if if you can uh evaluate it that the new assignment to this point uh of the CIN is not good are good on the basis of David Baldin index or sard score or many other evaluation criterias then if you see that there is some further improvement needed in the K mean then you can you should collect some more data set and you have to do again requesting improvement >> okay retraining yes in every machine learning algorithm further retraining depends on us we need to calculate the we need to evaluate the accuracy of the models and then if there is a room of improvement and there is a new data set also available then we go for the further retraining.
Okay. So uh in the cent initialization of kin clustering you remember we randomly select the initial centides.
Okay. This was the initialization sensitivity because if we are not able to select the better centrides then there could be a chances of more and more calculations in kin cluster uh there is someone's mic on please mute it so so repeat kines are basically for the uplation of centroidids again and again and kin++ is a smart centride initialization technique the difference between kmin and kmin ++ is simple that in K min we initialize the initial um centr randomly. In K minus there is a simple trick you can say uh it tries to select those points as a centride initially which are very away from each other so that the intercluster distance or the convergence can be made faster.
So that's just kind of trick in K minus+.
Okay. So these are the steps of K minus+. Yeah. This is how you can actually calculate the maximum distance between the points and then you can consider those points as a centr which are very away from each other. For example, you need to initialize four clusters. If it is a four group problem and you need to assign the uh four initial centrides, you will calculate the four points in the data set. Those have very away from each other.
Okay. Now finally toward the hierarchical clusting. The idea of hierarchical clusting is good and we are actually trying to overcome the limitation of Kin clustering which was the initialization of K number of centroidids and also the the kind of limitation that we need to tell that how many groups we want in this data set. So in hierarchical clusting luckily we have no need to give um predetermine any uh value that how many groups we want in this data set because in most of the cases we even don't know that how many different groups will be in this data set. Another point that I should tell you is basically that we were discussing with each other right now the accuracy of came in clustering is sometime impacted because of the wrong number of groups as well. Sometimes our data set has uh six number of groups but we give the k value four. So of course the if there there are actually six groups belong in that data set and we are trying to make uh divide it into four groups of course the quality of clusters will not be that good and the same way if there are six groups and we are trying to divide it into 10 groups of course the quality of clusters will not be that good. So that is why finding the suitable number of groups in that cluster and giving that value to the parameter k is very very hard and very experimental. There are some techniques you can google for the better uh clustering value. You can uh fine-tune that parameter in a more better way.
And in hierarchical clustering you can see that the um incoming value is for example this is just to give you an intuition that how actually the hierarchy listing works. So um for example if u there is an animal as a as a as a input you can say top and we need to decide that from which category this animal belongs. So what we will do is we will basically um kind of distribute or divide it on the basis of different features. So first of all we will check okay this is an animal either this is a vertebrae or invertebrate. Okay, this is a vertebrate then we will see okay if this is a vertebrae there are chances that this could be a fish this could be a reptile okay this could be amphophobia this is a mammal or if this is a invertebrate it could be a worm insect okay so the the kind of concept of hierarchical clustering is on the basis of features or the differences you can you can apply a recursive steps and eventually you will be able to to separate uh and uh to divide a thing into the into the correct number of category uh the correct number of uh group.
There are two main algorithms in hierarchical clustering very easy. One is top to bottom and the second one is bottom to top. So you can do it either way. Okay, both are recursive. If you are uh trying from top to bottom, you will start from a bigger picture and you will break down breakdown breakdown into on the basis of different differences differences differences. You will separate it into different groups and eventually you will have the the required or the correct number of groups at the end. And uh in the second bottom to up approach you will start from very big number of groups. Okay, maybe you can consider every point as a separate group and then you can group the similar points with each other and start grouping up to top uh up to whatever the level uh you want it to be grouped with each other or we'll see in the couple of slides.
So there are two types of hierarchical clusting. One is divisive. Divisive is actually as you see the name uh divisive is uh from top to bottom uh and the elomerative is from bottom to top.
So once I show you any example you will get a clear idea about this. So let's see uh first the receive listing step.
So in the receive what we do is because it is uh top to bottom. So at top initially we consider all the points in the data set belong to one single cluster. So whatever our data set is we consider it as a single cluster because it is a top to bottom approach and then on the basis of differences between those points we we partition them we actually separate them from each other on the basis of differences.
The second step is partition the cluster into two least similar cluster. then then again apply the same thing on each cluster and then again apply same thing on each cluster. So proceed recursively to form the new clusters until you get your desired number of clusters or the second thing that I will tell you later on the basis of which you will stop actually the stopping right here. Okay.
So let's see this is a data set you can clearly see this is just for illustration purposes of course you cannot clearly see this in in case of uh real data set. So uh just for this example purposes you can see actually there are three groups in this data set.
Okay, what we did is we consider all the points you can see you we consider all the points from this single group because we are applying the VC algorithm. Okay. So we considered all the points from a single group and then in the second step on the basis of the differences of course we are again using the distance algorith distance formulas for to calculate the differences between each other and many uh we will see the other formulas in using in this algorithm in next slides.
So on the basis of the differences we first divided this into two groups. All right. And then we apply the same differences or distance calculation formulas on both of these separately.
Okay. And in the next slide you see we can uh next uh figure you can see that we can clearly see that it separates out because there is of course huge difference and the distance between these two. So it made it a separate cluster. Eventually we got this output.
It is similar like CN clustering. Output is similar like CN clusting. But the good thing is there is no need to give it uh beforehand that I want to create the three clusters. It will do this automatically for you. Even if you don't stop it one important thing if you don't stop it at this stage it will further create the clusters within clusters.
Okay. So even if you don't stop it, it may go to a single point to make it a separate cluster.
Okay. So how we actually partition them? This is what the point is. This is what I was saying to you. You can see in this cluster, if we don't stop this algorithm, it created this as a separate group.
And as I said to you, if you don't stop it, if you don't calculate any evaluation criteria to stop this algorithm, it may go and make your single point as a cluster, every single point in the data set as a cluster because it is a top to down approach. So of course in bottom or on the leaf node, you will have then every single point as a separate cluster. So what we do to stop it or to calculate? So we find the sum of squared errors of each cluster and choose the one with the largest value. Okay. And on the basis of that we partition it from the other clusters.
Okay. The second is the elomerative clusting. This is basically bottom up approach.
Okay.
You can I hope directly relate it now.
So what we do initially is we consider all the data points as an individual cluster.
Now we are considering every data point, every sample, every customer whatever you can say according to your data set as a separate cluster and then we calculate the similarities between each other and then on the basis of the similarities we group them and move towards upside. Okay, grouping them, grouping them, grouping them. And take the two nearest clusters and join them or the group them to form a single cluster. Then proceed recursively until you get the desired number of clusters or desired number of accuracy or the similarities between the clusters or any other stopping condition met you can stop. Otherwise just like divisive if you don't stop elmerative at the top of a glumer or at the end of elmerative elumerative will make or merge your whole data into one cluster.
Okay. So this is the data set we have in step one. What we did is we considered every single point as a separate cluster. Cluster one, cluster two, cluster three, cluster four and same thing no difference just calculate the distances okay of every kind of point from each other and then merge the very nearest points with each other. You can see this has been merged, this has been merged, this has been merged, this has been merged and this has been merged. Okay. In the next iteration now you calculate the distances between these two this these two and this and then on the basis of this distance you will again merge and of course these three points will merge together and you got this cluster and these will merge together and you got this cluster. If you see or evaluate the accuracy of clustering to me this clustering seems good. If I want a very narrow level clustering, this clustering seems to me good. The the points in these clusters are very uh kind of good.
Their dispersion seems to be good. But if you want less number of clusters because total number of clusters at this stage is six. You want less number of clusters, you can run it for one more iteration and you will get this. Of course the quality of this clustering is good and you are getting two number of cluster dispersion will be good at this stage. But if you don't stop this algorithm this will make your complete data set as a single cluster and now you can see intercluster distance is low and the intracluster distance is high. So dispersion or the quality of this clustering will drop automatically.
Okay. In case of divisive it is a top to bottom approach we are considering all the data points from the single cluster and then in the second iteration we calculated the distance of every point from each other and then the points those have the less distance from each other we are grouping them and you can see we considered these two clusters and then of course we will calculate within each cluster the distances from each other and then we will merge. Uh so the same way if if we see the kind of u clustered quality to me this cluster seems much better. So 1 2 3 4 5 and 6 point. So if we calculate the the clusters quality score. So this clusting seems to seems better to me. But of course if we are not interested into six number of clusters okay we want more number of clusters we can run it for more one iteration and at this iteration it will make your every point a separate cluster. So it is on you where and how you want this algorithm to stop. After your desired number of clusters achieved or your desired number of clusters quality score is achieved or your desired number of iterations or steps of this algorithm has been completed sir.
Yes sir.
>> Stopping condition number of clusters define.
>> So there could be three stopping conditions as I mentioned. One could be your desired number of cluster. The second could be the quality score of the listing and the second uh the third could be the total number of iterations that you want to run this algorithm.
Any of these could be the stopping.
So previous.
>> So you are talking about this one right?
>> Yes sir.
>> Okay. So uh what we did is first of all we considered all these points uh as a single cluster and then of course we are calculating the distances of every point from the other point and then um I mean it's very easy we can clearly see that uh the distance of these points are less from each other so that is why we make them or group them uh as a single as a separate cluster sorry and the distance of these points um are less with each other and high from these points. So that is why on the basis of this dissimilarity we separate these points as a separate cluster and in the same way now next we will calculate the distances again with each other. We will see that if there are high distances, high dissimilarities between these points, we will consider those points as a separate cluster. And you can see it here.
This is how it actually works.
Okay. So this is an example of another data set.
You have 1 2 3 4 5 points on a graph.
You can calculate their XY values and you can maybe use those values and of course you can draw such kind of so dendrogram is very important thing in hierarchical clustering. So for Kian clustering you know we visualize it on XY position as a as a data points. So in hierarchical clustering what we use is we use dendrogram for the visualization most of the times and the dendrogram is actually showing that because we are using the hierarchical clusting this is how actually it uh looks like. For example, if you see that this belongs to one group, which means your cluster number uh four and three and this belongs to one group which means you are point number two and this belongs to one group which means you are point number five. This belongs to one group which means you're point number one. Okay. So if if you see from this lendagram you can directly relate the number of points and the points those are covering under the same group on the basis of the dendro. Thank you.
>> Okay. Yes. What I'm trying to say is basically in the dendrogram you can actually see that which points are actually belonging to which group. Okay.
So you visualize this on the basis of gender. We can visualize both the elomerative and divisive using the dendrogram. So if we see in a bigger picture and we see as a big cluster as a big group. So we can see that this big cluster has how many points? Sorry the line is not getting drawn straight. If I draw a straight line, so this can tell us that this cluster covers the points between this and those are the four and three and this cluster covers the points.
Sorry again not can draw this straight line. So anyways, I just wanted to show this dendrogram actually shows you the coverage of the points in that particular cluster like which points belong to which cluster.
Okay. So you can so how many different clusters are there in this data set? You can see the number total number of the bars in the dent. So 1 2 3 and four.
Okay.
So there are four different clusters. So if you see on this data set so these two belongs to should belong to one cluster.
This should be a separate cluster.
This should be a separate cluster. This should be a separate. So there are total four different clusters in this data set. So you can visualize this from this image directly.
What we can do next is okay. So this is what it is. Yes.
and okay. So the dendrogram actually shows you the relation of of because this is hierarchical clustering it is it shows you the relation in different perspectives. For example, if if I want that the group should be very dissimilar to each other. If I want very dissimilar groups, I will consider a very narrow kind of u uh the the dendrograms bar. Okay. If I want them, okay, so I I I am able to accept the a little bit dissimilarity.
So I will accept then. So very similar if we we are only accepting very similar points. So these two points will belongs to only this cluster. And if we want to accept it, okay, we are we can accept a little bit this similarity. So we will go to the parent of this bar. We can accept the parent of this bar as a part of this cluster. And then it will become up to this and it will cover this as a whole. And if we want okay we we can accept um a little more dissimilarity but we want less number of clusters in our data set we we don't want to have too many number of clusters then we will accept the parent of this as well and in that case of course you will sum this up.
Okay. And your cluster will become like this. And because this is again uh you can say the part of this main parent. So eventually this is what I was saying to you. You will not stop your algorithm.
It will start from good clustering or kind of separate clusting and it will goes up to the complete data set as a single cluster and you will end up with this clustering.
And if in the same way you see it as a top to bottom approach. So first of all you will consider it okay what is the the biggest group and how should we consider it?
You will start from top and these points will belong to the one cluster at this stage. Then you will okay I want more similar clusters. I want to reduce the dissimilarities. It will give you three clusters.
Let me show you.
If you want that no I want this dissimilar cluster. It will give you three different clusters. One 2 3.
And in this your uh you you can see the different number of three different number of clusters in your render graph. And if you say that no I want uh a little more similarity between the data set and more number of clusters you can even come one step down to see that. So uh sorry one minute let me draw it very clearly.
If we want that a top to down approach the complete number of clusters it mean you will consider the top cluster and the top cluster has coverage of these points. If you want little more similarities between the cluster then reduce the dissimilarity.
So you will consider a child and the child has these number of clusters and if you want more similarity and reduce the dissimilarity then you will have the three number of childs and this 3 five 4 and 1 The dendrogram actually actually show the visualization of your clusters both ways.
You can draw this. Of course, top down.
>> Yes. Yes. Yes. Yes. Yes.
And in the next slide you can see again one numerical example of the both elomerative and divisive clusting. I will of course encourage you please practice this. Um you can use this URL.
There is a good example there uh for the by hand numerical calculation and you can easily I mean practice this example. And the second link is actually showing you the animation of different clustering algorithms. You can go and uh see those for better understanding. And in the next class we might have some quiz or maybe any other thing related to this.
Yes, if there is any question you can ask otherwise we are already over time and we are stopping it here.
Related Videos
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 views•2026-05-29
Long-Running Agents — Build an Agent That Never Forgets with Google ADK
suryakunju
142 views•2026-05-30
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K views•2026-05-28
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 views•2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 views•2026-05-29
3D Platformer Update - NO CAPES
SolarLune
294 views•2026-05-30











