Prototype-based explanation (ProtoPNet) is an interpretable deep learning architecture that makes image classification transparent by comparing local image patches against learned prototypes. Instead of judging the entire image at once, the system scores small patches of the latent feature map by their L2 distance to each learned prototype and converts those distances into similarity scores. The similarity scores are then combined through a weighted linear combination to generate class predictions, providing case-based explanations that justify why an image belongs to a particular category. This approach is particularly valuable for high-stakes applications like medical diagnostics, where understanding the reasoning behind predictions is essential.
Explainable AI: ProtoPNet explanation technique
Very good evening, all.
This is the video lecture series on explainable AI, and in this video we are going to look at the topic called prototype-based explanation, that is, ProtoPNet.
This is another important topic. The prototype-based explanation is quite different from what we discussed in the earlier lectures. In the first lecture we discussed glass-box and black-box models, and after that LIME, SHAP, CAM, and Net2Vec, that is, how the information is processed using vectors.
ProtoPNet is a totally different concept, but it is a very easy theme to understand, and everyone can see how naturally the prediction becomes explainable. We will go through it one step at a time. Here you can see the blueprint of what we are taking into account. This single diagram itself explains ProtoPNet, which is what we are going to cover in the rest of this lecture.
Everything is told in this single slide.
You can see the words "this looks like that". The meaning is that, on seeing a single piece, we can recognize what the whole thing is going to be. Suppose we see the tusk of an elephant. Without seeing the entire image we can tell that this is an elephant. We do not need to scan the entire image of the elephant; showing just the tusk is enough. Let me draw it here so that you can understand. I am drawing a rough diagram.
So please bear with me if it looks like this.
Okay. This is the example.
The point of interest may be either this portion or this portion. On seeing these two points alone we can predict that this is an elephant. On what basis are we predicting? Based on two consistent patches: one is the tusk and the other is the trunk. No other animal has a long trunk and tusks like an elephant, so we can easily tell that this patch is part of an elephant.
In the same manner, you can see here that we are projecting only one portion. This is called a patch of the image. It contains the feather colors, the height of the wings, and so on. We are going to calculate the distance between two patches. Based on the closeness of the patches we can confidently say that this belongs to a jay, a blue jay, a sparrow, or an eagle. Whatever image we show, we predict what it is based on the patch information only, without scanning all the pixels of the image.
What is the advantage? The confidence increases, and we can also say, "due to this reason I came to the conclusion that this patch belongs to this animal, this bird, or this disease." In the rest of the lecture we are going to see the different ways in which we can match patches against given criteria. This is called ProtoPNet. "Proto" stands for prototype.
Suppose you are building a big engine. Before that we have a small prototype toy engine, and based on it we can say that this is a railway engine, although it is not the actual engine, just a prototype. In the same way, based on small patches we are going to say that this is the image of a particular object or a certain text. We will see this step by step.
ProtoPNet is an interpretable deep learning architecture designed to make image classification transparent and explainable. This is the main thing you have to understand. It does this by comparing local regions of the image with learned prototypes. We have already trained on similar patches: for a sparrow, for example, these are the different patches we train on.
When you give a new input, only the small patches, that is, the regions of interest, are taken into account. Depending on their closeness to the previously learned object parts we reach a conclusion: the closer the match, the higher the confidence score. This is the theme of prototype-based explanation.
This is the saliency map. You can see here the post-hoc saliency.
We are placing a heat map over the sparrow image. Only after looking two or three times can we understand that this is the image of a sparrow or a bird, because the heat map spreads over many regions, outside and inside the object, and only one glowing portion is superimposed.
A saliency map shows where the model looks, somewhere around the neck portion or the wing portion. But it does not explain the logic of why the model is looking at that portion.
Sometimes the model may look in exactly the right place, and sometimes it may look at the surroundings and still predict. Take dog versus husky: when there is snow in the background it predicts husky, and when there is no snow it predicts dog. The logic is totally non-transparent: it is looking somewhere and it is predicting, so it fails to provide legally accountable evidence. If someone asks on what basis you are deciding, there is no proper answer beyond "it looked at this region". What we do instead is form patches and train on them. Based on the previously learned patches we can conclude that this is the image of a particular object. This is a mimicking procedure: several data sets are trained, several patches are created, and based on the closeness of the patches we conclude that a particular region matches a particular prototype.
That is why this is called prototype-based explanation.
There are different patches. By matching one patch with another, based on the closeness of the two, we can reason that this particular object belongs to a particular category because of this match.
This is the detailed architecture of ProtoPNet. First we take one image for reference, and we extract its features, here as a 7 × 7 × 128 feature map. Everything is categorized into different prototypes, P0 to P15. A particular patch is then mapped to a particular entry of the array. This portion is mapped to P0. The topmost region, the bottommost region, or the middle region is each mapped to a consistent point, and that consistent point holds the P value. When we give different images, the topmost portion is once again mapped to P0, the middle portion to P1, and so on; this varies with training. Then we connect everything to a fully connected layer with learned weights. The weights are adjusted, and we generate two logits, one for normal and one for pneumonia. Based on the comparison with the prototypes and their similarities, we can conclude whether the person in this image is suffering from pneumonia or not.
Here is a well-explained diagram. I am taking the sparrow as the example.
The first block is called the backbone.
Its output has the dimensions height, width, and depth, that is, the H × W × D format. The image is split into parts and converted into arrays. This is called the latent feature map. It is the same mapping I showed earlier, now shown as individual maps. After that we compare each newly extracted patch with P1, an already learned prototype. Is this patch equivalent to P1 or not? P1 already holds a certain patch, and we check whether the new patch matches it, just like comparing one blood sample with another. Then the next sample, and the next. Similarly, is it equivalent to P2 or not? Is it equivalent to Pm or not? Based on that we calculate the similarity scores, that is, how similar they are. You can see here that the first score is 3.954, the next is 1.447, then 2.6. These are computed by the prototype algorithm itself, based on the learned weights.
And finally 3.578.
Different images have different weights and similarity scores. Finally all of them are combined in a linear head, that is, a weighted linear combination of the similarity scores, as I said earlier. Based on the linear weights it outputs the class logits for classes such as black-footed albatross, indigo bunting, cardinal, and so on. The class with the highest score is the prediction.
Here 2.617 is the highest score, so the given image is predicted to be a cardinal. The patches match that class: we calculate the similarity scores from the patches, and the class with the highest resulting score is the output. So we can easily explain to anyone who asks why we came to this conclusion.
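The linear head described above can be sketched in a few lines of plain Python. The four similarity scores are the ones read out from the slide; the class-connection weights are hypothetical values chosen only for illustration, since the lecture does not give them, so the logits produced here will not reproduce the 2.617 on the slide.

```python
# Similarity scores read out in the lecture, one per prototype.
similarities = [3.954, 1.447, 2.6, 3.578]

# Hypothetical class-connection weights (rows: classes, columns: prototypes).
# These numbers are made up for illustration; the lecture does not state them.
class_names = ["black-footed albatross", "indigo bunting", "cardinal"]
weights = [
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.9, 0.0, 0.1],
    [0.3, 0.0, 0.1, 0.3],
]


def class_logits(sims, w):
    """Weighted linear combination of similarity scores: one logit per class."""
    return [sum(wi * si for wi, si in zip(row, sims)) for row in w]


logits = class_logits(similarities, weights)

# The predicted class is the one with the highest logit.
best_index = max(range(len(logits)), key=logits.__getitem__)
predicted = class_names[best_index]
print(predicted)
```

Because each logit is a plain sum of weight times similarity, every prediction can be broken down into per-prototype contributions, which is what makes the explanation case-based.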
That's all. Here we are not saying where the model is looking; we are saying how the model arrives at its answer. This is the important difference between the earlier, traditional methods and this prototype-based explanation.
First we process the input into a three-dimensional feature map with height, width, and depth, the H × W × D format. This is the latent space. Similarity is inversely related to the L2 distance: we want the similarity between two items, and the smaller the distance, the higher the score. So we calculate the squared L2 distance between the j-th prototype and every spatial patch of the input's feature map.
The next step inverts the mathematical distances into similarity scores. Then we generate a heat map and estimate the presence of the prototype. This is the logic behind ProtoPNet, and this is the formula by which we estimate the similarity score. In the formula, p_j is the j-th prototype, and the other term is the set of patches it is compared with, as I explained when describing how patches are mapped to prototypes.
Here we take only two items, A and B, and calculate how closely they are related to each other.
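A minimal sketch of this distance-to-similarity step follows. The lecture only says that the distance is inverted; the specific inversion used here, log((d + 1) / (d + eps)), is the one from the original ProtoPNet paper and is an assumption about what the slide's formula shows. The function names and the toy 2 × 2 feature map are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity(distance_sq, eps=1e-4):
    """Invert a distance into a similarity: small distance gives a large score.

    Assumed form: log((d + 1) / (d + eps)), which decreases as d grows.
    """
    return math.log((distance_sq + 1.0) / (distance_sq + eps))


def prototype_heatmap(feature_map, prototype):
    """feature_map is an H x W grid of D-dimensional patch vectors."""
    return [
        [similarity(squared_l2(patch, prototype)) for patch in row]
        for row in feature_map
    ]


def prototype_presence(heatmap):
    """Global max pooling: how strongly the prototype appears anywhere."""
    return max(max(row) for row in heatmap)


# Toy latent feature map with H = W = 2 and D = 2 (made-up values).
feature_map = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
prototype = [1.0, 1.0]

heatmap = prototype_heatmap(feature_map, prototype)
presence = prototype_presence(heatmap)
print(presence)
```

The heat map has one value per spatial position, so it can be upsampled over the input image to show where the prototype was found, while the pooled value is the single similarity score passed to the linear head.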
Training has three stages. The first is joint SGD.
Phase one is the backbone stage, that is, the CNN stage.
Its loss is the cross-entropy plus the cluster cost plus the separation cost; these are optimized jointly. Phase two is prototype projection: we take the training images and project each prototype onto a small patch from them. Finally, in phase three, the weight matrix of the last layer is adjusted. Positive weights are rewarded and negative weights are penalized, that is, filtered out. Based on the positive weights and high scores we can predict whether the image belongs to a particular category.
This is once again the detailed explanation of what we have done, with the same sparrow image. You can see a particular patch that is converted into a layer and compared with the different prototypes, followed by max pooling of the distance scores. It is the same diagram as before, only drawn in a more well-defined way, and the explanation is the same.
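The phase-one objective named above (cross-entropy plus cluster cost plus separation cost) can be sketched for a single training image as follows. This is a simplified reading, not the lecturer's implementation: the function names are hypothetical, and the default weights 0.8 and 0.08 are the values I recall from the original ProtoPNet paper, so treat them as assumptions.

```python
import math


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def cluster_cost(patches, prototypes, proto_class, label):
    """Smallest distance from any patch to any prototype of the true class.

    Minimizing it pulls some patch close to an own-class prototype.
    """
    return min(
        squared_l2(z, p)
        for z in patches
        for p, c in zip(prototypes, proto_class)
        if c == label
    )


def separation_cost(patches, prototypes, proto_class, label):
    """Negative smallest distance to prototypes of other classes.

    Minimizing it pushes patches away from wrong-class prototypes.
    """
    return -min(
        squared_l2(z, p)
        for z in patches
        for p, c in zip(prototypes, proto_class)
        if c != label
    )


def phase_one_loss(logits, label, patches, prototypes, proto_class,
                   lam_cluster=0.8, lam_sep=0.08):
    return (
        cross_entropy(logits, label)
        + lam_cluster * cluster_cost(patches, prototypes, proto_class, label)
        + lam_sep * separation_cost(patches, prototypes, proto_class, label)
    )


# Toy example: two latent patches, one prototype per class (made-up values).
patches = [[0.0, 0.0], [1.0, 1.0]]
prototypes = [[1.0, 1.0], [5.0, 5.0]]
proto_class = [0, 1]
loss = phase_one_loss([2.0, 0.0], 0, patches, prototypes, proto_class)
print(loss)
```

In a real model these terms would be averaged over a batch and minimized with SGD through an autodiff framework; the pure-Python version only shows what each term measures.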
This is the step-by-step architecture. First comes feature extraction, which can be done with either ResNet or VGG. The input is the image, and the output is a feature map organized by height, width, and depth. We are converting the input image into layers of features, and based on those layers we do the prototyping.
That mapping to the prototype array is the second stage. In the third stage, the similarity scores are passed through the fully connected layer to make the class predictions. The final output is in the form of logits, that is, the raw, unnormalized class scores.
Let us take an example: classifying a bird image as either sparrow or eagle. For your understanding, we represent the input patch as the vector (4, 4).
Next come the prototypes. Prototype one is sparrow, small beak, with the value (1, 2). Sparrow, short wings is (5, 1). Eagle, large curved beak is (4, 5). How these values are assigned depends on the weights and the training; we cannot derive them by hand, only by running the algorithm and studying the trained weights can we say which value belongs to a particular pattern. Similarly, eagle, broad wings has the value (8, 1). We are showing this in two dimensions, but the algorithm works in three dimensions, height × width × depth; for simplicity we show only two numbers. We compute the score with this formula, the negative squared norm of the difference, -||x - p||^2. Let us check how it is calculated, starting with (4, 4) - (1, 2). The first term, (4, 4), is the fixed one.
This is the input image, and these are the learned patches. For prototype one, (4, 4) is compared with (1, 2), and we get -13. Next is (4, 4) - (5, 1).
That is the short wings prototype, and we get the value -10.
Similarly, (4, 4) - (4, 5) gives -1, and (4, 4) - (8, 1) gives -25.
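The four comparisons above can be checked directly. This sketch assumes the spoken values are two-dimensional vectors, with the input patch at (4, 4) and the prototypes at (1, 2), (5, 1), (4, 5), and (8, 1), and that the score is the negative squared L2 distance. Under that reading the first score works out to -13, which is also what the class total of -23 quoted later in the lecture requires.

```python
def neg_squared_l2(x, p):
    """Score used in the worked example: -||x - p||^2."""
    return -sum((a - b) ** 2 for a, b in zip(x, p))


patch = (4, 4)  # the input patch from the lecture example

prototypes = {
    "sparrow: small beak": (1, 2),
    "sparrow: short wings": (5, 1),
    "eagle: large curved beak": (4, 5),
    "eagle: broad wings": (8, 1),
}

scores = {name: neg_squared_l2(patch, p) for name, p in prototypes.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
# sparrow: small beak: -13
# sparrow: short wings: -10
# eagle: large curved beak: -1
# eagle: broad wings: -25
```

A score closer to zero means a closer match, so the eagle's curved-beak prototype is the single best match for this patch and the eagle's broad-wings prototype is the worst.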
This is how the similarity is computed: similarity is inversely related to distance, as I said. A score of -1 means the patch is very close to that prototype, and a score of -25 means it is far away; after inversion, the close match becomes a high similarity score and the distant one a low score. A single close match is not the final result, though, because the class score adds up all the prototypes of the class.
Now see here: the beak is matched to prototype 14, which was trained for beak classification. The slide shows a class weight of 1.180 and a similarity score of 6.499, so this prototype contributes 1.180 × 6.499 ≈ 7.669 points to the class score. Other prototypes match the wings and this other portion in the same way. That is how the model is predicting: the given input image matches the learned properties, the beak size, wing size, tail size, and color, and the class with the highest total score tells us what the object is.
This is another diagram: physical reality.
With the actual latent patch we are not looking at the entire image; we check only one consistent portion.
Every prototype provided during inference is a tangible visual piece of historical evidence, not a hallucination. It is not a vague "somewhere in the image" answer: we can say that because of this particular portion, this is a bird. Every 10 epochs of training, each abstract latent prototype vector is replaced with the nearest actual latent patch from the training set that shares the same label. So every 10 epochs the prototypes are checked against the actual training patches. If a prototype is getting closer to a real patch we continue with it; otherwise we move on and compare against other patches. Comparing against only one image a hundred or a thousand times would be a waste of time.
If by the tenth round the result is not coming and the accuracy is very low, it is not worth concentrating on that particular patch again, so we go to another set of patches for the next 10 epochs. Through this training the similarity scores we get will increase.
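The projection step (replace each abstract prototype with the nearest real latent patch of the same label, every 10 epochs) can be sketched as below. The function names, the data layout, and the toy values are hypothetical, and the sketch assumes every class has at least one training patch.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project_prototypes(prototypes, proto_class, train_patches):
    """Snap each prototype onto its nearest same-class training patch.

    train_patches is a list of (patch_vector, label) pairs taken from the
    latent feature maps of the training images.
    """
    projected = []
    for p, c in zip(prototypes, proto_class):
        # Only patches from images with the prototype's own label qualify.
        candidates = [z for z, label in train_patches if label == c]
        nearest = min(candidates, key=lambda z: squared_l2(z, p))
        projected.append(list(nearest))
    return projected


def should_project(epoch, every=10):
    """True on epochs 10, 20, 30, ... (the schedule stated in the lecture)."""
    return epoch > 0 and epoch % every == 0


# Toy data: two prototypes (class 0 and class 1) and four latent patches.
prototypes = [[0.9, 1.2], [4.0, 4.0]]
proto_class = [0, 1]
train_patches = [
    ([1.0, 1.0], 0),
    ([3.0, 3.0], 0),
    ([4.0, 5.0], 1),
    ([1.0, 1.1], 1),
]

if should_project(epoch=10):
    prototypes = project_prototypes(prototypes, proto_class, train_patches)
print(prototypes)
```

After projection every prototype is literally a patch from a training image, which is why it can be shown to the user as visual evidence rather than as an abstract vector.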
Here the evidence is combined in a nonlinear order. You can see a prototype with a similarity of 8.91.
This variant replaces the flat linear head with a soft decision tree.
It shows which exact visual prototype triggered a false positive, enabling immediate targeted retraining. If the result is not what was expected, the model is retrained on a different patch. Here you can also see the rigid base prototype, where the array is mapped to one fixed location. Nowadays we deform the prototype into several parts, P1, P2, P3, and each one has its own offset values.
Each part has its own weightage: the wings have their own, this portion has its own, and so on. A single image has different patches, the patches have different weights, and we compare the weights across images. If even one part does not match, we train again with a different portion. Now look at the global branch. Earlier, only the mask of the image was shown. Here we create patches and bind them together, like a net of patches, and the mask layer is formed from them. Instead of drawing the outline directly, we form all the patches first, just as we give children small dots and then ask them to join the dots to draw the picture, which comes out much more accurately. In the same manner all the patches are bound together to form the outline. This is based on the vision transformer, and it helps eliminate noise: when we mask an image directly, background noise is also included, whereas if you mesh the different patches together you get the exact outline of the object. If you then produce the heat map, you can see that it falls exactly on the face of the bird.
All the distractions are automatically eliminated.
This is the important application: medical diagnostics. You can see here what we call the tumor region. The tumor region is not determined only by the outer contour; it has different perspectives, and the dimensions change, so we move from 2D to a 3D MRI image. Based on the masking, the image is converted into many patches, and finally all the patches are combined and a heat map is imposed exactly on the region where the tumor is present. This gives both high accuracy and explainability.
This patch has a high similarity score. Everything is bound together into one outline, and the heat map is finally imposed to locate the tumor region. This is not background scanning.
It is the netting of different patches. For researchers this is a very important task. For those of you doing research in medical imaging, this is the spotlight of the day. Instead of taking the entire contour region, filtering it, and striding over everything, the small region is divided into different patches and we estimate the similarity scores. The patches with good similarity scores are combined, the combined patches form a filter, and over that filter we impose the heat map, so that the prediction is exact and explainable. This medical ProtoPNet training is one of the important research areas in explainability, and papers based on medical ProtoPNet are coming out nowadays. So those who are working on medical images should shift their focus from prediction to explanation. That is very important, and it will also widen the scope of your research.
Now, going back to the worked example for a minute: -13 and -10 are S1 and S2, the two sparrow prototypes. These belong to the sparrow category.
Here you can see that -13 - 10 gives -23 for sparrow, and -1 - 25 gives -26 for eagle. Which one has the higher similarity score? It is -23.
So the image is a sparrow image.
Even though the eagle's first prototype scores -1, its second prototype, the broad wings, scores -25, and that is where the eagle lags.
You can see how it is classifying. The large curved beak prototype is a close match at -1, but the broad wings prototype fails badly. The sparrow has a small beak and short wings, and the input is reasonably close to both of them, so neither sparrow prototype fails as badly as the eagle's broad wings.
So the eagle's total drops to -26 while the sparrow's stays at -23. Based on this we say that the predicted class is sparrow.
Depending on the patches and the similarity scores, we come to the conclusion that the predicted image is a sparrow image.
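The class decision in this worked example can be reproduced by summing the prototype scores per class. The sketch assumes every prototype has a class weight of 1, which is what the plain sums in the lecture imply, and uses the negative squared distances -13, -10, -1, and -25 from the earlier step.

```python
# (class, prototype) -> negative squared distance from the worked example.
scores = {
    ("sparrow", "small beak"): -13,
    ("sparrow", "short wings"): -10,
    ("eagle", "large curved beak"): -1,
    ("eagle", "broad wings"): -25,
}

# Sum the scores of the prototypes belonging to each class (weight 1 each).
totals = {}
for (cls, _part), score in scores.items():
    totals[cls] = totals.get(cls, 0) + score

predicted = max(totals, key=totals.get)
print(totals)     # {'sparrow': -23, 'eagle': -26}
print(predicted)  # sparrow
```

The breakdown is the explanation: the eagle class holds the single best match (the beak at -1), but its broad-wings prototype at -25 pulls the total below the sparrow's.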
Here the explanation is given: even though the bird has an eagle-like beak, the poor match with the second eagle prototype weakens the eagle score.
The poor match with the second eagle prototype weakens the eagle score. That is what we have shown.
So ProtoPNet provides an interpretable, case-based explanation. It is also a faithful explanation: we can justify it visually by looking at the image, and it is suitable for high-stakes applications.
A high-stakes application means, as I just showed, something like medical imaging.
What are the limitations? First, prototype comparison. Instead of comparing a single image, we need to split it across 14 to 15 prototypes and compare against each, which can be somewhat expensive even with higher-end GPUs. With many classes and prototypes a scaling problem arises, since each prototype has its own dimension and converting them to a common dimension takes effort.
We may need to rescale everything. The next limitation is the quality of the learned features.
Suppose the training image was of poor quality and the presented image is of good quality. Then performance suffers, because it depends on how well the learned patches match the input. Finally, the method is mainly applicable to image data, with limited generalization: it is not well suited to very complex images. For ordinary images we can form patches and do this type of classification.
These are all the things about ProtoPNet. We have two more things to discuss, which I think will be covered in the next video sessions.
They are the counterfactual and the counterpart. With the counterpart, the explanation is simply "this is an eagle", or "your loan is rejected because your credit score is poor", and that's all. With the counterfactual, we give the same explanation, "your loan is rejected because your credit score is poor", and also a suggestion: maintain a higher credit score and a higher salary. If you increase your salary and keep your credit score as high as possible, there is a possibility of getting the loan cleared.
These are the two definitions you have to learn in explainable AI. Counterfactual means the algorithm gives the prediction as well as a suggestion. For example: if you do this, you will have less chance of a heart attack, so you have to exercise and improve your diet.
The algorithm itself gives these suggestions. The counterpart, by contrast, is only "due to this, you have a chance of a heart attack", and that's all. These two ideas need only a two-line explanation, but in the upcoming videos they will serve as a backbone, which is why I am explaining them on this slide. So please make a note of the two parts, counterfactual and counterpart. In the upcoming videos we will see a new technique and explain how hybrid AI is going to tackle these problems. That is what we have learned in this lecture.
Once again, thank you all for concentrating on this lecture. In the upcoming series I will show you another set of techniques that provide meaningful explainable patterns. I wish you all good luck, and sorry for the interruption. Here I am stopping the slides.
Once again, as a recap:
The topic is prototype-based explanation. What is its main use? Based on visual patches and their similarity scores, we can justify whether the applied input image belongs to a particular class or not. This is how prototype explanation comes into the picture and forms a transparent, explainable approach in the explainable AI domain, turning a prediction into an explanation. Thank you for watching this lecture. Have a nice day.