This session explains how AI agents can overcome their inherent limitation of forgetting information across sessions by implementing memory systems and Retrieval Augmented Generation (RAG). The four types of memory include in-context memory (within a chat session), external memory (stored in databases), episodic memory (decision-making processes), and procedural memory (learned patterns in model weights). Embeddings convert text into numerical vectors to measure semantic similarity, enabling semantic search over vector databases. The RAG pipeline involves loading documents, chunking text, embedding chunks, storing in vector databases, and retrieving relevant information to augment model responses. Chunking strategies include fixed-size, sentence/paragraph, and semantic chunking. Advanced RAG includes re-ranking and multi-hop retrieval. Key frameworks include LlamaIndex, LangChain, ChromaDB, and Pinecone. RAG failures can occur due to inappropriate embedding models, large chunk sizes, outdated data, or retrieval issues. Safety considerations include data sanitization, citations, and freshness awareness.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Memory, RAG & Knowledge Systems for Al AgentsAdded:
Hello everyone. How are you doing?
So are you guys ready for today's session?
So today we're going to do our third session on agents for India track and we are joined with Sohham from the smartly team who is going to explain everything that you all need to know to contribute to be a part of the agents for India track in girls script summer of code. So just write ready if you are guys are ready to begin with today's session.
Show us how you are excited today for the session three.
Spam in the chat guys. Spam it.
So good to see so many of you joining for another session. Now I'm going to hand it over to SOM to take forward today's session. So the stage is all yours.
>> Thank you so much.
So let's begin everyone.
So today's session three is going to be all about memory rack and in audit systems. In the previous session we studied about what is the agent, what are the tool callings is the and what are the function calling and how the agentic loop works.
In this session we are going to dive deep on memory rag and how does a knowledge system works inside AI agents.
In this session, we will understand why every model tends to forget their information whenever we talk or give any assignments to the agent and how we can persist this information across all the chats or in a single chat as well.
Let's move ahead. This is going to be our agenda of today's session. There are going to be four gates that we need to clear up. First gate is why memory?
What are the problems that we need to tackle so that we can add memory to it.
Next is the vector databases. What embeddings are? What are the semantic searches and what frameworks or the models that you can use for the vector databases. Next is the rack deep dive.
In this gate we are going to dive really deep into what is retrieval augmented generation that is rack. In this we will dive deep into the entire rack pipeline.
What are the chunking strategies and what is the difference between uh knife rag model and what is the production rag model that is the advanced rag model and so on and so forth. Finally we will see and try to build and shape a model and understand what are the examples that you can build upon going forward. This is the first gate that we have encountered. It says that the model tends to forget everything whenever a session ends. Let's take an example. you have a chat GPT API with you and you are calling the API using your any Python code. Say for example, you asked a query that hey my name is Rahul and I have this and I have this Python error that I'm getting. Can you please solve this issue? And then so on and so forth. The model solves the issue. And next time when you call the same API again and you ask a different issue which is linked with the prior issue that you came across and at that time model tends to forget model does not remember what was the earlier issue and in this way we say it as the model tends to forget everything because model doesn't have a built-in memory with it.
Now when we add a memory model tends to remember everything. What was the factual things that have happened when and what decisions I took and what was the chats I have been in conversation with the users so on and so forth. So chat over here have you understood what the basic sense of memory is required for the model?
Type memory if you understood over here and if you did not type in LOL with the query.
Spam in the chat people. Spam it.
This due to this exact problem we tend to add memory in our language models and in our agents.
This is the core reason.
Awesome. So let's move ahead with the next part.
Now here we have four different types of memory and basis on this scenario we tend to use them. First is the in context memory. Let's take the example of chat GPD. You open a new chat. You type in your query. Now as you go on typing your queries again and again so on and so forth, the model tends to remember what what you have asked asked in the same chat for the first time, the second time and the third time and so on and so forth. It remembers everything.
But as soon as you click on new chat, it does not remember anything.
Now this is called as the in context memory. It only remembers what you have been chatting in the in a particular chat window. This is called as in context memory.
Next is the external memory. We try to add some databases like PL/SQL or SQL which has a particular data type which we will study in the upcoming slides what exactly that particular data type is. But what it does is when you have opened a particular chat of GPT or any other uh tools, it stores the factual data inside it. And next when you open a new chat, it tries to store the factual information of that chat as well. So whenever you switch your chats or you click on a new chat, the model tends to remember what you have been conversing with the model in the previous chats and so on.
Next is the episodic memory. Let's take an example of cloud code. I have asked the cloud code that hey I am getting a python error. This function is not working properly. Now the now firstly cloud code will take a look at my environment or what python files do I have? What is the terminal logs that I'm getting and so on and so forth. It will take the appropriate decision and bases the decision it will fix the issue.
Now the the decision which takes as open the file, read the file, check what is the function declaration, implementation hierarchy in the Python and so on and so forth. It will fix it. Now this decision becomes the memory for the chat. Now this is called as episodic memory. What decision it took? What was the planning?
What was the action did it take against the planning? This set of pipeline is called as episodic memory.
Next is the procedural memory. So during the training of language models such as t GPT or claude opus, it tends to remember certain things. For example, if I'm training a language model on the on entire uh on some religious books and when I'm fine-tuning the model or training the model, it tends to remember in its weights. We all have been studying that machine learning model and a deep learning model has something called as a weights. It tends to remember some sort of information in the weights. That is why when we ask any information any any information in depth to the chat GBT, it tends to remember certain generic information about it. That generic information is stored in the model weights.
So these are the four kinds of memory that we can have for the model.
So up till here if you have understood can you type in memory and if you did not understand type in lol with the specific query that you have.
You can hickle me up in the middle and ask me all of your queries that you have. We will go on to the previous slides or I can repeat the slides once again if things are not clear for you.
Uh PPD is in already full screen. Uh you can enlarge the video. The max is this completely visible now.
Okay, I can go again with the episodic memory once again. So let's take the same example again. You have got an error in the Python function. You have written a code and when you tried running the Python program, it gave you some error. And now you copy the error code and and what issue that you are facing into the cloud code which is a tool. And now firstly before diving deep into fixing the code it will first check what all the Python files does it have and what is the error logs does the user is getting and then it will plan out how to fix this particular Python function that the that it has the error and once the planning is done in the planning firstly it will think that hey I first I I need to open the file I need to check where this function in which particular line this this Python function exist and then I'll check where was the function declared when was the function reused and I'll check this hierarchy this type of planning it's called as decision making see how I'm taking decision one after the other it is called as chain of thoughts this chain of thoughts is called as episodic memory like whatever decision I am I am taking right now it becomes a part of the memory and then when this decision is clear for me I'll take the action. So this decision for a particular act for particular planning X I did Y and this is resulting as zed. So this complete chain is called as episodic memory.
Now is this thing clear?
Tell me in the chat is this particular part of the episodic memory clear for everyone?
Cool. Let's move ahead with the next part.
Next is what is the embedding? We all have been hearing that hey this model has a very good embedding and or or this document can be embedded very quickly or or or this 5.5 model has a very good embedding and and it can read your PDF very well. What exactly is the embedding?
Well, we all have been came across a particular terminology that is semantic or similarity.
Let's let's take an example that uh you have read something about a refund policy in a particular terms and conditions of some organization refund policy and now you have some issues with the organization and you are asking hey how can I get my money back?
Now think about this. There are two sentences refund policy is just a phrase refund policy. The meaning of refund policy and the meaning of how can I get my money back. Can you tell me does the meaning is kind of similar to each other?
It's kind of similar, right?
So to understand how these sentences are similar or how these sentences are not similar or how much far away they are, we use a mathematical model called as embedding.
That is why we use embedding because model cannot understand the meaning behind a particular sentences. This is a human uh psychology to understand a particular meaning behind it and to comprehend the meaning and take particular actions.
So model doesn't have a consciousness to do it. So what we do is we convert a particular sentence into minus1 to +1 range. Let's take a example over here.
The cat sat on the mat. This is a particular sentence. Then I'll convert this entire sentence into a particular list of values. This is a Python list over here. It can be it it can be converted into any particular array for Java or C++ whichever framework that you are using.
Now this will be converted into the values between minus1 and one and then it will check that hey is dog similar to puppy or dog is not similar to invoice. Now just like our previous example that refund policy and how can I get my money back. Now the now the calculation between how these things are similar or how far away they are it can be done very easily when you embed the sentences.
Is this part clear for everyone type chat?
Yes. And if you did not understand lol with the query there are certain algorithms by which uh you can check about how close they are and and it can be BM25 which was invented by Google or you can use cosense similarity which is used by perplexity AI today.
So what you do is when you have the embedding you try to put it put it into a graph or to a cartian plane and you check which embedding is is close to the other embeddings and if they are close you try to fetch them that hey these two embeddings are similar we fetch them.
Imagine it in this way like there are GPS coordinates on a map and if if certain places are very close to one another that it it formed a cluster and those things in the cluster the entities which are closer to one another those things have a very less embedding spaces between them And we can classify it as they are very similar and they are same. So in the GPS coordinates when we have the certain chunks we call it as hey this topic is is can be a chunk one this topic is chunk two and when these chunks are far away we tell them that hey these two chunks are very far away and they do not share any similarity between them. These two are very different topics.
So is this analogy understood by everyone? Type yes in the chat and lol with the query.
Encoding is converting a particular text into a different form which can be understandable by any medium or the interface that you are using. Embedding is just used to perform calculations onto the sentences to check how much similar they are.
Okay. You want me to repeat again the GPS coordinates example?
Okay. Cool. So, so let's take an area of a land and suppose there are certain houses which are close to one one another and there are set of houses which are far away from the first cluster and what I can say is one house is one particular sentence and the second house which is closer to the first one is this is the second sentence. Now the distance between these two houses can tell us how similar that they are to one one another and and when this cluster is far away than the second cluster we can tell that these two are two different topics they since they do not share any similarity or if they share similarity then these two clusters have been a one single cluster.
So is this example clear for you?
Type yes in the chat. And if you don't type LOL with the query.
Cool. Let's move ahead.
Now next topic is search by meaning or by keywords. So before this we we have been developing some chat bots on our own and whenever we try to interact with our chatbot we say that hi hello hello and then we hardcode the hi hello hello in such a way that whenever the user types as hi we reply with the hello hi there and whenever if the reply comes up as hello then we return the response as hey there So we tend to hardc course among the things in the chatbot.
So we should never obey this particular law. This is this is because as in dictionary we have millions of words. So if you create the hard coding for each and every word of the of the dictionary then the time complexity for the retrieval becomes very huge.
So what you do is we have the embedding and embedding can tell us how similar does this sentence with the other one is.
So this reduces the time complexity and it and it also fetches us the appropriate result that is required by the model.
So this is the reason behind why you always go by the meaning and not by the keywords.
And you have the couple of options that you can use embeddings to store in is the chroma DB is the pineon DB and is the F and the EG vector. And you can check out that Python is used by notion AI perplexity and the F is the open source model by the Meta AI search. And on the right hand side you have the Python snippet by which you can embed a particular document.
You can see that in the docs dictionary uh sorry the docs list you can see that you have couple of sentences.
In the upcoming slide we will understand how this set of document can be converted into a embedding. Right now we have just taken an example of three sentences from the document that the refund text 5 to seven business days shipping cost and so on and so forth.
The the return is 30 days.
And next we take the query from the user that how long the refund will take. And since this query is very similar to the first doc sentence that is this one the refund takes 5 to 7 days. We have our answer right away. The model will directly give us the answer instead of just returning any of these other embeddings and so on. Now since this sentence is very close with a a particular sentence over here.
So in this way the model gives the written similarity result to the user.
So is this particular topic understood for everybody?
If yes, type yes in the chat. And if you did not understand, type LOL with the query.
Is this part clear for everyone?
Okay. So, let's move ahead with the next part is the rack. What exactly is the rag? Let's take an example that during our school time whenever our teacher ask us question and we may or may not be paying attention what the teacher was teaching and when the teacher askked us to stand to to tell us the answer. We fumble for a couple of words and we we try to act that we are thinking and we just uh send we just spell out some couple of words which may or may not similar to the answer that the teacher was expecting and suddenly our benchmate give us the exact answer in a in a very silent manner and we try to just flip up these words and change the grammatical format into active and pass in the active and passive voice and we tell the same answer which is given by our benchmate to the teacher and the the and the teacher says that hey this is the right answer. So this flow of information is what exactly happens with the rag.
So is this example clear for you? What exactly is the rag happens with the language model? We don't train these models again on our on our data.
Is this thing clear for everybody?
Okay, cool. So let's go back to the same information about our refund policy is that so if I have my organization on my own and there is some refund policy which is completely completely different just like in Zomemetto we have we have coupons format or we can just add a feedback and when I when I just type into chat GPD that hey uh this is the food from the Zomato and this is not what I like what is the refund policy so it will just give vague answers for this. As we have seen in the first lecture that model will try to say anything but the information the model said may or may not be true always. So with respect to my organization this refund policy is not applicable. It is completely different than what the model says. Over here to give model a definite information about my organization's refund policy. I try to build a rag model. It is just like creating our benchmate with the right set of answers and you can take it you can take you can visualize teachers as the users who is asking questions to the model which is us making a stand and asking the questions that hey what is the refund policy and then silently our benchm says hey this is the blah blah blah refund policy and then so on and so forth the model formulates the grammar for you and says what is the refund policy.
So is this overview of rag that is retrieable augmented generation clear for you?
A web search is a feature and web search uh the flow of information is different but it can be called as a rag.
As long as you are enhancing the information of the model that is you are adding uh more in-depth points rather than the generic answers from the internet live or you have in the database it is called as rag.
Yes, rag removes the hallucination.
Absolutely.
If you have understood this particular part, type in rag and if you didn't, type in lol with your query.
Yeah, go ahead. What do you have?
Yes, that retrieves the data from the database and it it enhances what the knowledge about the particular topic. topic the model has and it gives the output to the user.
Yes, notebook LM is the highest form of version of the rag.
Uh second one, what is the second one?
Can you give out the term and I can repeat it again?
Yes, it uses rag.
Well, when the model gives the answer of the prompt and sometimes not, uh it doesn't comes under uh rag. it can come under a customized system prompt. In the system prompt, you can configure what type of queries you can ask to the model and and which queries not. So over here the classification of uh the intent classifier of the prompt comes in when you write the instructions in the system prompt.
Well uh the working of notebook calendar we will be covering in the next session about the production design patterns of the agents over there we would we would be covering that.
So if you have any queries type out and if you don't we can move ahead with the next part.
Yes, rag is a very fundamental concept.
You can you can add web search databases to give accurate answers.
Come on, let's move to the next part that is the complete rag pipeline from very start to the very end to the answer to the teacher. Now the first step is load. Whenever you have a set of data in the form of PDF, in the form of markdown, web pages, database records or any uh Excel data, it must be in a form of words.
It should be a text.
Now you divide this text into small pieces.
uh you can chunk this data into paragraphs or some first 50 words then next 50 words and so on so forth the next 100 words basis on your application you chunk you divide the piece of the information into smaller chunks and then you embed this we can use any embedding models basis the usage uh if we are having some medical records that we can use bio and if you are having some legal documents you can use legal BERT but the BERT is the fundamental model to embed any textual content I'll repeat again whenever we have a document in the form of a text we divide this document into simple independent small documents and then we take a particular chunk we convert this chunk into embeddings.
Then we store this embedding into database. And what this database does is does it indexes all of the vectors.
So this part of indexes is done only by the database itself. And next is is the is the retrieval model. Now in the retrievable model we tend to check which particular sentence is similar to what the user has given and so on and so forth.
Is this part completed for everyone?
Have you understood this five steps before we jump onto the generate step?
Okay, cool. So, moving on to the generate part. What it does is when you retrieve the all the similar embeddings that you have say for example there are top three similar sentences that you have got for the refund policy and then whatever answer that I have got I'll send all of these three answers top answers to the language model and now basis my basis my system prompt I'll formulate my answer to the user appropriately this is how the gap between in the retrieve and the any generate is handled upon. Yes, database will become so large.
Is this complete pipeline understood?
Type yes in the chat. And if you didn't, enter lol with the query.
This is exactly same as our our benchm trying to tell us what is the exact answer and then we are formulating our own grammar. This is the chat and then we are guessing the answer and attending to the teacher and then then the teacher telling us hey is this right or wrong. This is the entire pipeline.
Okay repeat once short do it. when you have uh huge data of your organization, let's take about the refund policy. I have a document of a refund policy. Now, in order to put this entire document into my database, I'll divide this document into three to four subdocuments and then I'll convert this sub smaller documents into embeddings.
I'll use some small LM version two. We have various different embedding models. I'll just use any one of them and I'll convert this sub smaller documents into embeddings and I'll convert this embeddings and I'll store this into a vector database. And now what this database does is it will index all of the chunks so as the retrieval is faster.
Over here the job for indexing is done.
Now the next thing is whenever the user ask a query that hey when I'll get my money back. Now this sentence is kind of similar to the to to the policies of the refund for my organization. Now this user's query I'll check what is the most similar sentences in the in the database I have and I'll get suppose four to five similar sentences with the query that the user has entered and I'll get this particular chunk retrieved from the database and I'll send to the language model and this language model has top three similar sentences from the from the database and now this language language models will twist and turn the grammar of the sentences in a in a particular human readable format and it just gives out the answer.
So is this particular pipeline clear for you?
Cool. So let's move ahead the next part.
So what exactly happens inside the retrieval step in the retrieval step over here the step five. What exactly happens inside it? How do you retrieve these things?
Now we have various different retrieval algorithms in it. Now the most famous one is the cosine similarity. Now in our 11 to 12 standard we understood about uh some scalers and vectors and we also understood about the dot product and the cross product.
I hope this thing is remembered by everyone.
What is the dot product and what is the cross product?
So to find a distance between two vectors we use dot product. This is the exact same what we have studied in 11.12 standard. The exact physics is used over here.
The same mathematics to find the distance is used over here. Same the vector which we have studied in in our 11th standard is the same vector over here. It's just that you are converting words into vectors. That's it.
So the most famous similarity search algorithm is the cosine similarity. It tries to search and find what is the angle between the vectors. If the angle is less or closer to zero then we can say that hey these two embeddings or the vectors are same and if the angle is obtuse angle it is greater than 90 and it is far far away than zero then we can say that hey these embeddings of vectors are not so close and one of the greatest example is cosine similarity is used by perplexity we all have used perplexity or spam in the chat. If you have used and experienced the perlexity, how accurately the factual information it gives, it is a very good information and example of a rag.
Awesome. Now you must be wondering why only we take the first k similar embeddings or first k similar sentences.
Now we try not to confuse the language model. It is such that your benchmate is trying to tell you all of the points of the answer into one single go. Now our now our teacher is expecting some five points and suppose your benchmate gave you some 20 points and you got completely confused. Hey, which point should I say first or which point is accurately right as per the teacher. It could be the second point, it could be the third point. Now in the same way model will get confused.
So that's why we take the first K from 3 to 5. Five is the five is the least sweet spot and we generally take case two or three most similar sentences and we then deliver this similar sentences to the model.
Is this particular concept of sweet spot for why K is equals to 3 and 5 is often chosen by the model and then it is fed to the model by the algorithm.
Is this particular clear? Why do we always take three, two or even one in some cases?
Yes, even in at GPT we have web search as well because in GPT 3.5 earlier it did not had any web search from GPT4 it came up.
Now let's move further in the context window injection. Now whenever the teacher asks the question during the conversation between the teacher and the student to whom which the question is asked the benchmate tries to prompt between what the teacher is saying and the interaction that is going between the teacher and student. In a similar way we have the chunk one, chunk two and chunk three. And here is our user's question. And then we embed this chunk in between the processing of the model.
And then it formulates the answer and by by formulating and fixing its own grammar and then delivering the answer and so on and so forth. Next is the reindexing problem. Say for an example uh my refund policy is now changed and the older refund policy uh the terms and conditions is same but I have some new pointers to add in my in my refund policy. Now what I'll do is I'll try to rembed the data in the database itself. I'll follow the same structure which we have seen the same six the same six milestones I'll do it again for the new set of data now over here the database needs to reindex so as to refresh the data is it is because when our friend that is our benchmate has older information about the refund policy then the answer becomes completely wrong then the teacher may punish us bases the answer that we have given which is the old refund policy and now whenever we have a new information and we tend to embed this new information into a database it needs to reindex and when we reindex our our database entirely it gives us the fresh new information whenever we give some queries to the model so is This complete four parts understood what are the retrieval steps.
If you understood type retrieve and if you did not type lol with the query.
Well, when perplexity has an understanding about hey this is the complete factual information about a particular query that the user has entered then it shuts down its loop that hey now this information is complete I'll give it to the user now chatgility is trained in such a such a way that it should be very useful and very friendly to the model and you might have seen that the chat GPT is basic problem, basic introduction, then next one to two sentences in a deep and then last sentence would be shall I explore some more information about this or not. This is the hierarchy of the uh presentable data that has been shown to the user by the at GPT.
Is this thing understood for you?
Yes, we will share it at the end not right now.
So let's move ahead with the next part is the how do you chunk a piece of document before embedding into the databases.
Now suppose at my organization I have a document about refund policy. Now my refund policy is approximately thousand words.
Now we have three choices either go by fixed size chunking or by sentence or paragraph wise chunking or semantic chunking. In the first fixed size I'll just take uh first 100 words as the first chunk. Then from 101 word to the 200th word as the second chunk and so on and so forth I'll have total 10 chunks and I didn't pay um pay much attention that whenever I chunking from 100th word to the 101 word there is no meaning that is being cut off from the first chunk and the second chunk I'm not paying much attention to this. I'm I'm just moving my window of the fixed size all the way to the end and I'm getting this 10 chunks and I'm embedding into the the database.
Next is the sentence or paragraph in my thousand words document. I might have some um I might have some paragraphs.
Say for example, I have total five paragraph. And we all know that the meaning of a paragraph stays intact in it until until and unless there is a slight connection between the two paragraphs and I'll just take paragraph as a one single chunk. If there are four or five paragraph then accordingly four to five chunks will be created. Next is the semantic chunking. Let's take the same example in the in the five paragraph refund policy document. If the first three paragraphs are same or trying to convey the same meaning with much in-depth information then I'll create one single chunk of the three paragraphs and I'll try to see the next three or next two chunks have the same meaning or not. So basis the meaning bases the sentence or paragraph or bas is the window size I can chunk all of them. So is this chunking strategy understood? If you have understood type chunk and if you didn't press or type LOL with the query, we have a long way to go people.
So in the next we would be going slightly faster and I would be super happy if you have been really cooperative with the uh queries doubts and with the spamming in the chat.
Okay. In the semantic chunking, say for example in my refund policy document, I have five paragraphs and say for an example in the first three paragraphs, they try to show the same meaning about the refund policy and then and the next two paragraphs is giving out a different meaning or different domain of the domain meaning about why a particular refund policy has it. And the first three paragraphs are kind of terms and conditions of the refund policy. Now I'll create a one single chunk of these three paragraphs since they have a same set of meaning that is the terms and condition of the refund policy and next two paragraphs they just they just talk about what exactly is this refund policy and so on and so forth that is completely different meaning from the first three paragraphs. I'll then create a chunk one of the of first three paragraphs and chunk two of the later two paragraphs bases the meaning. So is this clear? Why chunking?
Why do we use semantic chunking in this particular example of refund policy?
Yes, to make the message very clear, right?
So, let's move on to the next part that is the naive rag and versus the production or the advanced rag. In the nive rag, it just follows a very simple simple flow. Embed all the documents. Ask the user what the query is. Then convert the user's query into embeddings. Trying to see what are the top three similarities sentences of the user query from the from the database. Fetch all the three similar embeddings and then share this information with the model. And this is the end of the entire life cycle of the knives rag.
Next is the the advanced rack. What it does is whenever you retrieve the facts top three facts it it tries to recheck is this three facts exactly the facts that the user wants to see or not.
It's like the answer given by the user which is being shared by our benchmate.
Then this then this teacher will try to check does the answer given by the student is right or not. This teacher will try to go inside the textbook and and the teacher will surf is this answer completely right given by the student or not. This is called as multi hopping. And then whenever I'll have a new set of document and I'll just re-mbed the document so as to create more embeddings into into the document and since then I have a new information.
I'll try to reankindex the database so that I can get the best top three results whenever I'll have a query or not. So this is the basic difference between the production level rack that is the perplexity and is the and the next is the knack which is we often uh create for our projects and so on. So is this a subtle difference understood of naive ra and the advanced rag?
Yes. Rag gives the accurate answers when you create some chunks.
We will dive more deep into the advanced rack and how do you actually think and build your own advanced rack into the next session. So this is the couple of concepts that is used by the advanced rack.
If this is clear uh shall we move ahead or if you have any queries type in lol with a query.
Let's move ahead the next part that is dag is not just a feature.
Many people tend to have an understanding is that since you give more fine information of a product or your organization's refund policy, we tend to see it as hey this model has a very good knowledge of your refund policy. We might be wondering that this model might be fine-tuned on your database, but it's not the scenario.
Well, you are just appending the right set of information before the model throws its generic answer. It's just a wrapper built around a model.
Well, is this kind of wrapper of the rag understood? What exactly is the rag? We tend to see it as uh rag is the next step for fine-tuning a language model.
Next is the the knowledge graph. This is far more better than the rag model is that.
So let's take an example is that uh let's take example of a person Rahul who is currently working at Smartly and is a part of GSO 2026 and Rahul is building a Python app which uses fast API. Now you can see the relation between the smartly JSO Python app and the fast API with the Rahul how this information has a hierarchy now. Now there is no embedding over here as of now. We will have embedding later on.
So is this graph completely visualizable in your mind by you?
So you can actually see how this information is kind of linked with your own set of document. So it's like the same information as the refund policy. Now if is there any fact with the refund policy connected with your items in the bag or or whatever you are purchasing. For example, in blinket uh on chips or some uh chocolates, we have the refund policy.
But if the item is open like uh any soft drinks or if if you opened it, then refund policy is not applicable to them.
So this kind of interlinking with a particular document we call it as the knowledge graphs.
So is the concept of knowledge graph clear with everyone? How the set of information which are not similar to one one each other is kind of linked with the same document.
Is this concept clear for everyone?
We try to create relationship between the information.
Cool. Let's move ahead with the next part that is what is the memory life cycle.
Now we have couple of steps over here for the entire memory life cycle is that so there's a difference between a memory and the rag document. Rag document is not a memory. It is a document created by the user for the model to give enhanced information to the user. Now memory we have seen in the previous slide is that we have four types of memories and the model itself creates a creates its own memories just like how our brain tries to remember some sort of information of an accident or some or some incidents and so on and so on and so forth.
What the model does it it first observes what is the data given by the user and it tries to extract some facts just like our brain observes the situation and tries to check which all facts it needs to remember what decision it took what task it did to to go ahead with the goal and what are the failures that I have encountered and so on and so forth the actions I'll try to take out what what are the important decisions actions I made and then I'll score to a particular action.
For example, if I'm drinking a water, I'm just there are two ways to drink a water. One is just by getting my hand closer with the mouth and the and the second way is getting my hand around my head and then getting closer with the mouth. So there are two different ways and I'll try to give scoring with this two ways and the one with going around my head and getting close closer to the mouth I'll give less confidence score less freshness because it is tedious and I'll do once in a month or or two and the and for the decision I'll give the maximum score is the simple one by just get bringing the class to closer to my mouth. This is how I'll give a score. The scoring algorithm will differ basis your product or what feature are you trying to design and what goal that you will achieve.
Next is store. We tend to use vector databases to store the relationships or the knowledge graph of the information how diversely it is linked with a particular set of data.
Next is the retrieval. we tend to retrieve the data filter for a particular user or any permissioned user or what similar data we have or if there's any huge hierarchy of a information and if I'm just a student and and I should know a particular chunk of a densely hierarchy of data and this is the student I'll only need to see a small chunk of the entire data so this retrie retrieval this this filteration of the entire data and then showing to the user comes under the retrievable chunk. Next is the update.
Whenever there is a new information or new document coming up then you need to update all of the relationships in the knowledge graph. You need to reindex everything so on and so forth falls under the update graph.
So is this life cycle understood of a memory? This is not a rag. This might be very similar to the process of a rag.
But this is a memory. You are trying to store a pieces of information of an incident into a database. It is completely philosophical. It is same as how a human brain stores a piece of information.
So is this thing clear? Type retrieve if you have completely understood this topic and type lol with your query if you did not get this particular slide.
No, you will always use vector databases because in posgress pro posgress SQL how are you going to store the data? If in if in posgress you have uh vector as a data type then you can surely go ahead with it because pine cone and chroma they are they are optimized for retrievalss they have built-in cosine similarity BM25 algorithms and so on and so forth because in posgress SQL you you need to write your own retrieval logic so it's The difference between the chroma and the PL/SQL understood postgress SQL spam retrieve everyone because we have already exhausted our time and we have couple more slides to complete.
So let's move ahead with the next part that is when does a rag fails.
Now you must be wondering that hey the refund policy given by the model is accurate. I have the find data into the database but still why is this thing completely failing?
Now there are certain parameters basis which this this thing can fail. Now basis on your goal. Suppose you have a prescription data or some medical data and you are using the wrong embedding model and you have used the traditional bird model and it has given you the embeddings. But when you insert this embedding into the database and you have the most optimized search algorithm to do it but even after getting the top three or top two results the data retrieved from the database is is not quite accurate. Now the fault is due to the embedding model you have used. Now basis the usage or the goal you need to tackle you need to choose the appropriate bird model for medical purpose you can use biome bird or for legal purpose you can use legal bird and so on and so forth well you can fine-tune a bird model as well there are some kaggle uh notebooks jupyter notebooks to fine-tune the bird model based on your use cases so if I have a refund policy document which is approximately of thousand pages then I can fine-tune the bird model for my use case and and it can give me much more better embeddings than the traditional bird model.
Next is the chunks are too large. So when you have a thousandpage document and you create uh one page as a chunk and one page has approx approximately around 1,000 to 2,000 words. Well, since chunking is small, the embeddings uh related to the chunks will create it in such a way that it will create a particular domain. Such a way that if there's a relation between the page one and page two, it will never get captured into the embeddings and you will tend to lose information which is on the page one and on the page two.
So this is the fault of chunking. So you can use semantic chunkings or some relational chunkings so on and so forth to tackle this information.
And similarly when I have my refund policies changed but I haven't refreshed my database with the new information then the model will use the data which is in the database itself and it will never give the new answer which is expected by the user. And whenever we have the new information, the database should try to reindex and rerank the entire data it has since since the data is now refreshed.
So do you understand what are the four points for the model to fail of the rag? If you have completely understood about this, type in rag. If you didn't, type in lol with the query.
Awesome.
Let's move on. Let's go with the next part is when rag needs a safety and why do we need a safety check for the rags?
For example, in my refund policy document, there are some sensitive information. I have my password or if I have my private key of a bitcoin wallet which has some,000 to 2,000 bitcoins in it and this information is in my document. So whenever the user tries to query my platform, this sensitive information will be shown to the user directly. So this indeed compromises my entire data.
So you should be always careful what information is embedded into the database. So you should always check the documents of a particular domain for now as refund policy before creating and embedding in it. Now whenever the model uh answers to the query of the user, the model should always give citations. What is meant by citations? It will uh it will add a link stating that hey this is the sentence from the entire document I have got which answers your query.
So this is called as citations.
And next is the freshness aware.
Whenever there is a new data adding into the knowledge graph or a database, the database should always reindex and reranking so that the freshness stays alive and the user gets the new information.
Next is the groundless. Groundless citations are the same. It is it says that whatever information is answered to the user the same exact meaning should always present in the database in the retrieved context and so on.
So is this particular steps understood when you are building some rag applications. These are the step steps that you need to be aware when you are building the the rag application. This is from a security point of view.
Is this part completely clear for you?
Type in yes if you have or if you didn't type in lol with your query.
Awesome.
Let's move ahead with the next part.
Next is since you have understood the complete pipeline of a rack and how to build it. Now next is what are the right set of frameworks that you need to choose to build a rack.
Next is the llama index.
This will create you indexes. It single-handedly handles the loading, chunking, embedding, indexing and querying of your entire document.
Next is our famous library as lang chain. It has any number of tools and it's using this function you can attach any number of databases. You can add chromb, pine db, pl/sql, pgsql and so on and so forth. Next is the chromb.
Now chromadb is generally used locally when you are going to develop this rag model and you will be switching to pine cone which which under the hood uses the chromb itself and the funny thing is pine cone is is developed and open sourced by Google.
Next is the memo which is the plug-andplay agentic memory. You just need to drag and drop your files and just like llama index it handles the loading, chunking, indexing, embedding, so on and so forth all by itself.
So let me know which framework are you going to use to build your next rag model and here are some set of ideas that you need to take and you can build your own rag rag models on top of it. These are some fundamental rag models which you can build as a side project, pet projects and so on.
So with this I conclude our session. If you have any queries drop it in the chat. I'll happy to answer them.
If there are queries, shoot your queries in the chat. I'm happy to answer them.
Yes, in M you can use the rag. Rag is just a concept. It is not limited to any frameworks or so on. In m uh langchain and langraph has a distribution in javascript typescript as well. So you can use it.
If you have any basic queries, feel free to ask. If you are skipping your queries then it will came up in the next track and you will be left behind.
More and more simple queries you ask the more concept clarity you will have.
Well, the contributing guide will be covered in a different session. Not right now.
Well, if there's no queries, then we can definitely wrap up this session. So, if you have any queries, feel free to type in the chat.
Well, when you build any projects, all of your concepts will get clear. Uh when you are building any rag applications, all of these concepts will be used.
We will dive deep into how you can think and how you can create your rag and using some production level design patterns into the upcoming session.
So if there are no queries then we can definitely wrap up with the session three and in the session four you will understand about how you can think to build a drag application. What are the set of design systems that you can use?
what pattern we need to take into consideration while building the rag applications or the most importantly the agentic applications.
Yes, in the next session there will be a guide about projects how you can build it.
So with this uh let's wrap up with our session three.
Thank you so much everyone.
Related Videos
OpenHuman VS Hermes AI: Who Wins?
JulianGoldieSEO
285 views•2026-05-29
Long-Running Agents — Build an Agent That Never Forgets with Google ADK
suryakunju
142 views•2026-05-30
This computer is made from real human brain cells. And you can buy it.
Talktmsmedia
3K views•2026-05-28
BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2
aimmediahouse
122 views•2026-06-03
I Made the Same Anime Fight Scene in Every AI Video Generator
NobleGooseAnime
295 views•2026-05-30
Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S
cnnnews18
3K views•2026-06-01
I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)
AICodingDaily
298 views•2026-05-29
3D Platformer Update - NO CAPES
SolarLune
294 views•2026-05-30











