Agentic RAG systems can be improved by giving the agent a tool that looks up the answers to specific FAQ questions, rather than embedding every answer directly in the context window. This avoids overflowing the context window while letting the agent retrieve precise answers on demand, and it can be combined with vector database memory and short-term memory for comprehensive question answering.
Deep Dive
Taking Agentic RAG to the next level
I love agentic RAG rather more than traditional RAG. And in this short and sweet video, I'm going to show you exactly why. I'm going to take an existing agentic RAG with MCP project I did recently and I'm going to make it better. I'm going to make it more agentic. Let me show you. I'm in a repo called expert. I'm going to put a link in the description. You can just watch what I do or you can follow along in the code if you wish. And I'm going to start by going to the nextlevel.ipynb notebook, which is where I will show you what I'm cooking. And this is an agentic RAG project which I built out recently.
Something that can answer questions about my courses. It's kind of selfishly about me and you should make a version of this that's about you. Make it specialize in your domain. But now I'm going to make it even more of an expert.
The previous version was able to answer questions because it took a lot of content, turned it into chunks, and put them in a vector data store using an MCP server. Tons of buzzwords in that last sentence, but we did it and it worked. But it was only okay. It still had a few flaws. If you tried it out yourself, you'll have seen that it got stuck sometimes.
And the way to make it tighter is to add a tool that gives it more concrete information. We're going to start with a list of specific questions and answers that we want to equip our agent with. If you look in the knowledge directory on the left, you'll see that there is a new file called FAQ.json.
And this has a bunch of JSON lines, JSONL. Each line represents a different question and answer that I have taken from my website. I actually just asked Claude Code to create this JSONL file by scraping my website, and you could do something similar to make a JSONL file in this format, answering any kinds of questions that you want to equip your agentic RAG system with.
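A file in this format can be read one line at a time. The following is a minimal sketch, not the notebook's actual code: the field names question and answer, the file name faq.jsonl, and the two sample records are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_faqs(path: Path) -> list[dict]:
    """Read a JSONL file: one JSON document per non-empty line."""
    faqs = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                faqs.append(json.loads(line))
    return faqs


# Two made-up records standing in for the real FAQ file (assumed schema).
sample = (
    '{"question": "Which course covers RAG?", "answer": "Placeholder answer A."}\n'
    '{"question": "Do I need a GPU?", "answer": "Placeholder answer B."}\n'
)

with tempfile.TemporaryDirectory() as tmp:
    faq_path = Path(tmp) / "faq.jsonl"  # hypothetical file name
    faq_path.write_text(sample, encoding="utf-8")
    faqs = load_faqs(faq_path)

print(len(faqs))
print(faqs[0]["question"])
```

Because every line is an independent JSON document, a record can be appended or removed without touching the rest of the file, which makes the format convenient for a growing FAQ.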
Okay, let's get started in the nextlevel.ipynb notebook. I'm running some imports. I'm going to use the model GPT 5.5, but you can use GPT 5.4 mini if you prefer, or you could use Cerebras like we did in the last video if you like Cerebras. So, we're using GPT 5.5.
Then we've got some constants to set up for these directories here. And we're going to use an MCP server. We just set the parameters to say we want to use the Qdrant MCP server to retrieve memories from our local vector database that's sitting right here, which we set up last time. And if you didn't do it, you can just run this ingest notebook to set up your local database.

So what I'm now going to do is just use general Qdrant code to look at this vector database and see what's inside it. We set it up last time based on scraping the web and using the MCP server to actually write to the Qdrant database. And there are now 190 different chunks of knowledge about me stored there, about my courses, about my background, and everything else. And hopefully you've got something that pertains to you or to a subject of your choosing.

Okay, let's now supplement it with some specific questions and answers. The JSONL file is a file where each line is a JSON document. So I can just read that into a variable called FAQs by iterating through the file, loading each line in turn, and parsing the JSON. This is now a list of FAQs, each with a question and an answer. And I can use this as I develop my system prompt, my instructions. And here it is.
It's in the form of some markdown. It's got a Role heading, a Memory heading, and an FAQ heading. Let's go through it.

The Role section sets the scene. It tells the model that it is answering questions about me and my courses. The Memory section says always use your Qdrant find memory tool to find relevant information. That is the MCP server. And now the FAQ tool.

So look, here's the thing. The FAQ is quite long. There are a lot of answers, and some of them are several pages long. We couldn't fit all of that in the context because we would overwhelm the LLM. We might use up the whole context window. It would be expensive and it would be too much irrelevant information. So this is the trick: we're going to tell the LLM the questions themselves, and we're going to give it a tool so that it can look up the answer to any of the questions should it wish to. That's the trick.

So we explain that clearly here, and then we list out each of the questions by number. We iterate through the FAQs and we put in the question number and then the question, and that sets up our LLM. Let's see how it looks. Here it is. This is what our prompt looks like when I print it out. It has the role, the memory, the FAQ, and then it lists out all of the questions by number. It's still quite a lot, but believe me, it's a lot less than if we also had all of the answers in there.
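The prompt layout described above, with three headings and a numbered list of questions but no answers, can be sketched roughly as follows. The wording under each heading, the function name build_instructions, and the sample FAQs are assumptions for illustration; only the overall shape comes from the walkthrough.

```python
def build_instructions(faqs: list[dict]) -> str:
    """Build a markdown system prompt that lists FAQ questions, not answers."""
    lines = [
        "# Role",
        "You answer questions about the author and their courses.",
        "",
        "# Memory",
        "Always use your memory tool to find relevant information.",
        "",
        "# FAQ",
        "The numbered questions below have prepared answers. Use the FAQ tool",
        "with the question number to retrieve the answer to a relevant one.",
        "",
    ]
    # Only the question text goes into the prompt; the answers stay behind the tool.
    for number, faq in enumerate(faqs, start=1):
        lines.append(f"{number}. {faq['question']}")
    return "\n".join(lines)


# Made-up FAQs with the assumed question/answer schema.
sample_faqs = [
    {"question": "Which course covers RAG?", "answer": "Placeholder answer A."},
    {"question": "Do I need a GPU?", "answer": "Placeholder answer B."},
]

instructions = build_instructions(sample_faqs)
print(instructions)
```

The prompt grows with the number of questions but not with the length of the answers, which is the saving the approach relies on.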
Okay, now we're ready to build our new agentic RAG. We'll start with a couple of quick utilities. This line here is an elegant little dictionary comprehension, which is a Python trick to turn this FAQs list into a nice dictionary where the key is the question number and the value is the FAQ itself. And then I can just have a little function, find_faq, which, if it's given a question number, will return the question and the answer. And then we can use the beautiful OpenAI Agents SDK function_tool decorator, which simply allows us to turn this function into a tool, with this docstring as the description that will be provided to the LLM to tell it when to use this tool: use this tool to retrieve the answer to a frequently asked question by its number. You pass in a number, you get back the question and its answer.

Okay, let's run that and see how it works. I can add in some code and call find_faq for number three, and when I run that we get back the result for that question. Now if instead of that I look at faq_tool, which is the decorated function, we don't get back a function. We get back a FunctionTool, because that's what the OpenAI Agents SDK gives us. It has a description right there, and it also has a params_json_schema. If I run that, we get to see the JSON that will be sent to the LLM so that it knows how to use this tool: how to pass in a question number, which needs to be an integer. And we can also do faq_tool.description to get the description of the tool that will be sent to the LLM, and it's taken exactly from this docstring right there. You can see that's how the OpenAI Agents SDK has done it: simply by using this decorator, @function_tool, we have turned this function into a tool that we can equip our agent with.

And now we'll just have some example questions that we'll test it on. And we'll use this short-term memory, this SQLiteSession.
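The lookup helper and its tool wrapper can be sketched along these lines. This is an illustration under assumptions, not the notebook's code: the names faq_by_number, find_faq and faq_tool, the numbering by list position, and the sample data are hypothetical. The function_tool import is the OpenAI Agents SDK decorator; the sketch falls back to no tool when the SDK is not installed so that the plain lookup still runs.

```python
# Made-up FAQs with an assumed question/answer schema.
faqs = [
    {"question": "Which course covers RAG?", "answer": "Placeholder answer A."},
    {"question": "Do I need a GPU?", "answer": "Placeholder answer B."},
]

# Dictionary comprehension: question number -> FAQ record.
# Assumption: numbers are list positions starting at 1.
faq_by_number = {number: faq for number, faq in enumerate(faqs, start=1)}


def find_faq(question_number: int) -> str:
    """Use this tool to retrieve the answer to a frequently asked question by its number."""
    faq = faq_by_number.get(question_number)
    if faq is None:
        return f"No FAQ found with number {question_number}"
    return f"Question: {faq['question']}\nAnswer: {faq['answer']}"


try:
    # With the SDK installed, wrapping the function produces a FunctionTool
    # whose description is taken from the docstring above.
    from agents import function_tool

    faq_tool = function_tool(find_faq)
except ImportError:
    faq_tool = None  # SDK not available; the plain function still works

print(find_faq(2))
```

Keeping the undecorated function around is a deliberate choice in this sketch: the wrapped tool object is meant for the agent, while ordinary code can keep calling find_faq directly.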
SQLiteSession is another nice feature of the OpenAI Agents SDK, one that lets us have a short-term memory. And with that, we are ready to build our final agentic RAG product. We're about to write a chat function that will be used to call our agent, but I want one little extra piece in there. Sometimes LLMs are not the right solution. And here's one case in particular. If the user just types something like Q3 because they want to know the answer to my third FAQ question, I don't need to go to an LLM to do that. I can just look it up and give them the answer. And we can do that with code. So let's do it with code. I've got this little function here, has_instant_answer, which, when a user sends a message, checks whether that message is just something like Q1 or Q2. If so, get_instant_answer will simply look that up in the FAQ and return it. So look, if I call get_instant_answer for Q2, say, then we get question two right away. That is what we're going to use in our chat function, and there are only about three cells to go.
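One way to implement this kind of shortcut is a strict pattern match on the whole message. The sketch below is an assumption about how it could work, not the code from the notebook: the regular expression, the function names, and the sample FAQ data are hypothetical.

```python
import re

# Made-up FAQ data keyed by question number (assumed schema).
faq_by_number = {
    1: {"question": "Which course covers RAG?", "answer": "Placeholder answer A."},
    2: {"question": "Do I need a GPU?", "answer": "Placeholder answer B."},
}

# The whole message must look like "Q2" or "q 2"; anything longer goes to the agent.
INSTANT_PATTERN = re.compile(r"q\s*(\d+)", re.IGNORECASE)


def has_instant_answer(message: str) -> bool:
    """True when the message is just a question code for an FAQ that exists."""
    match = INSTANT_PATTERN.fullmatch(message.strip())
    return match is not None and int(match.group(1)) in faq_by_number


def get_instant_answer(message: str) -> str:
    """Look up the FAQ directly, with no LLM call involved."""
    match = INSTANT_PATTERN.fullmatch(message.strip())
    if match is None:
        raise ValueError(f"Not a question code: {message!r}")
    faq = faq_by_number[int(match.group(1))]
    return f"Question: {faq['question']}\nAnswer: {faq['answer']}"


print(has_instant_answer("Q2"))
print(get_instant_answer("Q2"))
```

Using fullmatch rather than search keeps the shortcut conservative: a message such as "Q2 but for beginners" still goes to the agent.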
This cell is a janky cell. One of the things I dislike about the OpenAI Agents SDK, and I like almost everything, is that there's some hacky stuff you have to do if you're streaming back results from the agent and you want to decide how to print them. You have to do these isinstance tests on the events that come back. So I've packaged all of this into one useful function that figures out what's going on. If the LLM is trying to call a tool, it can print that. If the LLM is streaming back a bit of text, it can print that.
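The general shape of such a helper is a function that inspects each event's type and returns the text to display. The sketch below shows only that pattern: TextDelta and ToolCall are hypothetical stand-in classes invented for this example, not the SDK's real event types, which carry more structure.

```python
from dataclasses import dataclass


@dataclass
class TextDelta:
    """Stand-in for an event carrying a fragment of streamed text."""
    text: str


@dataclass
class ToolCall:
    """Stand-in for an event announcing that the model is calling a tool."""
    tool_name: str


def render_event(event: object) -> str:
    """Return the text to display for one event, or '' for events we ignore."""
    if isinstance(event, ToolCall):
        return f"\n[calling tool: {event.tool_name}]\n"
    if isinstance(event, TextDelta):
        return event.text
    return ""  # every other kind of event is skipped


events = [ToolCall("find_faq"), TextDelta("You can "), TextDelta("do it."), object()]
output = "".join(render_event(event) for event in events)
print(output)
```

Collecting the type checks in one function keeps the streaming loop itself short, since the loop only has to append whatever the helper returns.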
So this is a little bit janky. You can look through it if you wish, but it's not super important.

And now we get to the main function, which is going to call our agent. This run function takes a message from the user and streams back the results. It uses an MCP server, so we have this async with MCPServerStdio and we pass in the vector store parameters, so it's going to use the Qdrant memory. We then create our new agent. The name is expert. We use the model we defined above and the instructions that we defined. We give it access to the FAQ tool, so it can look up the answer to a question if it recognizes one. And we give it access to the MCP server to retrieve memories. And then we call Runner.run, but we're doing it a bit differently: we're calling Runner.run_streamed, so it will stream back results. We tell it the agent. We tell it the message from the user. And we give it the short-term memory in this object combo that we set above. Then we simply take the results and loop through them. We're streaming back results, and for each one we use this janky function here to strip out what we need to know. And then we yield the total results so far. That's the standard pattern for streaming back results and showing them in Gradio.

So I run that, and we end up with the Gradio chat function. It's just going to say: look, the user has just sent me a message. Let me check if I have an instant answer that doesn't even need an LLM at all. If so, I'm just going to return it. Otherwise, I will call the run function, the LLM, the agent. I will stream back results and show them in Gradio. That is our chat function.

And without further ado, we're ready to see how it looks. It all comes down to this. We create a chat interface. We give it the chat callback function. We give it some CSS. I didn't write that CSS myself because I'm terrible at writing CSS, but of course Claude Code wrote it, and Claude Code is a genius at CSS. So let's bring up this user interface. Let's see how it looks.
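The control flow of the chat callback, with the instant-answer check first and accumulated streaming second, can be sketched without any external dependencies. Everything named here is an assumption for illustration: fake_agent_stream stands in for the real streamed agent run, and FAQ_ANSWERS, instant_answer, and collect are hypothetical helpers.

```python
import asyncio
from typing import AsyncIterator, Optional

# Made-up FAQ answers keyed by question number.
FAQ_ANSWERS = {3: "Placeholder answer to question 3."}


def instant_answer(message: str) -> Optional[str]:
    """Return a stored answer when the message is just a code like 'Q3'."""
    text = message.strip().upper()
    if text.startswith("Q") and text[1:].isdecimal():
        return FAQ_ANSWERS.get(int(text[1:]))
    return None


async def fake_agent_stream(message: str) -> AsyncIterator[str]:
    """Stand-in for the streamed agent run; yields text fragments."""
    for piece in ["Thinking about: ", message]:
        await asyncio.sleep(0)  # pretend to wait for the model
        yield piece


async def chat(message: str, history: list) -> AsyncIterator[str]:
    """Chat callback: bypass the agent when possible, else stream its reply."""
    answer = instant_answer(message)
    if answer is not None:
        yield answer
        return
    total = ""
    async for piece in fake_agent_stream(message):
        total += piece
        yield total  # yield the whole reply so far, not just the new fragment


async def collect(message: str) -> list[str]:
    """Gather every value the callback yields, for demonstration."""
    return [chunk async for chunk in chat(message, [])]


print(asyncio.run(collect("Q3")))
print(asyncio.run(collect("Which course covers RAG?")))
```

Yielding the running total rather than each fragment matches how a Gradio chat interface treats a generator callback: each yielded value replaces the message displayed so far.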
Great. It looks like it did last time. Really nice. Let's ask which course covers RAG. Let's call the model with that. We're calling GPT 5.5, which means it's not going to be as speedy as Cerebras was, unless you've chosen Cerebras. You'll see that it called qdrant-find a couple of times. It called the memory tool, the tool results came back, and we got the right answer. It used its memory correctly.
And now I've started it again. I'm going to ask it what job I can get after I take your courses. That happens to be a question from the FAQ, and hopefully it's going to realize that and look it up. Yes, it's calling the FAQ tool, and it's responding with exactly the information from my FAQ, and it's included links to the curriculum and details. A nice comprehensive answer.
You can see it used both the Qdrant tool to get its memory and the FAQ tool at the same time. Both tools ran in parallel so that the agentic RAG could get the most possible information and then give a good comprehensive answer. This is great.

And as the final example: I love to say to people, sometimes LLMs are not the answer. And indeed, you can just type Q37, for example, if you are interested in how to go about setting up a new project from scratch with things like uv, and bam, the answer comes back instantly, because it knows just to respond immediately with the contents of question 37. This also goes to show that I couldn't really have populated the context in advance with all the Q&A. It would just be way too long. And that's why the tool works so well for us in this case.

And that brings together our agentic RAG. We now have something that can use vector lookup in long-term memory. It's also got short-term memory. And now it also has a tool to look up FAQs. Plus, we bypass the whole LLM if we can and just provide the answer instantly. That is my new expert, my new question answerer. And if you didn't already love agentic RAG, then I certainly hope you do now. Please do apply this to your domain. Build your own agentic RAG. And show your love for agentic RAG by liking and subscribing to this video, please. That's the way I know that you like these and want me to make more of them. And I hope to see you very soon for another video.