We built a RAG system for a real client
Added: 2026-05-08

2,975 views · 130 · 29:33 · codebasics · Original Release: 2026-05-01

This video strips away the AI hype to reveal that production-grade RAG is essentially high-stakes data plumbing. It shows that the "intelligence" in these systems depends less on the model than on the manual engineering of document structures.

[00:00:00]In real life, we have to use this kind of mapping, these heuristic, custom approaches to retrieve the relevant context. If the context exceeds the context window, you have to do pruning, summarization, all those techniques.

[00:00:14]How do we reduce the manual labor of these junior associates so that they can work on higher-level tasks?

[00:00:22]Today I'm going to talk about how we built an AI project based on retrieval-augmented generation (RAG) for one of our clients at our company, AtliQ. I will give you a glimpse of the initial discussions that happened. We'll go over the technical architecture, and we'll even go over some of the code. This way you will understand how AI projects are built in real life. If you're a student, this will help with your project portfolio and increase your chances of landing a job. If you are an AI engineer building projects for a company, you will get some understanding of the challenges we faced and how we overcame them. Let's go over the problem statement for this project. I'm going to fully respect the non-disclosure agreement that we had with our client.

[00:01:13]So we are not going to share the real client name or any sensitive information. I have changed the domain a little bit, but the fundamental AI engineering principles are the same. There are many immigration law firms in the US, and they help their clients with matters such as filing for an H1B visa, family-based immigration, or asylum. Harrison and Kim Law Group is one such firm, and it has multiple attorneys. Let's say Bob, Julie, and Moan are the attorneys working at this firm. I actually used to be a client of this kind of firm: I held H1B status in the US at one point, and I consulted a firm like this to file an EAD (Employment Authorization Document) for my wife, who was on H4, which is a dependent visa. When I go to Harrison and Kim, they assign a specific attorney to me. Let's say Julie is my attorney; she is helping me file the EAD for my wife. Julie also has her own small team, say Mike and Abdul, and the job of this team is to study the USCIS policy documents. USCIS is the agency that handles immigration matters in the US, and it publishes all the rules and regulations, which keep getting updated every year or every few months. So the team has to constantly keep an eye on the latest rules and have complete, in-depth knowledge of them. They also need to look into thousands of cases: they have handled hundreds or thousands themselves, and you will find many more on the website. This is their knowledge base.

This entire team helps a client like me using this knowledge base. The problem is that there are millions of such cases, the rules are constantly changing, and it is hard for the team to keep up. We are only human, after all; we forget things, and we don't have the capacity to comprehend thousands or millions of cases that are ongoing or already closed. So knowledge retrieval is a big problem here, and we know that retrieval-augmented generation can solve it: we can build a chatbot like this one. Whenever you ask a question, the chatbot draws on the latest USCIS laws and policies plus all the individual cases that have been fought, and it generates a comprehensive answer. Not only that, it also creates references, both to the relevant legal clauses and to the past cases. Julie uses this tool to retrieve information faster, and she can also refer to these cases. The previous process was manual, and it led to two problems. The first is delay, because so much research has to be done. The second is accuracy: a human team can make mistakes, so accuracy was low. The aim of building this AI system is to improve on both fronts: reduce the delay and improve the accuracy. Now, folks, here is the USCIS website, which has all the rules and regulations. You can search it, for example for H4, and you will find all the detailed rules.

Let's say you want to know something about changing to F or M student status; all the rules are there. To supply the data, the firm creates a PDF file organized into sections. This section, for example, is for H4 dependent spouses.

[00:06:23]Then you can have section two for nonimmigrant F and M student information, and so on. Each section is a parent section, and each parent section has child clauses: clause 1.1, 1.2, 1.3, and so on. So this PDF file is, in effect, your database.

[00:06:43]So this is one category of your knowledge. What is the other category? The cases. You can go to the website that hosts the repository of past cases. If I search for H4 EAD, I will find all the cases on H4 EAD, and each one has detailed information. Of course, these cases are so long that manually searching through them and comprehending the information is pretty difficult. So we have these two knowledge sources.

Before we continue, let me take a moment to thank today's sponsor, Progress. ChatGPT is good at general knowledge, but it doesn't know anything about your private enterprise data. You can build a RAG solution, but it has its own limitations, such as no verification, no orchestration, and no enterprise controls.

[00:07:42]Progress Agentic RAG solves all these issues. You build a knowledge box by dropping in your PDFs, docs, and videos, or by syncing from SharePoint, Google Drive, or S3. From there, data augmentation agents enrich your content and extract relationships from it. With 30-plus customized retrieval strategies, it always pulls the most relevant context.

[00:08:06]Then, when you're ready, you can directly test the agentic RAG search functionality and fine-tune the search configuration to match your use case.

[00:08:14]You can also track RAG performance metrics, so you're not just shipping something and hoping it works; you can see what's actually working. If trust, accuracy, and traceability matter for your AI use case, check the link in the video description below.

All right, back to our video. One knowledge source is laws; the second is cases. If you have built a RAG pipeline by following YouTube tutorials, you probably worked with only one collection, one knowledge source. Here there are multiple. You need to chunk each source, create embeddings for the chunks, and store them in a vector database; both the laws and the cases go into the vector database after chunking and embedding. Now, the normal chunking and retrieval techniques are not going to work here. Say you store everything in a single vector database. When a user query comes in and you retrieve from this combined knowledge with, say, top-k equal to 5, one problem shows up.

[00:09:39]You select the top five chunks, and what if all five come from the cases? You want to look at the actual law too. A case is an individual story, an individual event; a law is generic. So you want both laws and cases. But with a single vector database and plain top-k, it can happen that everything you retrieve comes from the cases, or everything comes from the laws. That is not good; you need a balance between the two. So what if you retrieve chunks from the laws and, separately, from the cases, and then merge them? That is what we do.
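A minimal sketch of the balanced retrieval idea, assuming two in-memory collections of toy 2-D embeddings instead of a real vector database. The names `top_k`, `balanced_retrieve`, `k_laws`, and `k_cases` are hypothetical, and the per-collection k values are illustrative rather than the project's settings. The demo contrasts one combined top-k with separate per-collection queries:

```python
import math
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for the two collections: chunk id -> embedding.
Collection = Dict[str, List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(collection: Collection, query: List[float], k: int) -> List[Tuple[str, float]]:
    """The k chunk ids most similar to the query, best first."""
    scored = [(cid, cosine(vec, query)) for cid, vec in collection.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def balanced_retrieve(laws: Collection, cases: Collection, query: List[float],
                      k_laws: int = 2, k_cases: int = 3) -> Dict[str, List[Tuple[str, float]]]:
    """Query each collection separately so neither source can crowd out the other."""
    return {"laws": top_k(laws, query, k_laws), "cases": top_k(cases, query, k_cases)}


if __name__ == "__main__":
    laws = {"law-1": [1.0, 0.0], "law-2": [0.7, 0.7]}
    cases = {"case-1": [0.1, 1.0], "case-2": [0.2, 1.0], "case-3": [0.3, 1.0]}
    query = [0.3, 1.0]
    # One combined index with top-k = 3: only case chunks come back.
    print(top_k({**laws, **cases}, query, 3))
    # Separate queries: two law chunks and three case chunks.
    print(balanced_retrieve(laws, cases, query))
```

The design choice is simply to give each knowledge source its own quota, so a cluster of similar case chunks cannot push every law chunk out of the result.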

[00:10:24]First, you store the laws and the cases in the vector database as embeddings. Now say somebody asks a question about H4 EAD: what is the process of filing for H4 EAD? We first do a semantic search over all the law chunks stored in the vector database. Let's say top-k is set to two, so you get two chunks back; or with top-k set to one, just one chunk. That chunk can sit in a child section. Say I ask about the H4 EAD filing process and mention AC21; naturally, the search finds the chunk about AC21. Now, instead of using fixed-size chunks, because fixed-size chunking and even some of the standard chunking techniques can create problems, we came up with a custom hybrid chunking strategy for this project. This is a lesson from real-life projects: sometimes you need a custom approach. To improve the quality of the retrieved context, we took the retrieved chunk and, for that chunk, took the whole relevant clause. And to improve the context further: this clause covers how to apply, but the overview of eligibility is also important.

[00:12:26]So we went up to the parent section. Using hierarchical chunking, you can get from a clause to its parent section, because for each clause we know its parent. In our hybrid technique, if the retrieved chunk was this clause, we actually took the entire parent section as the retrieved chunk. Let me put them side by side: this block here is the child chunk. We extend this child chunk to the entire parent. And what is the entire parent? Section one.

[00:13:30]This whole thing.

[00:13:33]So we expand it. The expanded segment of course still contains that child chunk, but it also contains the other clauses. So the entire section one is our retrieved segment. Next, we need to search the other collection, where we store the cases. For that, we search with the text, of course, but we also search with the clauses that are mentioned in the section. Let's say it mentions clause 106.
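The child-to-parent expansion described above can be sketched as follows, assuming each clause chunk stores the id of its parent section and that clause ids are dotted numbers such as `1.2`. `Chunk`, `expand_to_parent`, and `parent_titles` are hypothetical names for illustration, not the project's code:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Chunk:
    chunk_id: str   # dotted clause number, e.g. "1.2" (assumed id scheme)
    parent_id: str  # id of the parent section, e.g. "1"
    text: str


def expand_to_parent(hit: Chunk, chunks: Dict[str, Chunk], parent_titles: Dict[str, str]) -> str:
    """Replace a retrieved child clause with its whole parent section.

    The result still contains the retrieved clause, plus every sibling clause
    (eligibility overview, how to apply, ...) that shares the same parent.
    """
    siblings = sorted(
        (c for c in chunks.values() if c.parent_id == hit.parent_id),
        # Sort numerically so that "1.10" comes after "1.9".
        key=lambda c: [int(p) for p in c.chunk_id.split(".")],
    )
    body = "\n".join(f"Clause {c.chunk_id}: {c.text}" for c in siblings)
    return f"Section {hit.parent_id}: {parent_titles[hit.parent_id]}\n{body}"
```

Expanding to the parent trades a larger context for completeness: the LLM sees the eligibility rules next to the filing steps, even though only one clause matched the query.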

[00:14:18]So we take all the clauses mentioned there (clause 106, 1.1, 1.2, 106B) and search the second collection, the cases, using those clauses. In other words, we augment the retrieved segment: first we expand it to the parent section, then we add all the clauses it references. For this section, we say: search for clauses 106A and 106B, plus some of the retrieved text, in this collection and retrieve the top five most relevant cases. Let me use a different color for this. Say the retrieved cases are case one, case two, and case three, or chunk one, chunk two, and chunk three. The expanded section plus these cases becomes your final context. (I'm assuming you have some basic knowledge of RAG here.) Then you prepare your question, for example: what is the filing process for H4 EAD under AC21? You give the question and the final context to the LLM, and the LLM gives you the answer.
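A rough sketch of the clause-aware case search and final-context assembly, under stated assumptions: clause references are found with a simple regex on the word "Clause", each case record already lists the clauses it cites, and semantic scores come from a separate vector search. Ranking by clause overlap first and semantic score second is an assumed weighting; the video does not say how the two signals were combined. `extract_clauses`, `rank_cases`, and `build_final_context` are hypothetical names:

```python
import re
from typing import Dict, List, Set

# Matches "Clause 1.2", "Clause 106B", and similar references (assumed format).
CLAUSE_REF_RE = re.compile(r"[Cc]lause\s+(\d+(?:\.\d+)*[A-Za-z]?)")


def extract_clauses(section_text: str) -> Set[str]:
    """Collect the clause identifiers mentioned in an expanded parent section."""
    return set(CLAUSE_REF_RE.findall(section_text))


def rank_cases(cases: List[Dict], clauses: Set[str],
               semantic: Dict[str, float], k: int = 5) -> List[Dict]:
    """Rank cases by how many of the section's clauses they cite, then by semantic score."""
    def key(case: Dict):
        overlap = len(clauses & set(case["clauses"]))
        return (overlap, semantic.get(case["id"], 0.0))
    return sorted(cases, key=key, reverse=True)[:k]


def build_final_context(section_text: str, ranked_cases: List[Dict]) -> str:
    """Final context: the expanded parent section, then the cases, most relevant first."""
    parts = [section_text] + [f"[{c['id']}] {c['summary']}" for c in ranked_cases]
    return "\n\n".join(parts)
```

Keeping the cases ordered most relevant first matters later: if the context is too long, the tail of this list is what gets cut.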

[00:16:13]Now, the context can become big. In that case, we measure the context length and prune from the bottom. Say the first two cases are the most relevant and the third one is a little less relevant. If the final context exceeds the LLM's context window, we prune from the bottom: we remove the least relevant material at the end and pass the remaining context to the LLM. You can do context summarization as well, but in our project we used context pruning before giving the context to the LLM to generate the answer.

So, folks, that was the technical architecture. Now let me talk to the AI engineer who worked on this, and we'll discuss the challenges they faced and the technical stack. Folks, we have Shrikant here, our lead AI engineer, who worked on this project.
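The bottom-pruning step described in the architecture walkthrough might look like this minimal sketch. It assumes the cases are already sorted most relevant first and that the expanded parent section is never pruned; `prune_from_bottom` and `rough_token_count` are hypothetical, and the whitespace token count is only a stand-in for the model's own tokenizer:

```python
from typing import Callable, List


def prune_from_bottom(section: str, cases: List[str], max_tokens: int,
                      count_tokens: Callable[[str], int]) -> str:
    """Drop the least relevant cases (end of the list) until the context fits.

    `cases` must already be sorted most relevant first. The parent section is
    never pruned; if it alone exceeds the budget, it is returned as-is.
    """
    kept = list(cases)  # copy so the caller's list is not modified

    def render() -> str:
        return "\n\n".join([section] + kept)

    while kept and count_tokens(render()) > max_tokens:
        kept.pop()  # remove the bottom (least relevant) case
    return render()


def rough_token_count(text: str) -> int:
    """Crude whitespace-based estimate; a real system would use the model's tokenizer."""
    return len(text.split())
```

Pruning is cheaper and more predictable than summarization (no extra LLM call), at the cost of discarding the weakest evidence entirely.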

[00:17:16]This is the architecture diagram I just explained, and we are going to go over the technical stack. Shrikant, for each of these pieces, what technologies and tools did you use? >> For LLM orchestration, we used LangChain as the LLM framework. >> Okay. >> And for the vector database, we used Milvus. >> These are not two different vector databases? >> Not two different ones; there are two collections. >> Okay, so folks, it's the same vector database, but with two different collections: one for laws and the other for cases. >> Exactly. And we used PostgreSQL >> Okay. >> for the databases. >> Okay, for these databases. Yes. And is this data something the client gave you, or did you do any web scraping?

[00:18:04]>> No, the client gave us the data, because they also did the data sanitization. This data is publicly available, but it is spread all over the place. So they compiled the data and cleaned it so that it was ready to use.

[00:18:17]>> Okay. So before it came to you, they also did some technical work. Folks, this is something to remember: you don't always get clean data. Our client gathered the data, cleaned it, and then gave it to us. We stored it in PostgreSQL.

[00:18:35]>> Yes.

[00:18:35]>> Okay. Any other technical architecture components, like which LLM you used? Anything you want to mention?

[00:18:41]>> Since this is publicly available data, there were no data-compliance concerns, so we used GPT models, GPT-4o at that time. But on top of that we also placed guardrails, because there can still be some sensitive information, and we don't want to leak it to the models. So we implemented guardrails to protect PII, and since the application communicates with models, we also needed content moderation.

[00:19:09]>> That covers how we prevent prompt injections and so on. These guardrails are the safety measures we put in place for this application. And did you use LangChain to put those guardrails in place, or are there any technical details you want to mention?

[00:19:23]>> Yeah, for PII we used open-source GLiNER PII models. Again, that is open source >> and you can find it on Hugging Face.

[00:19:31]>> You can find it on Hugging Face.

[00:19:32]Exactly.

[00:19:32]>> So we're going to provide a link to that. Yes.

[00:19:35]>> And for content moderation, we have Llama Guard.

[00:19:39]>> Okay.

[00:19:39]>> That model is called Llama Guard, and it's from Meta. There are different versions of the model as well. These are the base models. On top of that, we implemented our own classifier in ONNX format, because GLiNER or Llama Guard offers reliability only to some extent, and Llama Guard is an LLM, so it adds latency.

[00:20:02]>> So to optimize the architecture, we also have this rule-based classification layer, and on top of that we have Llama Guard, which gives us a better guardrail system.
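A minimal sketch of the layered guardrail idea, assuming a cheap rule-based layer runs first and the slower LLM moderator is only called when the rules pass. The regexes here stand in for both the GLiNER PII model and the ONNX classifier, and `llm_moderator` is a hypothetical callable where a model such as Llama Guard would sit; none of this is the project's actual code:

```python
import re
from typing import Callable, Tuple

# Illustrative prompt-injection phrases; a real classifier would be far broader.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
# Regex stand-ins for a PII model: email addresses and SSN-shaped numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Mask emails and SSN-shaped numbers before text reaches any model."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


def guard(text: str, llm_moderator: Callable[[str], bool]) -> Tuple[bool, str]:
    """Run the cheap rule layer first; call the slow LLM moderator only if rules pass.

    Returns (allowed, sanitized_text).
    """
    sanitized = redact_pii(text)
    if any(p.search(sanitized) for p in INJECTION_PATTERNS):
        return False, sanitized  # blocked without paying the LLM's latency
    return llm_moderator(sanitized), sanitized
```

The ordering is the point of the design: obvious violations are rejected in microseconds, and only inputs that survive the rules pay for the LLM call.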

[00:20:11]>> I see. And in terms of deployment to production, is there anything you want to mention about the cloud, like how you deployed it and how you are monitoring it? >> We used AWS because of its simplicity, so the entire stack is on AWS.

[00:20:28]>> Okay. And how about the front end?

[00:20:30]>> The front end is in React.

[00:20:31]>> Okay.

[00:20:32]>> And yeah, the back end is also... >> Okay. So folks, now with Claude Code and AI coding, developing a front end is pretty straightforward, right? It has become easy. >> Yes.

[00:20:41]>> You can tell Claude Code, "Create a Next.js front end for my personal AI chatbot," and it will just do it.

[00:20:51]>> Yes, yes. It has become much simpler.

[00:20:53]>> Okay. Any challenges you would like to mention that you faced during this project? >> One thing is, obviously, people talk a lot about the good side of RAG, but in the real world there are a lot of challenges: the retrieval may not be good, the data itself may not be good, the chunking may not be good, and there will be a lot of hallucination, so we need to work on prompting as well. >> And when we are dealing with legal issues, accuracy is of utmost importance; we can't have hallucinations. >> So every answer should have a cited source >> so that we can cross-check it. So getting the chunking right, choosing the retrieval mechanisms, and reducing hallucinations all require a special architecture and careful consideration.
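One common way to enforce "every answer should have a cited source" is to tag each retrieved source and then check the tags that appear in the answer. This sketch assumes a `[S1]`, `[S2]` tagging scheme and hypothetical helpers `build_cited_prompt` and `check_citations`; the prompt wording is illustrative, not the project's prompt:

```python
import re
from typing import Dict, List, Set, Tuple

CITATION_RE = re.compile(r"\[(S\d+)\]")


def build_cited_prompt(question: str, sources: List[Dict[str, str]]) -> str:
    """Number each source as [S1], [S2], ... and ask the model to cite them."""
    lines = [f"[S{i}] ({s['ref']}) {s['text']}" for i, s in enumerate(sources, start=1)]
    return (
        "Answer using only the sources below. Cite every claim with its source tag, "
        "e.g. [S1]. If the sources do not answer the question, say so.\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


def check_citations(answer: str, num_sources: int) -> Tuple[Set[str], Set[str]]:
    """Split the answer's citation tags into (valid, invalid) sets.

    An empty valid set means the answer cites nothing; a non-empty invalid set
    means it cites a source that was never provided. Either can trigger a retry.
    """
    cited = set(CITATION_RE.findall(answer))
    valid = {f"S{i}" for i in range(1, num_sources + 1)}
    return cited & valid, cited - valid
```

Because each tag maps back to a clause or case, a reviewer like Julie can click through to the exact source instead of trusting the generated text.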

[00:21:43]>> Okay. You also mentioned this: let's say I'm looking at an H4 EAD case with the AC21 clause, and the search finds a past case, but that case was from 2020 >> while the legal documents we are using are from 2026. >> Okay. So what happened was that a specific section and clause, say section 1, clause 5.1 in 2020, might have evolved into section 1, clause 5.3.

[00:22:22]>> Yes. So I think you mentioned that the client gave us this kind of mapping.

[00:22:27]>> Yeah.

[00:22:28]>> And folks, when you get this kind of mapping: let's say this case is from 2020, where the clause is 5.1; in 2026 it is 5.3; and in 2022, say, it was still section 1 but clause 5.2. In that case, I think you did some special handling, right? Do you want to talk about it? >> Here, the clause changes depending on the year, because there are a lot of amendments. We are looking at the 2026 data, and in 2026 the rule is clause 5.3. But when we look at the past, there won't be any cases with 5.3, because the rule was under a different sub-clause. >> Okay. >> So we go to 5.1 >> and see what cases there are in 2020 with 5.1. >> Okay. >> Then we go to 2022 and see what cases there are under 5.2, and then we search there. >> Okay. So just to revise, this is the step-by-step process. First you find the actual clause, the latest one; let's say you found 5.3. But you can't just search for 5.3, because 5.3 is valid only in 2026.

[00:23:48]So you have a mapping. For 2020, the mapping gives 5.1, so you search the 2020 cases with 5.1; for 2022, the mapping gives 5.2, so you search for the cases containing clause 5.2. That is how you retrieve the relevant context. This is called context engineering, folks; what these guys did is context engineering. From this you can see the gap with a typical academic RAG project. Shrikant, what students learn there is: you have a nice clean PDF document, a single vector database with a single collection, you retrieve five chunks, you get your answer, and your project is a success.

[00:24:33]>> Yeah. But you saw that in real life we have to use this kind of mapping, these heuristic, custom approaches to retrieve the relevant context. If the context exceeds the context window, you have to do pruning, summarization, all those techniques. And we literally used a custom retrieval technique, right?

[00:24:54]>> Custom retrieval. Yes.
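The clause-mapping step from the H4 EAD example can be sketched as below. It assumes the client's mapping is stored as "year the numbering took effect" to "clause id", and that a number stays in force until the next amendment; `HISTORY`, `clause_in_year`, and `matching_cases` are hypothetical names, and the `"1/5.1"` section/clause format is made up for the example:

```python
import bisect
from typing import Dict, List, Optional

# Hypothetical client-supplied amendment history for one rule.
# Key: year a numbering took effect; value: "section/clause" id used from then on.
HISTORY: Dict[int, str] = {2020: "1/5.1", 2022: "1/5.2", 2026: "1/5.3"}


def clause_in_year(history: Dict[int, str], year: int) -> Optional[str]:
    """Clause id in force in `year`: the latest amendment on or before that year."""
    years = sorted(history)
    i = bisect.bisect_right(years, year) - 1
    return history[years[i]] if i >= 0 else None


def matching_cases(history: Dict[int, str], cases: List[Dict]) -> List[str]:
    """Ids of cases that cite the clause number that was valid in the case's own year."""
    return [c["id"] for c in cases if clause_in_year(history, c["year"]) in c["clauses"]]
```

For example, a 2023 case citing `1/5.2` matches the 2026 rule `1/5.3`, while a 2022 case citing `1/5.1` does not, because by 2022 that number no longer referred to this rule.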

[00:24:55]>> And you mentioned that you are working on a GitHub repository or a white paper for this. Do you want to talk about it?

[00:25:00]>> Yes. It is not general-purpose chunking, and in any RAG project, chunking is the very first and most important thing: the way we parse a document and create chunks matters most. The standard chunking mechanisms, like overlapping, semantic, or hierarchical chunking, didn't work. If you look at the data, it is very structured, >> and if you want to preserve the hierarchy in the laws, you need to follow that structure. So we designed our own custom parent-child chunking, >> and we are also writing a white paper on it so that we can explain how we approached it >> and other companies can also benefit.
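The custom parent-child chunking itself has not been published yet, so the following is only a sketch of the general idea under assumed input: plain text with `Section N: title` and `Clause N.M: text` header lines. `parent_child_chunks` and both regexes are hypothetical; real legal documents would need much more robust parsing:

```python
import re
from typing import Dict, List, Tuple

SECTION_HEADER_RE = re.compile(r"^Section\s+(\d+)\s*[:.-]\s*(.*)$")
CLAUSE_HEADER_RE = re.compile(r"^Clause\s+(\d+(?:\.\d+)+)\s*[:.-]?\s*(.*)$")


def parent_child_chunks(doc: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split a Section/Clause-structured document into parents and child chunks.

    Children are clauses (each keeps its parent section id); parents are whole
    sections, so a retrieved clause can later be expanded to its section.
    Continuation lines are appended to the open clause. Text before the first
    section header is ignored.
    """
    parents: Dict[str, List[str]] = {}
    children: List[Dict[str, str]] = []
    current = None  # id of the section currently being read
    for raw in doc.splitlines():
        line = raw.strip()
        if not line:
            continue
        sec = SECTION_HEADER_RE.match(line)
        cla = CLAUSE_HEADER_RE.match(line)
        if sec:
            current = sec.group(1)
            parents[current] = [line]
        elif cla and current is not None:
            children.append({"id": cla.group(1), "parent": current, "text": cla.group(2)})
            parents[current].append(line)
        elif current is not None:
            # Continuation line: extend the last clause if it belongs to this section.
            if children and children[-1]["parent"] == current:
                children[-1]["text"] += " " + line
            parents[current].append(line)
    return {sid: "\n".join(lines) for sid, lines in parents.items()}, children
```

Unlike fixed-size chunking, the chunk boundaries here follow the document's own numbering, so no clause is ever split in half and every child knows its parent.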

[00:25:40]>> Exactly. So folks, that parent-child chunking is what I explained in the initial section. Shrikant is going to write a white paper, and whenever it is out we will provide a link to it; we are going to open-source it so anybody can use it. So far I have given you the technical details, but the business side of any project is equally important. Karan was involved in dealing with the client and exploring the business requirements, so let's go and talk to him as well.

[00:26:18]Hey Kitty.

[00:26:20]Hey. So Katie, you interacted with the client directly.

[00:26:26]I just want to know what benefit the client is getting, or is going to get, out of this project.

[00:26:32]>> It is really simple. For every legal case being fought, you have a junior associate, a senior associate, and someone above them, generally a partner, junior or senior. In many cases, junior associates are just poring through the information in different documents, public records, and literature. What the client wanted to achieve with this project was to reduce the manual labor of these junior associates so that they could work on higher-level tasks.

[00:27:02]>> Okay, good. So this is going to improve the efficiency of the overall process, correct? And do you want to mention one challenge you faced with this project, one that many other projects probably face as well?

[00:27:17]>> One very common and very grounded challenge is that when someone nontechnical comes to us, they don't understand how uploading a document to a ChatGPT chat window differs from RAG. What used to happen was that the client compared the answers from the system we built with what they got from ChatGPT, and the difference was very large. The reason was simple: if you upload a document into a ChatGPT window, the whole document sits in the context window, so you get better results. A RAG project is different: the system has to search through thousands of documents to get an answer. This caused the difference the client kept experiencing.

[00:27:59]>> Okay. If there is one piece of advice you would give to anybody building a RAG project... Many viewers of this video will be learners or students studying AI, and some will be business folks who want to build RAG projects in their companies. Is there any advice you would give them?

[00:28:19]>> Uh one thing that I would always tell people is curate your data if you can.

[00:28:23]This is something we have tried with a UK-based project as well, and we are also doing it for the US-based law firm project. What the clients did in this case was curate certain documents and legal literature. Why? You can grab the data as is, feed it to the LLMs and embedding models, and it will work. But if you can convert it into a form that is more intuitive for the AI, which of course takes some effort, and make the data not just human-readable but AI-readable, you're going to see a difference. One common way: AI is good at reading raw Markdown, and humans are not. Humans like Markdown once a Markdown viewer renders it, but AI can read raw Markdown by default.
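As a small illustration of making structured text more AI-readable, this sketch turns assumed `Section N` / `Clause N.M` lines into Markdown headings so the hierarchy is explicit in the text itself. The function name and heading levels are hypothetical choices, not the curation the client actually performed:

```python
import re

MD_SECTION_RE = re.compile(r"^Section\s+\d+\b")
MD_CLAUSE_RE = re.compile(r"^Clause\s+\d+(?:\.\d+)+\b")


def to_markdown(plain: str) -> str:
    """Turn Section/Clause lines into Markdown headings; other lines pass through."""
    out = []
    for line in plain.splitlines():
        stripped = line.strip()
        if MD_SECTION_RE.match(stripped):
            out.append(f"## {stripped}")   # sections become level-2 headings
        elif MD_CLAUSE_RE.match(stripped):
            out.append(f"### {stripped}")  # clauses nest one level below
        else:
            out.append(stripped)
    return "\n".join(out)
```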

[00:29:04]So small changes like these also help in these cases. >> Okay, thank you. All right, that's it, folks. I hope you learned how to build a production-grade RAG project from this video. We are going to provide resources and more information in the video description, so please check it out.

[00:29:23]There's a comment box below; if you have any questions, feel free to ask. Thank you very much for watching.
