A RAG (Retrieval-Augmented Generation) application can be built using a single database (Oracle AI Database) that handles both embedding generation and vector storage, eliminating the need for separate embedding APIs, vector databases, and metadata stores. This unified approach simplifies the architecture by keeping all data operations within one database, reducing operational complexity, eliminating multiple API keys and billing systems, and enabling offline operation since no external services are required. The embedding model (all-MiniLM-L12-v2) is loaded directly into the database via SQL, and vector similarity search is performed using SQL queries with HNSW indexes for fast retrieval.
Deep Dive
Build a RAG App on Just One Database (Oracle AI Database + LangChain Demo)
Here is something I see all the time when developers try to build a RAG application for the first time. They follow a tutorial that tells them to sign up for a model API to handle the embeddings, then to sign up for a specialized vector database to store the vectors, set up a relational database to hold the original text and metadata, and wire all three services together using LangChain. By the time they actually have a working AI application, they're juggling three API keys, three different services that each send them a bill, and a sync process that is constantly trying to keep all three of them in line. I think most of us have gone through some version of that experience at least once. Today, I want to show you a different way to build this.
We're going to put together a complete RAG application that includes embeddings, vector search, retrieval, and citations, all running on a single database.
The whole thing will run on one Oracle AI Database with around 100 lines of Python and a local language model running on my laptop.
End-to-end, this entire stack costs $0 to run, and by the end of this video, you will have a repo you can clone and adapt to your own data. Let's get into it.

Before we get to the demo, let me walk you through why most RAG architectures end up more complicated than they need to be. Here is the architecture you will find in most production RAG systems today. You start with your documents and turn them into vectors by making an API call to an embedding service, like OpenAI or Cohere, which charges you by token and adds network latency to every request.
Once you have those vectors back, you store them in a dedicated vector database, which is likely another service with its own credentials, pricing, and network hops.
Meanwhile, the original text and the metadata about each chunk have to live in a database, too. So, you actually end up running Postgres or some other relational database alongside everything else. And because your vectors and your text now live in two different systems, you need a sync process to keep them lined up whenever a document gets added or updated. So, you end up with three services, three sets of credentials, and three things that can break independently of each other.
If you actually look at what's happening, it's all just data operations. We're storing some text, computing some math on it, and finding the closest matches.
None of that actually needs three separate services in three different places.
We just got used to the pattern because that's how early tools shipped, and most tutorials still teach it that way. The version we're building today collapses all of that into one place.
The embedding model, the vectors, the original text, and the metadata all live inside Oracle AI Database 26ai. When a question comes in, the database embeds it, finds the matching chunks, and returns them in a single query. The only piece of the system that lives outside the database is the language model that generates the final answer. And even that runs locally on my machine through Ollama.
So, end-to-end, the entire application makes zero external API calls. Let me show you what that looks like in practice. The whole project lives in a single GitHub repo called RAG on Oracle 26 AI. The link is in the description. I will clone it now and walk through the setup as I go, so you can follow along on your machine. Here's what's in it.
Five files do most of the work.
The Docker Compose YAML file runs Oracle locally.
The seed.py script loads the documents and embeds them. The app.py is the chat UI. The retrieval.sql file shows the actual SQL query that powers retrieval in isolation.
And the Makefile wraps everything in one-word commands, so you don't have to remember the individual steps. The whole project is small enough to read end-to-end in about 10 minutes. So, let's run the setup.
This creates a Python virtual environment, installs all requirements from requirements.txt, and copies the example environment file so the project has its credentials.
It takes about 30 seconds. Now, I will start the database. The first time you run this, Docker pulls the Oracle Database Free image, which is about 10 GB. So, plan for that download on your first try. After the first time, it starts in just seconds. The database needs about a minute to come up cold.
So, while it's loading, I want to pull the local language model.
We're using Llama 3.2 3B through Ollama, which is about 2 GB. You only need to do this once. Let me check on the database.
It's healthy. There is one more setup step before we can load any data.
This connects to the database and configures the `vector_memory_size` parameter, which sets the size of the memory pool Oracle uses for HNSW vector indexes.
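As a rough sketch of what a script like that might run, the snippet below builds the `ALTER SYSTEM` statement for the vector memory pool. The function names, the `1G` default, and the choice of `SCOPE=SPFILE` are assumptions for illustration, not taken from the repo; `configure_vector_memory` expects a python-oracledb connection with administrative privileges.

```python
def vector_memory_statement(size: str = "1G") -> str:
    """Build the ALTER SYSTEM statement that sizes the vector memory pool.

    `size` is a number followed by M or G, for example "512M" or "1G".
    The 1G default is an illustrative assumption, not the repo's value.
    """
    number, unit = size[:-1], size[-1:].upper()
    if not number.isdigit() or unit not in ("M", "G"):
        raise ValueError(f"expected a size like '512M' or '1G', got {size!r}")
    # SCOPE=SPFILE records the value for the next startup, which is why
    # a restart of the database follows this step.
    return f"ALTER SYSTEM SET vector_memory_size = {number}{unit} SCOPE=SPFILE"


def configure_vector_memory(connection, size: str = "1G") -> None:
    """Run the statement on an open, suitably privileged connection."""
    with connection.cursor() as cursor:
        cursor.execute(vector_memory_statement(size))


if __name__ == "__main__":
    print(vector_memory_statement("1G"))
```

The statement is built separately from the code that executes it, so you can inspect or log the exact SQL before letting it touch the database.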
Without this step, the database falls back to a slower index type. The script also restarts the database, so we wait another minute or so for it to come back. Now, we get to the part that's actually different from a normal RAG setup. Let me open up seed.py.
This is the line that tells the whole story. When we insert a chunk into the database, the embedding column is populated by a SQL function called `VECTOR_EMBEDDING`, which in turn calls a model loaded directly into the database.
We're not making an HTTP request to OpenAI. We're not running a Python embedding library. We're not doing any math outside the database. The database handles it all in the same transaction as the insert.
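To make that concrete, here is a minimal sketch of what such an insert could look like from Python. The table name `doc_chunks`, its columns, and the model name `DOC_MODEL` are hypothetical placeholders, not the names used in seed.py; the `VECTOR_EMBEDDING(model USING expr AS data)` form is Oracle's SQL syntax for calling a model that is already loaded in the database.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_insert_sql(table: str = "doc_chunks", model: str = "DOC_MODEL") -> str:
    """Build an INSERT whose embedding column is computed by the database.

    Table, column, and model names here are hypothetical placeholders.
    """
    for name in (table, model):
        # Identifiers cannot be bound, so validate them before formatting.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"INSERT INTO {table} (source, page, chunk_text, embedding) "
        f"VALUES (:source, :page, :chunk_text, "
        f"VECTOR_EMBEDDING({model} USING :chunk_text AS data))"
    )


def insert_chunks(connection, rows: list[dict]) -> None:
    """Insert rows shaped like {"source": ..., "page": ..., "chunk_text": ...}.

    `connection` is assumed to be an open python-oracledb connection.
    The text and its vector are written by the same statement, so they
    commit or roll back together.
    """
    with connection.cursor() as cursor:
        cursor.executemany(build_insert_sql(), rows)
    connection.commit()


if __name__ == "__main__":
    print(build_insert_sql())
```

Notice that the Python side only ever sends text. The vector never exists in application memory, which is the point of the pattern.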
The model we're using is all-MiniLM-L12-v2, which is the same model many of you have probably used with Sentence Transformers or Hugging Face. Oracle distributes a pre-built ONNX (Open Neural Network Exchange) version of it that you can load with one SQL call. And after that, generating an embedding becomes a regular function call in SQL. Once you see that, the whole multi-service architecture starts to feel a little unnecessary. Okay, let me check on the database. It's healthy and ready. So, now let's run the seed.
Right now, it's downloading the embedding model and three of Oracle's official documentation PDFs: the AI Vector Search Guide, the JSON Developer's Guide, and the Database Concepts Guide.
After that, it creates a dedicated demo user, loads the model into the database, and chunks all three documents, inserting them in batches. The whole process takes about a minute and a half and produces about 3,000 chunks across the three documents. At the very end, you will see it build an HNSW index, which stands for Hierarchical Navigable Small World, a graph-based index for fast similarity search. And there it is, it's all done.
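The chunk, batch, and index steps can be sketched roughly as follows. This is an illustrative fixed-size character chunker, not necessarily the strategy seed.py uses; the chunk size, overlap, batch size, table name, index name, and target accuracy are all assumed values. The index statement follows Oracle's `CREATE VECTOR INDEX ... ORGANIZATION INMEMORY NEIGHBOR GRAPH` form for HNSW indexes.

```python
from typing import Iterator


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


def batches(items: list, batch_size: int = 100) -> Iterator[list]:
    """Yield consecutive slices so each insert call stays a manageable size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical table and index names; run once after all chunks are inserted.
HNSW_INDEX_SQL = (
    "CREATE VECTOR INDEX doc_chunks_hnsw ON doc_chunks (embedding) "
    "ORGANIZATION INMEMORY NEIGHBOR GRAPH "
    "DISTANCE COSINE WITH TARGET ACCURACY 95"
)

if __name__ == "__main__":
    pieces = chunk_text("lorem ipsum " * 500)
    for batch in batches(pieces, batch_size=4):
        print(len(batch), "chunks in this batch")
```

Building the index after the bulk load, as the seed script does, means the graph is constructed once over the full set of vectors.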
Now, let's start the chat. Here is the UI. It's a simple chat interface with a search box at the top, four suggested questions tagged by category, and a status note at the bottom reminding us that nothing is leaving this machine.
Let me click the first suggestion to see how it handles a straightforward question. The answer streams in word by word, which is the LangChain stream API working with Ollama.
As the answer fills in on the left, you can see the citation panel on the right populating with five chunks from the documentation.
Each citation shows which document it came from and what page it's on. In this case, all five citations come from the AI Vector Search Guide, which makes sense because that's where the answer actually lives.

Now, let's try something a bit more interesting: a cross-document question. Take a look at the citations panel this time. We're now pulling chunks from two different documents, the AI Vector Search Guide and the JSON Developer's Guide, and the answer combines what each of them says into a single grounded response. The retriever didn't filter by document or do anything special. It just looked across the entire corpus for the chunks that were most similar to the question, and the answer happens to involve concepts from both documents. Cross-document retrieval like that is something you practically only get for free when all your data lives in the same place.

I want to show you what's actually happening in the database when one of these questions comes in because I think it makes the architecture really concrete.
This file is called retrieval.sql and it's the entire retrieval layer for the application. The whole query is less than 10 lines of SQL. Let me walk you through it. We take the user's question and pass it through `VECTOR_EMBEDDING`, which uses the model that's loaded inside the database to turn the question into a vector.
Then, we use `VECTOR_DISTANCE` to compare that question vector against the chunk vectors in the table using cosine distance.
The HNSW index makes this fast even as the corpus grows.
Finally, we order the results by distance and ask for the top five.
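Here is a sketch of a query in that shape, wrapped in Python so the top-k value can be checked before it reaches the SQL. It is a reconstruction from the description above, not the contents of retrieval.sql: the table, column, and model names are hypothetical, and the `APPROX` keyword (which lets the optimizer use the vector index) is an assumption.

```python
def build_retrieval_sql(top_k: int = 5) -> str:
    """Build a similarity query: embed the question, rank chunks by distance."""
    if not isinstance(top_k, int) or isinstance(top_k, bool) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return f"""
SELECT source, page, chunk_text,
       VECTOR_DISTANCE(
           embedding,
           VECTOR_EMBEDDING(DOC_MODEL USING :question AS data),
           COSINE) AS distance
FROM doc_chunks
ORDER BY distance
FETCH APPROX FIRST {top_k} ROWS ONLY
""".strip()


def retrieve(connection, question: str, top_k: int = 5) -> list[dict]:
    """Run the query on an open python-oracledb connection.

    Returns one dict per chunk, keyed by lowercase column name.
    """
    with connection.cursor() as cursor:
        cursor.execute(build_retrieval_sql(top_k), {"question": question})
        columns = [col[0].lower() for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor]


if __name__ == "__main__":
    print(build_retrieval_sql(5))
```

The question is passed as a bind variable, so the only thing the application sends is a string and the only thing it gets back is ranked rows.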
That's the whole retrieval layer in a single SQL statement with no external services and no application side merging. And here is the LangChain code that wires the database into the chat application.
The `OracleEmbeddings` class with the provider set to database tells LangChain to ask the database to handle embeddings instead of doing it in Python.
The `OracleVS` class is the vector store wrapper around the table we already created.
From those two lines onwards, it's a standard LangChain pipeline. If you've used LangChain before with any other vector store, this is going to look completely familiar. The only thing that's different is where the data lives.
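A minimal sketch of that wiring might look like the following. The import paths are the `langchain_community` ones and may differ depending on your LangChain version; the table name `DOC_CHUNKS`, the model name `DOC_MODEL`, and the helper names are assumptions rather than what app.py actually contains.

```python
def embedding_params(model_name: str) -> dict:
    """Parameters that tell OracleEmbeddings to embed inside the database."""
    return {"provider": "database", "model": model_name}


def build_retriever(connection, table_name: str = "DOC_CHUNKS",
                    model_name: str = "DOC_MODEL", top_k: int = 5):
    """Wrap an existing chunk table as a LangChain retriever.

    `connection` is assumed to be an open python-oracledb connection.
    Imports are done lazily so embedding_params() works without LangChain.
    """
    from langchain_community.embeddings.oracleai import OracleEmbeddings
    from langchain_community.vectorstores.oraclevs import OracleVS
    from langchain_community.vectorstores.utils import DistanceStrategy

    # Embeddings are computed by the model loaded in the database.
    embeddings = OracleEmbeddings(conn=connection,
                                  params=embedding_params(model_name))
    # The vector store reads from the same connection and table.
    store = OracleVS(
        client=connection,
        embedding_function=embeddings,
        table_name=table_name,
        distance_strategy=DistanceStrategy.COSINE,
    )
    return store.as_retriever(search_kwargs={"k": top_k})


if __name__ == "__main__":
    print(embedding_params("DOC_MODEL"))
```

Everything after `as_retriever` is ordinary LangChain, so the retriever can be dropped into whatever chain you would normally build around any other vector store.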
Let me quickly prove the offline claim because I think it's worth seeing it for yourself. The Wi-Fi is off, the machine is fully disconnected from the internet.
Let me ask another question.
The answer comes back exactly the same way because nothing about the query needs to leave the box.
The model is in the database, the vectors are in the database, the chunks are in the database, and the language model is on the machine.
There's nothing for the network to do.
Stepping back for a second, I want to tell you why this pattern matters beyond just being a fun demo.
The first reason is operational, and it's the one that hits hardest if you've been on call for a multi-service RAG system.
When all of your retrieval state lives in one database, you have one thing to back up, one thing to monitor, one thing to upgrade, and one set of credentials to rotate.
When something goes wrong at 3:00 in the morning, you don't have to figure out which of these three services is misbehaving and how their failure modes are interacting with each other.
The simplification adds up in ways that are hard to see until you've had to deal with the messier version yourself. The second reason is cost at scale.
Per-token pricing on embedding APIs is cheap when you're prototyping with a few thousand documents, but it gets expensive surprisingly fast when you're embedding millions of chunks or when you have to re-embed everything because you changed your chunking strategy.
With the model running inside the database, your embedding cost is just whatever your database compute already costs. And that's a number you're paying anyway. And honestly, beyond the operational and the cost story, there is just something nice about a stack you can fully understand by reading the code. Everything that happens in this application is visible to you. You can read the SQL and see exactly which rows are being compared.
You can read the Python code and see exactly how the chain is structured.
There's no black box somewhere out there doing magic that you have to take on faith.
For a lot of teams, that kind of transparency is worth more than any single feature.
If you want to try this on your own machine, the repo is in the description.
You can clone it, run make setup, then make seed, then make run. You'll have a working RAG application running locally in about 10 minutes, most of which is just waiting for Docker and Ollama to download things for the first time. To swap in your own data, you only need to touch seed.py. There's a list called documents at the top with three URLs and source names. Replace those with your own files, run make seed again, and you have a RAG application over your own corpus without changing anything else.
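As an illustration of that swap, a list of URLs and source names could be shaped like this. The `Document` class, its field names, the placeholder URLs, and the `validate` helper are all hypothetical; check seed.py for the structure the repo actually uses.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Document:
    url: str      # where the file is fetched from
    source: str   # the name shown next to each citation


# Placeholder entries; replace them with your own files.
DOCUMENTS = [
    Document("https://example.com/docs/handbook.pdf", "Team Handbook"),
    Document("https://example.com/docs/runbook.pdf", "On-Call Runbook"),
    Document("https://example.com/docs/api-guide.pdf", "API Guide"),
]


def validate(documents: list[Document]) -> int:
    """Check URLs and source names before seeding; return the entry count."""
    seen = set()
    for doc in documents:
        if urlparse(doc.url).scheme not in ("http", "https", "file"):
            raise ValueError(f"unsupported URL: {doc.url!r}")
        if not doc.source.strip():
            raise ValueError(f"missing source name for {doc.url!r}")
        if doc.source in seen:
            raise ValueError(f"duplicate source name: {doc.source!r}")
        seen.add(doc.source)
    return len(documents)


if __name__ == "__main__":
    print(validate(DOCUMENTS), "documents ready to seed")
```

Keeping source names unique matters because they are what the citations show, so two files with the same name would be indistinguishable in the UI.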
If you have questions about RAG, vector search, or Oracle integration specifically, drop them in the comments and I will do my best to answer them.
Make sure to like this video and subscribe to the channel if you found this useful, and I will see you in the next one.