Redis is cleverly pivoting from a simple cache to the essential backbone of AI memory to remain relevant in the agentic era. This shift highlights how modern infrastructure is consolidating, turning a specialized tool into a jack-of-all-trades for LLM workflows.
Deep Dive
Redis Isn't Just for Caching Anymore
LangGraph, OpenAI's Agents SDK, Google's Agent Development Kit, Microsoft Agent Framework, the A2A SDK. Five of the most important agent frameworks in the world right now. Built by five different companies, solving the agent problem five different ways. But look at their repos and the same name keeps showing up in places you would not expect: Redis. Not as a cache, but as memory, as state, as the retrieval layer, as the queue that ties everything together. Every one of these frameworks ships an official Redis adapter. In this video, we'll walk through the four hard problems every production agent has to solve, what the industry patterns look like for each one, and where Redis fits into the picture. By the end, you'll have a clear mental model for how modern agents are actually built under the hood. Let us get started.
This video is sponsored by Redis, specifically their AI incubator, which is where they are publishing everything they are building for agents. Agents need four pieces of infrastructure to work in production. Not four nice-to-haves, four hard requirements. Take any production agent you have seen. Every single time you'll find an LLM in the middle calling tools, reasoning about what comes back, and deciding what to do next.
But that simple loop hides four problems. First, the agent is having a conversation. Messages come in, tool calls go out, and intermediate results have to be remembered between turns. That is working memory, the scratch pad. Second, the agent needs to remember things across sessions: the user's name, their preferences, what happened last week. That is long-term memory, and it is a much harder problem than people expect.
Third, the agent does not know your data: your product catalog, your documentation, your customer history. So it has to find the right context and inject it into the prompt. That is retrieval. That is RAG. Fourth, the agent does not take one step. It takes many. Sometimes it talks to other agents, and any of that can fail halfway through. So you need durable state, task queues, and coordination underneath the whole thing.
Four problems. Every production agent has them, and most of the industry is still figuring out how to solve them cleanly. So let us take them one at a time. First one, working memory. Where does the agent's short-term state live while a conversation is happening? If you have only built toy agents, you might think this is trivial.
Keep the state in a Python dictionary. That works fine on your laptop, but it falls apart the moment your server restarts or a tool call times out mid-conversation. In production, working memory has to live outside your process.
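To make "outside your process" concrete, here is a minimal sketch of the idea, not any framework's API. A JSON file stands in for an external store such as Redis, and `FileStateStore` and `run_turn` are hypothetical names invented for this illustration. Each store object plays the role of a separate server process that starts with no in-memory history.

```python
import json
import os
import tempfile


class FileStateStore:
    """Hypothetical stand-in for an external state store such as Redis."""

    def __init__(self, path):
        self.path = path

    def _load_all(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def load(self, thread_id):
        # Unknown threads start with an empty message history.
        return self._load_all().get(thread_id, {"messages": []})

    def save(self, thread_id, state):
        data = self._load_all()
        data[thread_id] = state
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)


def run_turn(store, thread_id, user_message):
    """One agent step: load state, append the message, save state again."""
    state = store.load(thread_id)
    state["messages"].append(user_message)
    store.save(thread_id, state)
    return state


path = os.path.join(tempfile.mkdtemp(), "agent_state.json")

# First "process" handles one turn and then goes away.
run_turn(FileStateStore(path), "thread-1", "My order has not arrived.")

# Second "process" starts fresh but still sees the earlier turn.
state = run_turn(FileStateStore(path), "thread-1", "Can you check the status?")
print(len(state["messages"]))
```

Because the state is written after every step, a crash between turns loses nothing that was already saved. A real deployment would swap the file for a networked store so that several instances can share it.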
Working memory has to survive restarts. It has to be shared across instances, and it has to be fast, because every single turn of the conversation hits it. This is why modern agent frameworks have the idea of a checkpointer: a pluggable backend that saves the agent state at every single step. LangGraph has this built in. You can plug in Postgres, SQLite, anything you want. But for a store that gets hit on every tool call, you want something that responds in microseconds, not milliseconds.
And here is what it looks like with Redis. Two lines of real code plus the import. Bring in the Redis checkpointer, point it at your Redis instance, pass it into the graph. Every step of your agent is now saved and recoverable. If your pod dies mid-conversation, the next pod picks up exactly where the last one left off. And because it is Redis underneath, the reads and writes happen in under a millisecond. Your agent does not slow down because the state store became the bottleneck. That is working memory handled.
Long-term memory is where things get interesting, because most developers underestimate how hard this actually is.
The naive approach is to take every user message, embed it, and dump it into a vector database. That is the tutorial version. For a weekend project, it is fine. In production, it falls apart very fast, because memory is not just one thing. Users have facts, preferences, and past events, and the Redis Agent Memory Server handles all of that in the background. When your agent has a conversation, it goes into working memory first. In the background, the server extracts structured facts from that conversation. It pulls out topics, recognizes entities, summarizes long stretches, and checks for duplicates.
Then it writes those facts into long-term memory with vector embeddings so you can semantically search them later. From your code, the API is about as simple as it gets. You create a long-term memory with three fields: the text itself, a user ID to scope it, and a memory type that tells the server how to treat it. It could be a preference, a fact, an event, and so on. Notice the awaits: these operations go over the network to the memory server, so they are async by default. When you want to retrieve, you search by natural language. The server embeds your query, runs the vector search, filters by user ID so you only get this user's memories, and returns what is relevant. You did not write any embedding code. You did not manage any vector index. All of that is taken care of for you.
The other nice thing is that the server speaks MCP alongside its REST interface, which means you can plug it into Claude Desktop or Cursor and the agent itself manages its own memory without you writing any glue code. That handles long-term memory.
Now let us talk about the third problem: retrieval, or as most people call it, RAG, which stands for retrieval-augmented generation. The idea is simple. Your agent does not know your data: your product catalog, your docs, your internal knowledge base. RAG is how you give it that context at runtime. Here is the basic recipe. You take your documents, break them into chunks, and turn each chunk into an embedding.
An embedding is basically a list of numbers that captures the meaning of that text. You store those embeddings in a vector database. Then, when a question comes in, you embed the question the same way, find the chunks that are closest in meaning, and inject them into the prompt so the LLM can answer using your data.
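The recipe above can be sketched end to end in a few lines. This is a toy under stated assumptions: `embed` here is just lowercase word counts, which captures word overlap rather than meaning, and the in-memory `index` list stands in for a vector database. The function names and sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


documents = [
    "Refunds are issued within 14 days of the return being received.",
    "Standard shipping takes three to five business days.",
]

# Index time: chunk every document and store (chunk, embedding) pairs.
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(question, k=1):
    """Query time: embed the question and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]


question = "How long do refunds take?"
context = retrieve(question)

# Inject the retrieved chunks into the prompt that would go to the LLM.
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

Swapping `embed` for a real embedding model and `index` for a vector store keeps the same shape: chunk, embed, store, then embed the question and take the nearest chunks.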
And if you're new to RAG and want a deeper foundation, I have a few videos on my channel that walk through it from scratch. Links in the description. But for now, that is the tutorial version, and it works. Everyone ships that. But here is what I want you to notice. The quality gap between a mediocre RAG system and a great one does not come from that basic pipeline anymore.
Everyone has that basic pipeline. The gap comes from four things most tutorials skip. First, hybrid search.
Pure vector search is great at understanding meaning, but it struggles with exact matches. Think about it this way. If someone types in a specific product code, a person's name, or an order number, vector search has no idea what to do with that. It is looking for meaning, not an exact string. So you need to combine it with regular text search, the kind databases have done for decades. That combination is called hybrid search, and in production it makes a big difference.
Second, metadata filtering. When you are searching in production, you are almost never searching across everything. You are searching within a specific slice of your data. Maybe it is documents that belong to this particular user, or products within a certain price range, or support tickets from the last 30 days. If you do not add those filters, your search results will be all over the place, because you are pulling in results that are not even relevant to the person asking.
Third, re-ranking. Your initial vector search is fast, but it is not very precise. Think of it like a first round of shortlisting. You get 50 decent results, but not necessarily the best ones. A re-ranker is a second pass that looks at those 50 results more carefully and picks the actual best 10.
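These first three ideas fit together as one small pipeline: filter by metadata, score with a blend of vector and keyword similarity, then re-rank the shortlist. The sketch below is a library-free illustration. The documents, the two-dimensional vectors, the `alpha` weighting, and the token-overlap scorer used in place of a real re-ranking model are all assumptions made for the example.

```python
import math
import re

DOCS = [
    {"id": "a", "user": "u1", "text": "Order SKU-4471 shipped on Monday", "vec": [0.9, 0.1]},
    {"id": "b", "user": "u1", "text": "Your parcel is on its way", "vec": [0.8, 0.2]},
    {"id": "c", "user": "u2", "text": "Order SKU-4471 was cancelled", "vec": [0.9, 0.1]},
    {"id": "d", "user": "u1", "text": "Password reset instructions", "vec": [0.1, 0.9]},
]


def tokens(text):
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def keyword_score(query, text):
    """Fraction of query tokens that appear verbatim in the text."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q)


def hybrid_search(query, query_vec, user, k=2, alpha=0.5):
    # Metadata filter first: only this user's documents are candidates.
    candidates = [d for d in DOCS if d["user"] == user]
    # Hybrid score: weighted blend of vector similarity and keyword match.
    scored = [
        (alpha * cosine(query_vec, d["vec"]) + (1 - alpha) * keyword_score(query, d["text"]), d)
        for d in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]


def rerank(query, docs, score_fn, top_n=1):
    """Second pass over the shortlist. score_fn stands in for a re-ranking model."""
    return sorted(docs, key=lambda d: score_fn(query, d["text"]), reverse=True)[:top_n]


query = "status of SKU-4471"
shortlist = hybrid_search(query, query_vec=[1.0, 0.0], user="u1")
best = rerank(query, shortlist, score_fn=keyword_score)
print([d["id"] for d in shortlist], [d["id"] for d in best])
```

Document `c` matches the product code exactly but belongs to another user, so the filter removes it before any scoring happens. In a real system the second pass would usually be a dedicated re-ranking model rather than token overlap.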
Re-ranking improves quality more than switching to a bigger or more expensive embedding model, and most teams skip it entirely. Fourth is semantic caching.
Imagine two users asking "What is your refund policy?" and "How do I get my money back?" within a minute of each other. Those are different words, but they mean the same thing. Without caching, you're calling the LLM twice and paying for it twice. Semantic caching solves this by remembering not just the exact question, but the meaning behind it. So when a similar question comes in, it returns the saved answer instantly without touching the LLM at all. And we are actually building this out as a hands-on lab in this video, where you'll see exactly how Redis handles this using RedisVL.
Now, the vector database market is crowded. Pinecone, Chroma, Weaviate, Qdrant, Postgres with pgvector, Elastic. They all do vectors, and the choice between them is rarely about raw speed anymore. It's about operational simplicity. How much infrastructure are you willing to run just for retrieval?
And this is where Redis starts to look different, because if you're already running Redis, and most of us are, you probably do not need a separate vector database at all. Redis has a library called RedisVL. It stands for Redis Vector Library. It's a Python and Java client that turns your existing Redis into a first-class AI data layer. And it is built around three core primitives.
The first primitive is the schema. This is where you declare what your index looks like. Which fields are text? Which fields are tags? Which are numbers? Which are vectors? And how should the vector index be configured? You can write it in Python or in a YAML file.
And here is what a product index looks like.
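The on-screen schema is not captured in this transcript, so what follows is a sketch rather than the code from the video. It expresses the product index as a plain Python dictionary using the fields and vector settings the narration describes. The index name, key prefix, and `float32` datatype are assumptions, and the key layout follows the general shape of RedisVL schema dictionaries as I understand it, so check it against the RedisVL documentation before relying on it.

```python
# Sketch of a product index schema as a plain dictionary.
# "products", "product", and "float32" are assumed values for illustration.
product_schema = {
    "index": {"name": "products", "prefix": "product"},
    "fields": [
        {"name": "category", "type": "tag"},       # exact-match filtering
        {"name": "price", "type": "numeric"},      # range queries
        {"name": "description", "type": "text"},   # full-text search
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "algorithm": "hnsw",            # graph-based approximate nearest neighbor
                "distance_metric": "cosine",    # common choice for text embeddings
                "dims": 1536,                   # must match the embedding model output
                "datatype": "float32",
            },
        },
    ],
}

# Quick overview of the declared field types.
field_types = {f["name"]: f["type"] for f in product_schema["fields"]}
print(field_types)
```

The same structure maps naturally onto a YAML file, which is the other format the narration mentions.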
Read that top to bottom and the whole shape of the index is right there in front of you. You have a category tag for filtering, a price as a number so you can do range queries on it, a description as searchable text, and an embedding field that stores the vector.
Pay attention to the choices in the embedding block, because they matter. HNSW is the algorithm, a graph-based approximate nearest neighbor index. This is what you want in production for anything over a few thousand documents.
Cosine is the distance metric, which is the standard for text embeddings, and the dimensions are set to 1536, which is the output size of an OpenAI embedding model. You write the schema once, pass it to RedisVL, and the library creates the index in Redis for you. You never write raw FT.CREATE commands and you never think about Redis syntax. You describe what you want and RedisVL handles the translation.
The second primitive is the vectorizer.
This is the adapter layer that sits in front of all the embedding providers, so you have one consistent API no matter which provider you are using. Three lines: you create a vectorizer, pick a model, and call embed with some text.
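The three lines shown on screen are not part of this transcript. As a rough, library-free sketch of the adapter idea only, the code below defines one `embed` interface and two toy providers behind it. `Vectorizer`, `HashVectorizer`, `LengthVectorizer`, and `index_text` are hypothetical names and are not RedisVL classes, and the "embeddings" they produce carry no meaning.

```python
from __future__ import annotations

import hashlib
from typing import Protocol


class Vectorizer(Protocol):
    """The one interface calling code depends on."""

    def embed(self, text: str) -> list[float]: ...


class HashVectorizer:
    """Toy provider: deterministic numbers derived from a SHA-256 digest."""

    def __init__(self, dims: int = 8):
        self.dims = dims  # at most 32, the length of a SHA-256 digest

    def embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255 for b in digest[: self.dims]]


class LengthVectorizer:
    """A second toy provider with the same interface."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text)), float(len(text.split()))]


def index_text(vectorizer: Vectorizer, text: str) -> dict:
    # Calling code never needs to know which provider is behind embed().
    return {"text": text, "embedding": vectorizer.embed(text)}


record = index_text(HashVectorizer(dims=8), "trail running shoes")
print(len(record["embedding"]))
```

Switching providers means changing only the object passed to `index_text`, which is the point of putting an adapter in front of the embedding APIs.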
What you get back is a Python list of floats, 1,536 numbers long, which is the embedding. That is the vector you store in Redis, and that is the vector you search against at query time. The third primitive is the query builder. Look at what this code is actually doing. You're building a query by composing Python objects: a tag filter that says category has to equal footwear, and a numeric filter that says price has to be under 200. You combine them with Python's ampersand operator, which acts as a logical AND, and then you pass all that into a vector query along with the embedded search text.
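To show how composing filters with `&` can work in plain Python, here is a small stand-alone sketch using operator overloading. It is not RedisVL's implementation: `Filter`, `TagField`, `NumField`, and `vector_query` are hypothetical names, the products and their two-dimensional vectors are invented, and the search runs over an in-memory list instead of a Redis index.

```python
import math


class Filter:
    """A predicate over a document that can be combined with &."""

    def __init__(self, predicate, label):
        self.predicate = predicate
        self.label = label

    def __and__(self, other):
        return Filter(
            lambda doc: self.predicate(doc) and other.predicate(doc),
            f"({self.label} AND {other.label})",
        )

    def __call__(self, doc):
        return self.predicate(doc)


class TagField:
    def __init__(self, field):
        self.field = field

    def __eq__(self, value):
        # Comparison builds a Filter instead of returning a bool.
        return Filter(lambda doc: doc[self.field] == value, f"{self.field} == {value!r}")


class NumField:
    def __init__(self, field):
        self.field = field

    def __lt__(self, value):
        return Filter(lambda doc: doc[self.field] < value, f"{self.field} < {value}")


PRODUCTS = [
    {"name": "Trail runner", "category": "footwear", "price": 120, "vec": [0.9, 0.1]},
    {"name": "Leather boot", "category": "footwear", "price": 260, "vec": [0.95, 0.05]},
    {"name": "Rain jacket", "category": "outerwear", "price": 150, "vec": [0.2, 0.8]},
    {"name": "Canvas sneaker", "category": "footwear", "price": 60, "vec": [0.6, 0.4]},
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def vector_query(query_vec, filter_expression, num_results=3):
    """Keep only documents passing the filter, then rank by similarity."""
    matches = [p for p in PRODUCTS if filter_expression(p)]
    matches.sort(key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return matches[:num_results]


flt = (TagField("category") == "footwear") & (NumField("price") < 200)
results = vector_query([1.0, 0.0], flt)
print(flt.label)
print([p["name"] for p in results])
```

The leather boot is the closest vector to the query but costs more than 200, so it never reaches the ranking step.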
Remember the metadata filtering best practice we talked about a minute ago?
This is exactly how you do it in RedisVL. You never send a query without filters in production, and the query builder makes that easy rather than painful. You call index.query at the end, and what you get back is a list of matching documents, already ranked by vector similarity and already filtered down to the slice you asked for.
And here's the part that most vector libraries do not ship at all: semantic caching, built right in. Look at the flow here. Before you make an LLM call, you ask the cache, "Have I seen a question semantically close to this one before?" The distance threshold controls how close is close enough. 0.1 is tight. It only matches near-identical questions. A higher value is looser, and it will treat more questions as the same. You tune it for your traffic. If the cache hits, you return the stored answer instantly without calling the LLM at all. If it misses, you call the LLM like normal and store the result for next time. In a real application, this cuts LLM cost by somewhere between 20 and 60%, depending on how repetitive your traffic is, and it costs you nothing extra to run because you already have Redis.
Let me show you with a quick demo. In this naive setup, every user query goes straight to the LLM. Every single one. You ask something, it calls the model, you pay for it. Simple but expensive. The first improvement is a basic cache. A query comes in, you check if you have seen it before, you return the stored answer if yes, and you call the LLM if not. That works, but only for exact matches. Someone asks, "How do I get a refund?" and the next person asks, "What is your refund policy?" It misses. Two LLM calls when it should have been one. That is where RedisVL comes in. Now we convert the query into a vector embedding first, store that in Redis, and do a hybrid search on the way in: keyword plus vector similarity together. So if a semantically similar query already exists, we return the cached response instantly. No LLM call, no cost.
Let me show you this running on a real application. This is a ByteMonk store chatbot example app. Redis is running in Docker in the background. I ask, "Who is the founder of ByteMonk?" The answer comes back sourced from the LLM. It's the first time Redis has seen this, so it hits the model and stores the result.
Now I ask, "Who is the creator of ByteMonk?" Different words, but the same meaning. The source says Redis. It did not touch the LLM at all. It found the semantically similar question in the cache and returned the answer instantly.
And that is semantic caching in a real application: faster responses, lower cost, and no extra infrastructure, because you already have Redis. All the code for this is published on our GitHub with full setup instructions.
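For readers who want the check-then-store flow in code before opening the repository, here is a minimal in-memory sketch. It is not the demo application and not RedisVL's cache class. `ToySemanticCache`, `toy_embed`, `fake_llm`, the tiny vocabulary, and the synonym table are all assumptions made so the example runs without a model or a Redis server.

```python
import math
import re

VOCAB = ["founder", "refund", "policy", "shipping", "store"]
SYNONYMS = {"creator": "founder"}  # toy stand-in for what a real embedding model learns


def toy_embed(text):
    words = [SYNONYMS.get(w, w) for w in re.findall(r"[a-z]+", text.lower())]
    return [float(words.count(term)) for term in VOCAB]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class ToySemanticCache:
    def __init__(self, embed, distance_threshold=0.1):
        self.embed = embed
        self.distance_threshold = distance_threshold
        self.entries = []  # list of (embedding, response)

    def check(self, prompt):
        """Return the closest stored response within the threshold, else None."""
        query = self.embed(prompt)
        best = None
        for vec, response in self.entries:
            d = cosine_distance(query, vec)
            if d <= self.distance_threshold and (best is None or d < best[0]):
                best = (d, response)
        return None if best is None else best[1]

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))


cache = ToySemanticCache(toy_embed, distance_threshold=0.1)
llm_calls = []


def fake_llm(prompt):
    llm_calls.append(prompt)  # each entry represents one paid model call
    return f"(model answer to: {prompt})"


def answer(prompt):
    cached = cache.check(prompt)
    if cached is not None:
        return cached, "cache"
    response = fake_llm(prompt)
    cache.store(prompt, response)
    return response, "llm"


first = answer("Who is the founder of the store?")
second = answer("Who is the creator of the store?")
third = answer("What is the shipping policy?")
print(first[1], second[1], third[1], len(llm_calls))
```

Raising `distance_threshold` makes the cache treat more questions as equivalent, which raises the hit rate but also the risk of returning an answer to a question that only looks similar.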