This framework masterfully bridges the gap between transient context and long-term cognitive persistence through its sophisticated hierarchical memory and triple-stream retrieval. It offers a definitive, privacy-first blueprint for building truly evolving local AI agents.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
AgentMemory + Hermes Agent + Ollama = AI Agent That Never Forgets | Fully Local SetupAdded:
Most AI agents are amnesiac. Every time you start a new session, they have forgotten everything. The project you explained yesterday, the bug you fixed last week, the architectural decision you made last month, all gone. You end up spending the first few minutes of every session re-explaining context that the agent should already know. Today, we are going to check out this new tool called as agent memory, which promises to fix this permanently. We are going to set up this tool on our local system. We are going to integrate it with Hermes agent plus a local Ollama based model.
I'm going to go with Gwen 3.6. By the end of this video, your Hermes agent should remember everything across sessions with a real-time memory viewer, you can watch live as it builds up knowledge about your work. This is Fahad Mirza, and I welcome you to the channel.
Let's get right into this. I'm going to use this Ubuntu system. I have one GPU card Nvidia RTX 5600 with 48 GB of VRAM.
I have this Ollama model locally running. I'm going to use this Hermes one. And if you're looking to rent a GPU on very good price, you can find the link to Mass Compute in video's description with a discount coupon code of 50% for range of GPUs. Let me start by installing a Hermes agent. We already have Ollama. Hermes agent is getting started.
While it installs, let's talk bit more around this tool. So, agent memory is built on something called as triple I engine, as you can see here, and it gives your coding agent a four-tier memory system inspired by how the human brain actually consolidates knowledge.
Raw observations from every tool call get compressed into episodic summaries, then into semantic facts, and then into procedural patterns. It uses a triple stream retrieval system combining BM25 keyword search, vector similarity, and a knowledge graph fused together to get 95.2% retrieval accuracy on the long mem eval benchmark. It works with almost every agent out there, Claude, Code, Cursor, Open Claw, Open Code. And I'm going to use it with Hermes agent to check out how exactly the agent share the same memory server. So, if you have been building context in one agent, you should be able to build build it anywhere else, too.
Let's go back to my terminal, and if you're looking for AI updates, please follow me on X.
Okay, so Hermes agent is almost there.
And if you don't know what Hermes agent is and how to get it installed, what are the features, just go to my channel and search with Hermes agent, and you should be able to find heaps of videos around Hermes agent. As you can see, we have really really pounded it.
And I'm just going to do a very quick setup. It will just go with all the defaults.
And this is where you can set your model provider. One thing which is quite disappointing here is that they have Sorry, I'll just go back. They have simply removed that um model provider, which is Ollama, as you can see here. If you scroll down, they have just given the link to Ollama Cloud. There is Of course, I'll tell you how to configure the local Ollama. We already have done in few videos.
But maybe there's some sort of uh other arrangement, which is quite disappointing. There should be uh you know properly named Ollama local endpoint here.
So, I will just select this custom direct API because Ollama exposes um this custom endpoint open AI compatible, and this is the API base URL where my Ollama base models are running. I don't want to give it any key here.
And you can select number two here for chat completions.
And from here, I'm just going to go my first model, which is Hermes one.
And context length, you know, I already have set it.
And display name, you can just set anything you like.
And everything is now installed. I'm just going to source my bash profile.
And now we can install this agent memory globally.
And agent memory is installed. These warnings are harmless, really. I mean, you can just ignore them. They go from system to system. Now, you can simply start this agent memory by running this command.
And as soon as you run this, it asks you which harness or which agent you want to integrate. Of course, you want to go with Hermes. So, I'm using my arrow keys, and I am pressing enter here.
Which LLM provider?
And as we are using everything local, so I'm just going to select this skip BM25 only mode. BM25 is a keyword based search algorithm. In BM25 only mode, agent memory still captures and stores all memories from every session, but it skips the LLM compression and consolidation step, which means it won't be using any LLM to summarize and compress raw observations into structured facts. Retrieval still works.
Just keyword keyword based search rather than semantic vector search. Of course, for the The of this demo, it is fine, but if you are using it in production, maybe I would suggest go with some other model like OpenAI or Anthropic or even if you want to go with any open router based open, you know, other lab model, you can go with that.
And everything is ready. Now you can wire up your Hermes agent with agent memory by pressing enter.
And now you need to install this triple I agent which I was mentioning earlier.
And it is ready, which is good.
And I'm just going to skip this one because we don't really need any telemetry from triple I agent. We might do another one.
So you can see that server is running.
Everything looks good. You can let this one run and let me open another terminal.
And now let's do a very quick health check that agent memory is all good.
There you go.
So it has passed the health check. I have just done the curl command on to its endpoint, which means that it is fully running on port 3111. The double I engine is connected and 257 function as you can see all these functions are registered and ready.
Okay, so which is quite good. It has also given us a lot of information. Now we need to add this agent memory as an MCP or model context protocol server in Hermes config. This is just a single document. Let me show you.
So first up you would need to open this config.yml file in the Hermes home directory as you can see.
And then go to this line number 330 where memory is being configured.
So we have done two things here. First we have set agent memory as a memory provider as you can see line number 335.
And second we added it as an MCP server at the bottom which gives Hermes agent access to 43 memory tools like memory search, memory save, and memory recall all talking to our local agent memory server running on port 311. That is all I have done here.
Okay, so another thing you can do Now, let's open the agent memory viewer in our browser at port 3113 at localhost. There you go. So, this is a dashboard. You can see everything is at zero right now. No sessions, no memories, no lessons. That is about to change the moment we start chatting with Hermes. Now, let's go back to my terminal and fire up the Hermes agent.
I'm just going to say Hermes here and it is going to launch it first time.
And Hermes agent has started. This is, by the way, very impressive logo and stuff.
And I'll scroll down and our local model powered by Ollama is now detected.
Not only that, if you just scroll up a little bit, you will see that there should be some MCP server.
There you go. So, this is our MCP server with eight tools already present. So, it has detected our MCP server. Now, let me try to give it one prompt, just a coding task to generate some memory. I'm asking it to create me a simple Python Flask KPI to muse something to do something.
Let's wait for it.
And Hermes agent is writing some code, whereas um our agentic memory should refresh soon.
I'll just go back and see.
So, if you just go down here, it just takes 30 seconds to get refreshed.
And now one memory is saved, as you can see here. If you go to the memories tab here you can see that all the details are available. It has been saved with the title which gave it um, in coding prompt, tagged with Flask KPI, and also all the tags are there if you want to correlate.
And it is marked as a fact type with 70% strength at version one.
This memory will now persist across sessions. Next time you start a fresh Army session and ask about this project, it will already know what was built, where the file is, and how the endpoint works. No re-explaining, and that is exactly what agent memory is for.
So, for instance, you can simply just, you know, do new here and start a new session. And it has started a new session. You can just ask it like, uh, sorry, not this one. Maybe I can just ask it something around what Flask work we have done recently, and it should go to the memory and obtain that data.
And as soon as I have given it, you see it has done the recall, and it is telling me that this is what we have built at this path, and we have also verified and done all the stuff.
And, uh, it says that you want me to save the memory. I think the tool got triggered, and it has saved. And how good is that, really?
And even in this new session, you can just go and ask it to do a follow-up.
Maybe, you know, add something here, like a put update endpoint, which was not already there, and it should just create a new memory after doing all this work, and then you can access it from any other session.
And it has updated that put endpoint, and also saved it to our memory. If I just go back and go here, now we have two memories. Plus, if you go to this memories uh, tab and refresh it, it should also show you the other memory. I will just quickly go here.
Okay, so it is not updating here yet.
Maybe later.
Or maybe if I could expand that.
Or it might just have updated this one, but it's not showing it here anyway.
Maybe some bug in this UI cuz the memory is shown here. Anyway, there is also session tab or actions tab. There are a lot of things which you can do here, which I might make another video to go through these advanced topics.
Um because the one thing I really loved about it is that you know, it gives some of the memories at the timeline, too.
But when you have, you know, multiple data, multiple issues. One thing is that uh I'm we might be um we might need to use some OpenAI model or some Anthropic model for embedding and stuff because with the Llama model, there are some restrictions as we just saw. And only then most of these tabs will be fulfilled when we are actually doing any semantic search because that is where it really shines. But anyway, I think already it has done pretty well in terms of, you know, maintaining these memories across the board.
Let me know what do you think about this agent memory. I think it is quite promising.
Please follow me on X for any AI updates. And please become a member if you want to support the channel. Thank you for all the support.
Related Videos
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
🚀 BCS613C Compiler Design | Module 1 to 5 Schema Evaluation 🔥 | VTU 6th Sem 💯 #VTU #bcs613c #exam
Pranavaa-y4y
104 views•2026-06-02











