The discussion offers a necessary pivot from brute-force context windows to the nuanced engineering of "forgetting" as a core feature of intelligence. It provides a grounded exploration of how to balance persistent state with operational efficiency in modern agentic systems.
Deep Dive
GPT-Realtime-2, Directionally Bad and Agent Memory
All right. Hello.
Welcome to the stream.
Can everybody hear me?
Let me know. Let me know.
Hello.
Yeah, we're starting with this image by people.
Just absolutely terrifying. Can you hear me everybody?
I hear you. Sweet. All right. Welcome.
Welcome.
Let me know where you're watching from while we just take in the majesty of this image.
All right. What's up from Brazil, Thailand, Cairo, South Africa, Norway.
Awesome.
Welcome. We got a lot of people from all over the world. I love that. That's so cool. Romania, awesome. Chrome, Yousef from Belgium.
Luca from New York City, George from Philly, Cornell from Hungary. Awesome. Fox on the run, Germany. Dang, we got everybody from all over. Yeah, this uh this image is everything.
All right, Portland. Gareth, what's up?
Gareth, India, AI, Idorun.
Hopefully I'm pronouncing that right.
Yeah, so we had some good memes coming out of yesterday.
Uh, I did a stream yesterday about all the news with Dario and Elon working together.
That was a lot of fun. Uh, we're publishing a video about that right after the live stream is done. So, be sure to check that out. If you're here, like the stream, share the stream.
Really appreciate it.
from PA. Chris, hello.
Yeah, man. Uh, great memes from yesterday. Not only do we have that fantastic image, but we have potentially some of the coolest text messages I've ever seen. Directionally very bad.
Mira Murati. Did everybody see this already?
So, for those of you who need context, these are the text messages going back and forth between Sam Altman and Mira Murati during that very brief couple-day period in which Sam was getting fired from OpenAI. Actually, this was really the hour or two leading up to it. Um Mr. T. Morton will need some thoughts on SubQ also. There's just not much there.
Uh, bombastic business. First time catching your live stream. Welcome.
Yeah. So, look, imagine like try to try to put yourself in in Sam's shoes right now. You're sitting outside the boardroom.
You have Mira Murati and the rest of the board who just fired you. You're texting. You're saying, "What's going on? What's going on?" You have Satya Nadella texting you saying, "What's going on?" He's also texting the board, "What's going on?" And you're sitting out there furiously texting with Mira Murati. "Can you indicate directionally good or bad? Satya and others anxious."
Directionally very bad. Okay. Like I've been there. It is the most relatable text message I've ever seen. Um you know, you're like texting somebody, you know, maybe there's some bad news coming and then you get it directionally very bad. Okay. What are you supposed to do other than say, "Okay."
Um, all right. We've got a couple of other things to cover and then I'm going to bring on a very special guest. We have Richmond from Oracle, just an absolute master of agents and memory, and we're going to be talking about agent memory. There's a lot to talk about. We're going to get his thoughts on Anthropic's dream feature, and then he's going to tell us how he thinks about agent memory, how he gets the most out of it, and how you can get the most out of it. It's going to get a little technical. Ask questions in chat. That's what we're here for. That's what Richmond's here for. So, be sure to do that. Um okay. What happened? Uh Mr. BCA.
Yeah. Okay. So, just again, imagine, can you wrap up soon? Lots of pressure from Microsoft for an update. Sam, this is very bad. I I swear I've had these same text messages with people in the past.
Errol from London, what's up?
Uh he's like, can I come in? Just imagine imagine he's sitting out there alone like hunched over a bench. Uh it's just wild. It's just wild to picture.
You can really like put yourself in his mindset. They don't want you to. H. And you know, I think Sam actually like handled it pretty well.
I don't know if he knew that these text messages would become public one day, but he sure like he handled it with with grace. What do you want? What do you want to make it better? I'm still willing to just walk away if that helps.
If they are ramped up for crazy lawsuits against me, then I'm not sure what.
And yeah, so he's just going back and forth. Okay. So, you might have seen this right now, and we're going to cover this after Richmond comes on, but OpenAI just released GPT-Realtime-2. So, really, voice is getting a big upgrade. We'll cover this right after. This was just released.
Um I think it's time. Let's bring in our guest. Richmond, how you feeling? You ready to come in? Give me a thumbs up.
Yeah, buddy. All right, so let me bring Richmond on.
Oh, that didn't work.
Let's try that again. All right, Richmond, welcome. I just have to fix my camera.
>> Nice. Thanks for having me, Matt. And...
>> Hey, Richmond.
You can hear me. All right. All right.
Right.
>> I can hear you. Yeah. I just can't see myself. Let's see.
Nope.
Uh, sorry. Give me one moment. This is all because I switched video capture cards.
There we go. All right. So, we're gonna have to do it this way for now. Welcome, Richmond. Welcome. So Richmond works on developer experience at Oracle. Is that right?
>> Yeah. Yeah. Um, helping out with the developer experience at Oracle. The mission is to reach AI developers.
>> Yeah. Well, very cool. I first saw you, I think it was a while ago, maybe six to nine months ago, and I was watching a video that you had made about agent memory, and I was just blown away by your depth of knowledge and obviously your technical prowess as well, but also just how well and how simply you explained these very technical topics. So we've been trying to get together, and I hit you up earlier in the week and I was like, "Hey, you want to come on stream?" And you were just like, "Yeah, man. Let's do it." So, really appreciate it.
Strong echo. Got it. Okay. Working on it. Uh echo. Yep. I think Are you wearing headphones, Richmond?
No, you're not. Uh that might help. I see the echo might be uh on my end as well.
>> I'll throw my headphones on. Look, we're we're figuring it out live. That's what we do.
Exactly.
>> All right.
Give us a moment. We're figuring this out.
>> All right. So, is there still an echo?
Yeah, I don't hear any. Uh, chat, do you hear echo now? I hear you in my earphones.
>> I hear you. My...
>> "Hardware is hard," Mr. T. Morton.
Hardware is hard. AGI will not save us.
It's good. There we go. Sweet.
>> That's good.
>> All right, we're in. So, yeah. Thanks, man. I appreciate you joining the stream.
>> For sure. For sure. Like you said, we were meant to do this a while ago, but I'm just I'm glad we're here now.
>> Yeah. And I realize uh you want to share your screen today and I need to figure out how to let you do that. That's a whole new thing.
>> Well, I see a share screen button on my side.
>> Yeah.
>> So...
>> Uh, all right. Well, before we do that, since I already have mine sharing, I want to talk about the dream feature, because you are the right person to give your thoughts on it.
>> Yeah.
>> Right. Is it not here? Claude, come on.
There we go. Dreaming.
Um, okay. Let me pull up the blog post and I'll just read a little bit about the dream feature. Marmar Labs, hey, what's going on? Uh, that's Claude Managed Agents.
All right. So, "Today we're launching Dreaming in Claude Managed Agents as a research preview. Dreaming extends memory by reviewing past sessions to find patterns and help agents self-improve." I think just the name "dreaming" makes it sound really, really cool. But Richmond, tell me your initial thoughts. What do you think this feature is actually doing? And why would they do this?
>> Yeah, so I haven't done a massive deep dive into this feature, but the name "dreaming" is actually good, because one thing I find is that creating human parallels in the space of agent memory helps people understand what's going on. So, dreaming: humans dream as a way to consolidate our memory. I'm not a neuroscientist, so if there are any neuroscientists in the chat, you can put in a better description of why we dream. But the dreaming mechanism in humans is important for us to be able to remember information that we've experienced during the day, and it helps our brain, I guess, flush out other sorts of waste. In the technical sense, I think what they're doing is consolidating memory: reinforcing some of the memory signals they have stored, maybe during a session you're having with Claude itself, and also forgetting some of the previous information that might not be as useful. What that will allow is that next time you're using Claude, maybe after it's woken up or in a future session, you will start to realize it has better memory, or can surface information much quicker, and it starts to learn some of your patterns. In fact, this is not new, right? I think a couple of years ago, the Letta, uh, MemGPT guys, the guys that wrote the MemGPT paper, Sarah and Charles, they released a paper called "Sleep-time Compute."
>> Yeah.
>> Um, which... you got the paper up. Nice. Um, which is kind of a similar concept. And if there's anyone you want to be following in the space of agent memory, it would be these guys, right? These guys are probably a solid year ahead of the space. So they released sleep-time compute. What year was it? Does it say the year of the...
>> Yeah, this was just about one year ago, actually.
>> Yeah, exactly. So these guys are literally a year ahead. They really put out some good stuff. So I think that's what they're doing there: reconsolidating memories, resolving conflicts, surfacing patterns, so that the next time you interact with Claude, it will remember and surface information much quicker.
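To make the consolidation pass Richmond describes a bit more concrete (reinforce repeated memories, resolve conflicts, forget weak ones), here is a minimal Python sketch. It is not Anthropic's or Letta's implementation; the `Memory` fields, the `consolidate` function, and the `decay` and `forget_below` values are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str          # what the memory is about, e.g. "editor"
    value: str        # the remembered content
    strength: float   # how strongly the memory has been reinforced
    timestamp: int    # when it was last observed (larger = newer)


def consolidate(memories, decay=0.5, forget_below=0.3):
    """Offline pass: merge repeats, resolve conflicts, forget weak memories."""
    by_key = {}
    # Walk observations oldest-first so newer ones can override older ones.
    for m in sorted(memories, key=lambda m: m.timestamp):
        current = by_key.get(m.key)
        if current is not None and current.value == m.value:
            # Same fact seen again: reinforce it.
            current.strength += m.strength
            current.timestamp = m.timestamp
        else:
            # First sighting, or a conflict: the newer observation wins.
            by_key[m.key] = Memory(m.key, m.value, m.strength, m.timestamp)

    kept = []
    for m in by_key.values():
        m.strength *= decay              # everything fades a little each pass
        if m.strength >= forget_below:   # weak memories are forgotten
            kept.append(m)
    return kept


session_log = [
    Memory("lang", "Python", 0.5, 1),
    Memory("lang", "Python", 0.5, 2),    # repeated, so it gets reinforced
    Memory("editor", "vim", 0.4, 1),
    Memory("editor", "emacs", 0.8, 3),   # conflicts with the older "vim"
    Memory("snack", "chips", 0.2, 1),    # weak, so it gets forgotten
]
consolidated = consolidate(session_log)
```

The point of running this offline is the one made in the conversation: the agent starts the next session with a smaller, cleaner set of memories instead of the raw log.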
>> Yeah, it's super interesting. I don't think I appreciated that they were actually over a year ahead of what the industry is doing. Yeah. I mean, if you haven't seen Letta, go check them out. Friends of the channel as well.
Charles is awesome. Sarah's awesome. Um, there are a few benefits. Let me actually switch back to this page. So I think there are a few benefits here, and Richmond, you tell me what you think. One is, look, until literally yesterday Anthropic has been compute constrained. And if there was a way to offer better quality responses with fewer tokens, they would take it.
And I actually think this is a way to do that because you're consolidating all of these memories. You're cleaning them up.
The agent self-improves in a way. The agent knows what you want more cleanly.
And what does that mean? That means that when you ask it a question during the day, it's going to be able to answer your question more precisely with fewer tokens. So essentially what they're doing is almost offloading some of the compute demand to non-peak hours overnight, right? They call it dreaming. The branding is, as you said, absolutely brilliant. So they offload some of that compute to non-peak hours during the middle of the night, and then all of a sudden, when you go to use it during peak hours, it's more effective with fewer tokens. Is that kind of your read as well?
>> Yeah, that is my read, and it's all about operational cost, right? For them, and also for us as AI developers, reduction of operational cost is where the field is heading, and agent memory solves it. I've been in the space talking about agent memory for over three years now, and it's just such an obvious problem to solve for every player in the AI stack. I see some folks in your comment section mentioning Hindsight. So Hindsight is another key player in the agent memory space. They have a paper out, and it's built by Vectorize. The CEO, Chris, and I jumped on a call a few weeks ago to speak about agent memory, and we're cooking up some things with Hindsight as well over on the Oracle AI database side. So we can definitely talk about a lot of things. But yeah, it's all about saving operational cost on the inference side, and on the actual model performance side as well. And the human analogy helps people understand what we're actually doing. And it's not just Anthropic, right? OpenAI actually did something on memory this week as well. They improved the UI so users can upvote or downvote particular memories that are used as sources to answer questions.
So I think OpenAI is leaning more on human feedback to improve what I call memory units, which are the smallest atomic units of information that an agent can use to remember. So you have OpenAI using this human feedback of upvoting, downvoting and correcting to improve memory itself for its users.
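As a small illustration of the idea above, here is a hedged Python sketch of upvotes and downvotes acting as a reinforcement signal on individual memory units. OpenAI's actual mechanism is not described in the conversation; `MemoryUnit`, `apply_feedback`, `usable`, the `step` size and the `threshold` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemoryUnit:
    text: str            # smallest piece of information the agent remembers
    score: float = 1.0   # hypothetical retrieval weight, adjusted by feedback


def apply_feedback(unit, vote, step=0.5):
    """Reinforce (+1) or weaken (-1) a memory unit based on a user's vote."""
    if vote not in (1, -1):
        raise ValueError("vote must be +1 (relevant) or -1 (not relevant)")
    unit.score = max(0.0, unit.score + vote * step)
    return unit


def usable(units, threshold=0.5):
    """Only memory units that still score above the threshold get used."""
    return [u for u in units if u.score >= threshold]


units = [MemoryUnit("prefers Python"), MemoryUnit("lives in Paris")]
apply_feedback(units[1], -1)   # user marks the second memory as not relevant
apply_feedback(units[1], -1)   # and again, so it drops out of use
still_used = [u.text for u in usable(units)]
```

A downvoted memory is not deleted here, only down-weighted, so a later upvote or correction can bring it back.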
>> And where where is this again? Where are you seeing this?
>> Yeah, if you go on OpenAI's X page, let me see if I can get it up. They just put this out a few days ago. So let me see.
ChatGPT.
>> Yeah, let's go.
>> Let's see.
>> Let's see.
>> GPT instant.
>> Image 2.0.
>> I think it's on that thread with the ChatGPT instant. Is it?
>> Okay, let's take a look.
>> Let's see.
Yeah, here we go. Better memory, more personalized. It's so it's in here somewhere.
>> Memory sources.
>> Yeah, memory sources. That's it.
>> Okay.
>> Um...
>> Oh, I see. Right. Relevant, non-memory, not relevant. Talk a little bit about what they're doing here, what we're seeing in the screenshot.
>> Yeah. So, I don't know about the mechanisms, right? But what they're doing here is very simple. GPT is probably consolidating memories over your interactions and using them to answer questions in the moment, but there needs to be a way to say, "Oh, this memory you're using is bad, don't use it again," or, "It's good, use it again." It's a reinforcing mechanism for memory, for GPT or for the LLM itself, which is going to improve the experience and the personalization of GPT for the end user. But again, this is not new, because we've seen this mechanism, I think, three years back in a Stanford paper, the simulacra one, "Generative Agents."
>> Oh yeah. Oh, I made a video about that.
Human simulacra. Yeah. The simulacra paper. That was so good.
>> Uh, here, this one.
>> Joon Park. Yep.
>> Bam. Th this one this one changed my life. I I read this maybe a dozen times, Richmond. Like I this was one of the most fascinating papers that I ever read. Actually, one of the first papers I ever reviewed on the channel.
>> Oh wow. Wow. Yeah. It's such a good paper. When I met Andrew Ng for the first time, it's one of the first papers we both spoke about, right? This was over three years ago when I met him. We did a couple of courses together and we were just nerding out on this paper. I think this paper has all the foundational work that we see in a lot of memory implementations today, right? Including this forgetting mechanism.
>> Totally.
>> Yeah. So it's a good paper, and it's still relevant today. If you haven't read it, you should go read it.
Again, you're looking at three years ago, or maybe four years ago now, I don't even know, but it's so relevant today.
>> Yeah, absolutely. Um, just imagine... Okay, so look at the date, first of all. August 6, 2023, not too long after ChatGPT really blew up. And so what they did was they took a thousand agents, gave them all personalities, gave them all memory, and dropped them in this little simulated town and allowed them to basically live life, and there were some really interesting emergent properties that came from it. Like, they would form relationships. I think one example, I mean, maybe it doesn't seem as mind-blowing today, but at the time it was: one agent, it was their birthday, so they invited their friend to their birthday party, and their friend asked if it was okay if they could bring another friend. This is all very emergent, humanlike behavior, which is so interesting.
>> Matt, I have to agree with you, which is, it might not sound exciting today, but it was so exciting at the time. I was like, wow.
>> Obviously, memory is where it's at. And the way I describe it to people today: this was Moltbook before Moltbook.
>> Yes. Oh, I didn't put that together. That's such a good call. Yes.
>> Totally. Uh, okay, so, we were talking right before we started the stream, and you had some thoughts as to: is human in the loop, a human grading, human as a judge, going to be persistent? Is it going to stick around? Because it just seems like that's probably the slow part of this entire process here.
>> Yeah, in terms of agent memory, I do think the ideal is not to have this human feedback loop.
You want it to be automatic. In the short term, we are going to need this feedback mechanism, especially for the frontier labs. They're probably going to use this feedback mechanism as a way to retrain some of the future models to handle memory better, right? So, in the short term, we're probably going to see it. In the long term, it's going to go away, and it's probably going to go away very quickly. I teach a lot of people about agent memory, and you can get some good performance without necessarily having human feedback in the system. But for OpenAI, where they're serving hundreds of millions of customers a week, they can't afford to make, I guess, mistakes. They made one previously, maybe last year, with the sycophancy.
It's such a hard word to say, but I don't know if you remember when OpenAI had this minor blunder with their personality implementation, with...
>> They had to roll it back. Yeah, I remember that.
>> Roll it back.
>> That's memory, right? That's a memory problem.
>> Right. The person...
>> Sorry, go on.
>> No, no, no. You please continue. Please continue.
>> Yeah, I was going to say that's a memory problem, right? Because one thing I teach around agent memory, and I'm saying the words "I teach" a lot, but one of my latest courses is with Andrew Ng on DeepLearning.AI, and we touch on all of these things you're seeing in the space today. So you can go take the course and you'll be clued up. Is there a place I can share a link or something?
>> Yeah. Yeah. Yeah. Yeah. Um let's see. if you drop it in our chat, I'll I'll have uh Brian drop it in uh the uh the the public chat and then um Richmond, as we're going through this, where can people find you? Where where like if they want to check you out on X, if they want to check you out other places, what's the right place?
>> Yeah. Um, you can just... Richmond al, my name is my handle on X, and you can find me under the same on LinkedIn. I'll put the links as well to be shared. So I've just shared the link to the actual course where we touch on everything memory, and I've put the links to my social media profiles.
>> Um...
>> Yeah, so memory is very important, and it's easy to understand. That's the best thing about agent memory. When I say the word "memory," my mom understands it, my grandma understands it, right? So it's easy to understand, and it brings everyone together when we talk about what we're trying to do in the space of AI today. We're seeing a lot of frontier labs actually solving this, but we're seeing a lot of solo developers solving memory as well. I think there's a level playing field in agent memory, where you can have someone just build a solution, raise several millions and go to market, or you have frontier labs actually solving this live and trying to put something out there. But it's not solved. Memory is not solved.
>> Yeah. Um, okay. So, I just dropped your X link. I'm also going to drop the DeepLearning.AI course. Go check it out. I believe it's free, right?
>> Yeah. Yeah, it's free. Deep learning is free.
>> Yep. All right. Actually, Fox on the Run had a good question. I would love for you to address this, so let me read it: "Honest question. Isn't dreaming just admitting transformers can't hold state? So we bolt on a pass over logs and call it learning. When does the field move to substrates where memory and compute fuse?" I think this is a common critique of current large language models, of current AI architecture, so I want to get your thoughts. Do you think that's true?
>> Uh, yeah, that is definitely true. A lot of the solutions we see around agent memory today are not touching the actual model weights and parameters, right? We're not changing the internal state of the LLM. It's more of a bolt-on.
>> Right.
>> That's true. And to be honest, I would argue, and I'm still trying to figure this out in real time, that that can be enough. Is it enough? Will it get us to where we want to get to? We're yet to see. But one thing I would say is this is where you go into the field of continual learning, which is not a new field in computer science, by the way. Continual learning is an age-old field within computer science, but continual learning is where you probably want to be getting to.
We're looking at how we can change the internal state of a neural network so it can represent new information, and do that continuously, in a system that is meant to operate in real time. Right. Yeah. Um, but in addition to this, yeah, continual learning is what we're trying to solve. And Dario did a podcast with Dwarkesh. I don't know how long ago it was.
I think it was this year.
>> Yeah. Yeah. It was a couple months ago.
Yeah.
>> And he said that just with the in-context learning capabilities of these LLMs, we can get to a trillion or multiple trillions in terms of economic value.
>> Yeah.
>> So his opinion was that we don't need this continual learning paradigm to realize these gains and this usefulness from LLMs. But again, in-context learning, or actually changing the state within the model's parameters: I'm still trying to figure out where things are going to lie, and I think I'm in the best place to figure this out. I'm at Oracle, and at Oracle we're in the arena. We're experimenting with agent memory every single day. We talk about it. We put things out there at the frontier. We released a package literally last week that helps implement agent memory in most of the agents people are building.
This is a Python package but we're also exploring continual learning with um some of the compute we have and some of the models we have on our on our infrastructure.
>> Yeah. And Richmond, you're going to share some things and we'll get to that in a second, but I want to comment on this because it's a really interesting topic. To reach AGI, or to have that next step-function improvement in AI, is it required to actually evolve the weights, meaning actually change the weights in real time? Continual learning, as you said.
>> It's interesting, because I think you're right, and obviously Dario made the prediction that we can get trillions of dollars in economic value simply by building a harness around it, allowing it to have memory.
That's without ever changing the weights. Static weights. I think that's probably it. And then, if you just look at the weights themselves, maybe that's just like the engine of a car, and you still need the car around it to get the horsepower to the road and drive somewhere, right? So maybe the model weights are never going to be enough and you need the harness around it. You need memory. You need tools.
Uh, I like I I keep flip-flopping too.
I'm not sure. It just seems like, with all the progress that we've seen, especially around harness engineering in the last six months, we're going to be able to get tremendous value out of the current architecture of these models.
>> I'm like you. I'm not sure. The only way I can be sure is just by being in the arena, Matt. That is the only way I can be sure. So I'm learning and I'm teaching and I'm experimenting in real time. I'm going to be in the Bay Area next week, and I'm going to be sitting down with a bunch of developers specifically on continual learning: what's working, what's not working, what the perspective of developers is in the space of agent memory. I can share a link, but it's more of a private event, so I don't want it to be oversubscribed.
>> Cool. Um, okay. Uh, one one quick thing.
So, this is actually from Darrett: "LoRA-based memory first." This is actually something that I've seen. I read a few research papers where they were doing real-time LoRA, basically modifying a part of the model to essentially update it in real time. Have you thought about this at all?
>> Yeah, I thought about it before this current wave of LLMs. My background is in deep learning and machine learning at the academic level. We used to have this thing called feature stores, right? I don't know if you know what I'm talking about, but a feature store, in the traditional machine learning sense, was this interim data layer where you could keep high-signal information that you could use to fine-tune machine learning models that were in production, right?
>> And we had this paradigm a few years ago. The reason we're not seeing that today is that LLMs are massive in terms of model weights and parameters, and you can't fine-tune them in real time. Obviously, LoRA can be a technique that gets you that sort of reduction in latency when fine-tuning these LLMs in production. I have seen it, and it's one of the directions that we're looking at experimenting with. If you follow me on LinkedIn or on Twitter, or you follow Oracle Developer, you're going to see something from us in a few months. Like I said, we're definitely cooking.
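For readers who have not seen LoRA before, the core idea is that the large weight matrix W stays frozen and only a small low-rank update is trained, giving an effective weight of W + (alpha / r) * B A, where r is the rank. Here is a minimal pure-Python sketch of that arithmetic. The function names and the tiny matrices are made up for illustration, and this is not how any particular library or the real-time approaches mentioned above implement it.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, b, a, alpha=1.0):
    """Return W + (alpha / r) * B @ A, where r is the rank of the update.

    w: frozen base weights, shape (d_out, d_in)
    b: trainable matrix, shape (d_out, r)
    a: trainable matrix, shape (r, d_in)
    """
    r = len(a)                # rank = number of rows in A
    delta = matmul(b, a)      # low-rank update, same shape as W
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy example: a 2x2 base matrix with a rank-1 update.
W = [[1, 0], [0, 1]]
B = [[1], [2]]
A = [[3, 4]]
W_adapted = lora_effective_weight(W, B, A)
```

With a d_out x d_in base matrix, the update only has r * (d_out + d_in) trainable numbers instead of d_out * d_in, which is why it is a candidate for cheaper, more frequent updates.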
>> Okay. Uh Richmond, let's I'm going to say sorry in advance. We're trying to I'm going to try to make this screen share work in real time. Let's just see if it works. Let's figure it out together. Go ahead and uh share your screen. Let's see what happens.
>> Excellent. So, let me do this.
>> You You would think I would be more prepared, but um you would be wrong.
>> No, it's all good. Uh yeah, let's see this. So...
>> Okay, hold on. Oh, okay. I think I got this.
>> Can you see the side?
>> I got that. Let's see.
Solo PFP.
Oh, hold on. I'm working on it.
>> Nice.
>> Yeah. So, I got this. Oh, there we go.
All right, you're on screen. Let's see if I can get me on screen as well. There we are. Look at that.
>> Nice. Nice.
>> All right, walk us through it, Richmond.
>> We don't need AGI. Um, let's go. So, I'm just going to share a bit. Whenever I talk to folks, I don't like to assume the knowledge they have, so I'm going to say a few things that might be a bit basic, but that's just building up our knowledge. Then I'll show you a couple of links. I'm going to show you guys this Oracle agent memory package that we released, and there was a blog post talking about how we actually perform on benchmarks like LongMemEval.
I have a demo application over here. Then I also have a GitHub repository that the team works on literally every single day, and we have a section dedicated to agent memory, with several notebooks where you can get a deep dive and understand what this whole space is about. So...
>> Zoom out for a second. How should people be viewing this stuff? Is it something they can use day to day? Or is it just, hey, this is how it works behind the scenes?
>> Yeah. So this is really behind the scenes, right? This Oracle AI developer hub repo is behind the scenes. If you're a hands-on AI developer, you can go to this repo. You can look at notebooks that touch on specific topics, or you can actually get full apps. Right?
So this application I'm demoing actually lives right here, in this finance AI agent demo. So you can just get full apps. The team is building full apps in this repo. But if you're a developer and you don't want to concern yourself with all of the nuances of implementing memory, the team here released this package called Oracle agent memory just to solve it. It's a Python implementation. We'll probably look at other languages to provide this in. So this is how you should look at it. But I want to go into what agent memory is and take your audience to a place where they can understand what's going to happen in this space within the next, let's say, six months.
>> How does that sound?
>> Let's do it.
>> Let's do it. Let's go. So let's talk about the maturity of AI applications over the last three and a half years or so. We started off with LLM chatbots, right? ChatGPT came onto the scene and we had this incredible world compressor, which is the LLM behind a chat interface, and you could ask it several questions. Then we found out that was nice for a few minutes, but if you want to get real value, you need to provide domain-specific data. So we quickly moved into the world of RAG-based applications. RAG is retrieval-augmented generation, which basically means I'm giving a prompt to an LLM, but I'm going to supplement it with actual data that is going to help it answer better. This could be data personalized to me, or context about the problem domain I'm trying to solve. But today we've moved away from RAG, right? We're still using RAG, but in a more evolved, enhanced version. Today we're in this space of automation and autonomy. Automation is essentially just LLM-driven workflows, and autonomy is you give an AI a bunch of tools and you tell it to go wild. Most people want autonomy, but what they need is automation, right? With a lot of the customers we work with, it's more automation: more workflows are automated. You understand the step-by-step process of whatever task you're doing, you look for ways to automate it, and you have an LLM involved in the process to make some minor decisions, as opposed to... Sorry, I think you have a question.
>> No, I was just going to say, for anybody in chat, if you have a question, if you want to ask Richmond anything, if you want any clarity on any of the things we're talking about, just drop questions. I'll bring them up on screen and I'll prompt you, Richmond, to try to answer them.
>> Exactly. Okay. Thanks a lot. So, automation and autonomy is where we're at.
I'll go a little bit quick. There's a lot going on on the screen, but this is just showing retrieval-augmented generation, right? We have this data ingestion process. There's an embedding model, which is able to convert data into the form that enables the things we're seeing around vector search or semantic search. You could pass an image into an embedding model and you get this numerical representation over here.
>> Crazy graph.
>> It's crazy, yeah, but I actually do break it down in the course. For the sake of time I can't here. But one thing is you could have the data that you've extracted projected in different forms. You can have it in a JSON form, as some form of knowledge graph, or as a vector representation. Then you can store this in a database, retrieve it, and pass information into an LLM when you're trying to answer questions from your user through the user query. So the main thing is this ingestion process, and then passing through an embedding model. I have seen some anti-patterns where people have a bunch of databases within the AI infrastructure, one for each of these representations. That is an anti-pattern, because what I am really focused on is reducing the cognitive load for both the agents and the developers. I work with a lot of developers on the front line building AI systems, and they do a lot, having to wrangle a bunch of system components and a bunch of different databases. I have to come in and tell them: look, you don't want to be doing all of this. This is an anti-pattern. Use one database. As a human, you only have one brain. You don't have five brains for the different types of information you want to store. There's one brain. You probably need just one database that gives you all of these retrieval strategies, and that would be the Oracle AI database.
But that's RAG. It's not too complicated to understand or even implement. The key takeaway here is that the information your agents will encounter today has this heterogeneous nature, and you need different ways to retrieve it and pass it to your agents.
That's the key takeaway. So...
>> Different form factors. Go on.
>> Richmond, I've been thinking a lot about this. One thing that we're trying to implement at Forward Future is I want to take all of our data sources.
So whether that's Slack messages, email, our videos and our video data, Asana, Notion, our internal notes, I want to throw them all into a single ocean of data and allow my agents to work over it. Not only reactively, if I prompt it, "Hey, answer this question about the data."
That's fine, but I want it to proactively look over that. Is what you're describing potentially a solution to do that?
>> Yeah, for sure. Because the data that you're ingesting can be represented in different ways, right? You have a lot of data sources being ingested into a centralized repository. What you need is a way to then surface information, and there are different ways you can do that. You can surface information by semantics, which is meaning. You can surface information by relational structure. Or you can surface information by the connections from one piece of information to another, which is where you start to have knowledge graphs. So there isn't just one way to surface information. Having all of these techniques in one system can help you surface information in the right way, and it makes your AI infrastructure more consolidated and more efficient for your agents and your developers to reason about.
>> Is this assuming I already have all my data in a database, or... yeah, go on.
>> You can actually ingest data. This is not assuming you have all of your data in place. You can definitely have all of your data in a database already, but a database typically can have an ingestion pipeline, right? There are connectors you can have to a database that will ingest this data. You can build a system that does it automatically from whatever sources you have. So if I go back to my diagram, I have these different data sources here, right? It's a representation of different data sources, and you just ingest that. You would have to maybe engineer an ingestion pipeline, or use a third-party provider that helps consolidate this, and then put that into a database like the Oracle AI database, which will help you surface intelligence with different retrieval mechanisms.
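To make the "one store, several ways to surface information" idea concrete, here is a minimal sketch. It is not the Oracle AI database API. `UnifiedStore`, `embed`, and the three retrieval methods are hypothetical names, and the bag-of-words "embedding" is only a stand-in for a real embedding model.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding (assumption): word counts stand in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class UnifiedStore:
    """Hypothetical single store that keeps text, attributes, vectors, and links together."""

    def __init__(self):
        self.records = []

    def ingest(self, text, source, links=()):
        self.records.append(
            {"text": text, "source": source, "vector": embed(text), "links": set(links)}
        )
        return len(self.records) - 1  # record id

    def by_meaning(self, query, k=2):
        # Semantic retrieval: rank records by vector similarity to the query.
        q = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r["vector"]), reverse=True)
        return [r["text"] for r in ranked[:k]]

    def by_attribute(self, source):
        # Relational-style retrieval: filter on a stored attribute.
        return [r["text"] for r in self.records if r["source"] == source]

    def by_connection(self, record_id):
        # Graph-style retrieval: follow explicit links from one record to others.
        return [self.records[i]["text"] for i in sorted(self.records[record_id]["links"])]

store = UnifiedStore()
report = store.ingest("quarterly revenue grew in the finance report", "notion")
standup = store.ingest("standup moved to ten am tomorrow", "slack")
followup = store.ingest("follow up on the finance report revenue numbers", "email", links=[report])

print(store.by_meaning("finance revenue report", k=1))
print(store.by_attribute("slack"))
print(store.by_connection(followup))
```

The point of the sketch is only that the three retrieval modes read from the same records, so neither the agent nor the developer has to reason about several separate systems.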
Okay, so just to carry on, in this world of agents, and I'm going to move a bit fast: agents have main components, which are the ability to reason, perceive the environment, use tools, and have memory.
And once upon a time, I used to see definitions of agents that didn't include memory. It was such an obvious thing to me that you need memory.
Earlier definitions of agents just didn't include it.
So, just moving on, we have things such as this agent loop, right? I'm sure people have heard about it. It's the ability for your agent to take in context, reason about the context, and decide on the next action. Then, by observing the result, it can decide on the next action again, maybe by calling tools or maybe by exiting the loop, right? This is what we call the agent loop, and you can go through several iterations of it to build up the context and execute different actions. This paradigm is very powerful. But the key thing is, as your context window starts to grow, one thing we observed is that we needed to actually engineer the information within the context window.
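The agent loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `run_agent`, `scripted_policy`, and the `add` tool are hypothetical, and the policy is a hard-coded stand-in for what would be an LLM call in a real system.

```python
def run_agent(task, decide, tools, max_steps=5):
    """Take in context, decide on an action, act, observe, and repeat until exit."""
    context = [("user", task)]
    for _ in range(max_steps):
        action = decide(context)  # stand-in for the LLM reasoning step
        if action["type"] == "finish":
            return action["answer"], context
        result = tools[action["tool"]](**action["args"])
        # The observation is appended, so the context grows with every iteration.
        context.append(("tool:" + action["tool"], str(result)))
    return None, context  # step budget exhausted without an answer

def scripted_policy(context):
    # Hypothetical policy: call the calculator once, then exit the loop.
    observations = [content for role, content in context if role.startswith("tool:")]
    if not observations:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": "The sum is " + observations[-1]}

answer, trace = run_agent("What is 2 + 3?", scripted_policy, {"add": lambda a, b: a + b})
print(answer)
print(len(trace), "items in context")
```

Note that every tool result is appended to `context`. That unchecked growth is what motivates engineering what goes into the window.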
So we had to think about the signal-to-noise ratio of the information that we're putting into our context window. Anthropic have written a couple of excellent blog pieces on context engineering. There are some that have come from folks like Cognition and LangChain, and a few folks have written a lot on context engineering and how to engineer information within the context window. So there are a few techniques, and I do go into them, but for the sake of time I'll just talk about them at a very high level. One of them is context window utilization.
Your LLM needs to be aware of its own limitations, and one of those limitations is its context window capacity. There is no way your LLM can manage its context window if it doesn't know its current utilization and its capacity. So when you're using Claude Code, you probably see a percentage value of how much context window it has left before it starts to auto-compact, right? And that leads me to another context engineering technique, which is reduction, or compaction, of information within your context window, which most of these tools actually do. That is just reducing the information in the context window so that you can continue to use the context window without exhausting it to its limits or causing issues like context bloat or context rot. So you have this auto-compaction functionality within Claude Code, where Claude automatically compresses information within the context window, leaves a summary within it, and then you're able to continue with the conversation you're having. We talk about how you can implement this yourself in some of the courses and some of the tools we release, because it's not that difficult to implement. The difficulty is in making sure that you have a high-signal representation in the summary that you've generated from your compressed information. And then there is context offloading and context retrieval. I touch on all of this in several places on my social profiles and in courses I teach, but that's context engineering. Any questions on that, Matt? Does your audience have any questions?
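As a rough illustration of utilization tracking plus compaction, here is a minimal sketch. It does not reproduce how Claude Code or any other tool implements auto-compaction. `ContextWindow`, the 80% threshold, the word-count tokenizer, and `placeholder_summary` are all assumptions made for the example.

```python
def count_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

class ContextWindow:
    """Hypothetical window that knows its capacity and compacts itself when nearly full."""

    def __init__(self, capacity, compact_at=0.8, keep_recent=2):
        self.capacity = capacity
        self.compact_at = compact_at
        self.keep_recent = keep_recent
        self.messages = []

    def utilization(self):
        return sum(count_tokens(m) for m in self.messages) / self.capacity

    def add(self, message, summarize):
        self.messages.append(message)
        if self.utilization() >= self.compact_at and len(self.messages) > self.keep_recent:
            old = self.messages[:-self.keep_recent]
            recent = self.messages[-self.keep_recent:]
            # Replace older messages with one summary and keep recent turns verbatim.
            self.messages = ["[summary] " + summarize(old)] + recent

def placeholder_summary(messages):
    # Placeholder only: a real system would ask an LLM for a high-signal summary.
    return f"{len(messages)} earlier messages condensed"

window = ContextWindow(capacity=40)
for i in range(10):
    window.add(f"turn {i} user asked about topic number {i}", placeholder_summary)

print(window.messages)
print(round(window.utilization(), 3))
```

The hard part mentioned above sits entirely inside `summarize`. The bookkeeping is simple, but whether the summary keeps the right signal is what decides whether the agent still "remembers".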
>> I think I'm going to bring Fox on the run back. Yeah, he's saying we need all of these workarounds because the models can't hold persistent state. I think that's very accurate, but they can't.
And I'm sure there are plenty of researchers working on continual learning, >> but we don't have that right now. And by the way, it feels like we're just scratching the surface on context engineering, so there seem to be a lot of gains left just here. Maybe in the future, >> great, this won't be as relevant, but it is extremely relevant today. So I just wanted to address that.
>> And yeah, there are still a lot of gains left. One thing I would say is this: we need to look at things in terms of timelines and perspective. We're not the only architects of intelligence, and by we, I mean humans.
Another architect of intelligence is nature. Nature has had billions of years to create the ideal form of intelligence, which, obviously I'm biased, is human intelligence. Nature has had this experimentation over billions of years to get to where we are now. And we as humans are trying to do this in a compressed amount of time. AI as a field has not existed for even a hundred years. It's probably 70 years now. So we're trying to do all of this, I guess in comparison to our own intelligence, in a very short amount of time, and I would say we're not doing a bad job.
>> Yeah. So, Richmond, let me read this to you and you answer it. Guy 15x: how should an agent decide what to forget, given that storing everything creates retrieval noise and cross-system drift?
>> Yes. So learning how to forget is where the creative part of agent memory comes in, because there is not one way to forget. And to make it even more complicated, we don't fully understand what goes on in the human brain.
Neuroscience is still an active area of research, and neuroscientists are still trying to understand how human memory works. Whether it's myth or fact, the current thing that I've read from researchers is that humans actually never forget information. Everything you experience in your day-to-day is stored in your brain. The human brain just has a better way of surfacing it, right? What I would say is that, in a computational form, you can implement a mechanism of forgetting by looking at factors that you can attach to memory units, such as recency, relevance, and importance.
>> Right? So you can actually compute a weighted value that is made up of several metrics, and then use that as a way to either surface information or reduce the importance, or the signal, of other information within your system. And this again is what the Generative Agents paper actually touches on. They implement a forgetting mechanism, and even that simple implementation can get you to a very good level of performance, which I will show as well on some of the charts and implementations that I'm going to share in a second.
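The weighted score over recency, relevance, and importance can be sketched as follows. This is a minimal illustration loosely in the spirit of that description, not the scoring from any specific paper or package. The `Memory` fields, the decay rate, the equal weights, and the precomputed relevance values are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    importance: float  # 0..1, assigned when the memory is written (assumed given)
    relevance: float   # 0..1, similarity to the current query (precomputed here)
    age_hours: float   # time since the memory was last accessed

def score(memory, decay=0.99, weights=(1.0, 1.0, 1.0)):
    # Recency decays exponentially with age; the other two factors are used as-is.
    recency = decay ** memory.age_hours
    w_recency, w_relevance, w_importance = weights
    return w_recency * recency + w_relevance * memory.relevance + w_importance * memory.importance

def surface_and_fade(memories, keep=2):
    """Return (memories to surface, memories whose signal should be reduced)."""
    ranked = sorted(memories, key=score, reverse=True)
    return ranked[:keep], ranked[keep:]

memories = [
    Memory("user prefers weekly summaries", importance=0.9, relevance=0.8, age_hours=200),
    Memory("user said hello", importance=0.1, relevance=0.1, age_hours=1),
    Memory("user's portfolio is mostly index funds", importance=0.8, relevance=0.9, age_hours=48),
]

kept, faded = surface_and_fade(memories)
print([m.text for m in kept])
print([m.text for m in faded])
```

In this toy run the very recent but trivial "hello" loses to older memories that are more relevant and important. "Forgetting" here means ranking low, so the low scorers can be demoted, archived, or dropped depending on the system.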
>> Okay. One more, this is more of a comment, from GMO: >> I know this might be cliche, but this reminds me of all the infrastructure surrounding the CPU: L1 to Ln cache, RAM, etc. Yeah, I mean, look, I'll just give a brief thought and then Richmond, jump in.
Karpathy kind of said it, right? This is the new architecture of computers. LLMs are the new operating system. And so if you think of it in that way, yeah, I think you're right, GMA. There's short-term memory, which is RAM, and long-term memory, which is the hard drive. The actual LLM is the CPU. So I think you're correct to draw that analogy.
>> Yeah, for sure. And it's the same analogy. I don't know who did it first, whether it was Andrej Karpathy or maybe it was the MemGPT guys, but the same analogy is used in the MemGPT paper.
>> Yeah. Yeah. Um let me see. I think uh yeah, let's keep going.
>> Okay, let's go. Let's keep cooking. So the thing is, a simple agent interaction that is kind of one-dimensional, where you ask the agent a question, the LLM reasons about it, and you get your answer back, is good, but it's not going to get you far enough. What I mean by that is, in most use cases that are actually important to developers or even to general society, you're going to be working on long-horizon tasks. So you need a system that can manage its own context window and give you the impression of learning and adapting to new information. Take software debugging, right? Coding is pretty much what everyone's focused on, and that can be a long-horizon task. Research and synthesis report generation can be long-horizon too. All of this requires agents that can go through several iterations of the agent loop and yet keep their context window utilization at a decent level, without losing accuracy or increasing latency. That is what the concept of agent memory is about. It's all about how we make agents have this reliable sense of information adaptation, so they can be reliable, capable, and believable.
>> So I'm going to go back to humans here.
We humans have, I guess, a very good memory system, and an intelligence system that is based on memory as well.
But there are different types of memory in humans.
>> Can I can I interrupt? I I want to put something on screen.
>> Hypersonic monkey brains. Uh great name.
Put the LLM in charge of its own context window. We kind of dabbled on this point earlier with OpenAI's feature: should the human be in the loop saying thumbs up, thumbs down, this is good context or not?
A lot of what you're describing is human-curated context engineering. Is there a future in which the model itself can just manage its own context, or is that kind of a snake-eating-its-own-tail type thing?
>> I think there is a future where we would have an agent that can actually refine its own harness, right, and optimize.
>> Oh yeah. Did you see that paper from a few weeks ago? Um,
>> Meta-Harness.
>> Meta-Harness. Yeah.
>> There we go. That's right. Yeah, exactly.
>> Exactly. So it's obvious, right? It's still early. And someone in the comments just says, "No, mine is doing that now." So I guess it's not as early as most people think, but this is where we're at, right? That is going to happen. I think 2026 is really focused on the scaffolding around LLMs. In the previous years we've been focused on LLM capabilities, but I think we've gotten what we can out of the reasoning capabilities of the LLM. Now we're focusing on the system around it, and it's an important factor. You can see it's important because OpenClaw is a harness, right? And Manus has a very good harness around the LLM that allows it to do several tasks. I know the acquisition by Meta got blocked, but those guys have been cooking. So the system and the scaffolding around the LLM is very important. Having an agent that can actually refine that is an interesting paradigm that is definitely worth exploring.
>> Richmond, someone's calling out your Titan book. Can you talk about what that is behind you?
>> Oh, the Titan book.
That's good. Um it's about Rockefeller.
>> Yeah, >> it's about Rockefeller. So, um >> Yeah. Nice. Oh, I gotta read that.
>> It's a hefty book. It's a hefty book.
>> Yeah, good call. Um, all right. Yeah, let's see. Raymond Blackwood working on self-healing harness. Yeah, I think there's a lot of work on that right now. Um, Richmond, if you want to continue, please do.
>> Oh, yeah. I I stopped sharing, but let me let me get back to it. Um, let's do this and you can pop it up. Let me know when we're good.
>> Yeah, we're uh it was good. Now it just says loading on the screen.
>> It's going to come up in a second. There we go.
We good?
>> Yeah.
>> Okay. So, agent memory, right? You can think of agent memory as this exocortex on top of the LLM that makes it remember and adapt. There are different ways you can think of agent memory, and this is where we start to get more into it. You can think about it in terms of duration.
Very easy: short-term and long-term. Whatever is in the context window is probably short-term. Then when you start to put it in a file system or database, you go into this long-term realm. I've done a couple of blog posts on file systems versus databases as memory substrates, and we can talk about that as well. That was a debate that happened at the beginning of this year that I was involved in: whether you want to use file systems or databases for your agent memory. But there's another special aspect of memory, which is coordination. That's more of a shared memory paradigm, where you have multi-agent systems and you need to make sure they coordinate effectively. Cognitive function is another way you can distinguish agent memory, and this is when you start to look at the paradigms we have in humans. Humans have working, procedural, episodic, and semantic memory, and we can look at the computational forms of all of these cognitive functions. An example of procedural memory is something like skills. Skills.md files are very popular. Markdown files that describe how to do certain tasks for agents are very popular. This is procedural memory, because what you're describing to the agent is a step-by-step process for conducting a certain activity, maybe the way you've conducted it previously. So skills.md files are basically SOPs for agents, standard operating procedure documentation, essentially. Another type of procedural memory would be something like a toolbox. One paradigm I talk about a lot is storing your tools in a database and surfacing them when you actually require them, using some form of retrieval mechanism. This is a way to scale the number of tools your agent has access to without bloating the context window.
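The toolbox idea, keeping many tools in a store and surfacing only the ones a request needs, can be sketched like this. It is a minimal illustration, not the speaker's implementation. The tool names and stub functions are hypothetical, an in-memory list stands in for a database table, and keyword overlap stands in for whatever retrieval mechanism (for example vector search) a real system would use.

```python
# Hypothetical toolbox: in a real system these rows would live in a database table.
TOOLBOX = [
    {
        "name": "get_weather",
        "description": "look up the weather forecast for a city",
        "fn": lambda city: f"forecast for {city}",
    },
    {
        "name": "convert_currency",
        "description": "convert an amount of money between two currencies",
        "fn": lambda amount, rate: amount * rate,
    },
    {
        "name": "send_email",
        "description": "send an email message to a recipient",
        "fn": lambda to, body: f"sent to {to}",
    },
]

def retrieve_tools(query, toolbox, k=1):
    """Return up to k tools whose descriptions best match the query."""
    query_words = set(query.lower().split())

    def overlap(tool):
        # Stand-in for semantic similarity: count shared words.
        return len(query_words & set(tool["description"].lower().split()))

    ranked = sorted(toolbox, key=overlap, reverse=True)
    return [tool for tool in ranked[:k] if overlap(tool) > 0]

selected = retrieve_tools("what is the weather forecast in Lagos", TOOLBOX)
# Only the selected tool descriptions would be placed in the context window.
print([tool["name"] for tool in selected])
```

With this shape the context carries `k` tool descriptions per request regardless of how many tools the store holds, which is the scaling property described above.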
And we talk about this as well in a couple of materials that we've done and in a couple of courses we teach. Episodic memory is an easy one to remember. Just think about the conversations you're having with a chatbot, with an LLM, with an agent.
The back-and-forth interaction is episodic memory, because it has some form of time association with it. And there are other types of memory: semantic memory and shared memory. So that is agent memory. In the early days people used to ask me, "Richmond, why do you keep referencing human memory?" I guess because, for most people, understanding things through a computational system like the CPU parallel makes more sense. But I talk to a lot of nontechnical people in my day job, and that requires me to draw parallels to what people understand. One thing that people understand is themselves, and that's why I use human memory as a way to bring people into agent memory in the computational form.
And we've done this as a society that is trying to solve things using technology. I'll go from left to right. David Hubel and Wiesel were neuroscientists who experimented on the visual cortex of cats to understand how neurons in the brain reacted to visual images, right? They did this in the 70s. They won a Nobel Prize for this. But what they researched actually impacted and influenced some of the things that we did around convolutional neural networks, which were the dominant neural networks before transformers, and we used them in most of the face detection, pose estimation, or body detection systems that we have in our mobile devices and other devices. So you see this borrowing from neuroscience happening in the field of machine learning and deep learning. And if you go into mechanical engineering, the first inventors of flight, the explorers of flying vehicles, looked to birds, to animals that fly by flapping their wings. That's why the first attempts at creating flying vehicles had literal flapping mechanisms. So again, looking to nature inspires mechanical or computational engineering and gets us ahead in a way that we could not have managed without that inspiration. Now planes move faster than birds.
So that is why I use the human analogy, the human parallel, here. It just helps. So, memory. When you have your database, you can have different memory types represented as tables within it, and it's still some of the same paradigms that we've used in RAG that we're bringing into this agent memory topic. So that was context engineering. But now there is something we need to really think about, and I've been saying this for a number of years.
So I coined the discipline of memory engineering. If you follow me on LinkedIn, or you've looked at my LinkedIn, I did something called 100 days of agent memory, where I was talking about agent memory every single day and sharing my learnings on LinkedIn and sometimes on X as well. Memory engineering is important, and you can see that this is what Anthropic and OpenAI are doing today. Whether it's dreaming or taking human feedback, you need to engineer systems that can take information and either create new information from the existing information, or consolidate, forget, or augment the memory itself. This is different from engineering what's in the context window. Now you're engineering what's outside of it, and even what's inside of it as well. So this is memory engineering. And then you have other aspects of the systems that you need within your agent harness. In a sense, this can be looked at as an agent harness. And when you do all of this, this is the interesting thing. But before I get into this, do we have any questions from your audience?
>> Yeah. Not a question, but this is super interesting. Fox on the run coming with the bangers: Lilienthal didn't copy feathers. He found the principle. Hubel and Wiesel showed continuous hierarchical processing. We took discrete tokens and called it intelligence, aka we're basically building feathered planes. Do you agree with that? He's basically giving his own version of the analogy.
>> Yeah. Yeah. I need to copy and paste that and just like digest that onto the live stream.
>> No, the the the crux is like we're we didn't identify something and build it in our own way. We're basically just trying to say like, oh, look, that's a bird. Let's make a big bird.
>> Yeah. Yeah. Yeah. And in fact, "we're still building feathered planes" is a good way of saying it. I think it's enough. It will get us where we need to get to if we actually apply the same paradigm to agent memory.
>> Yeah. Okay. Uh All right. Yeah. You're still digesting it.
>> Yeah. Yeah. I'm still digesting this. Is it?
>> Yeah. Grab a grab a grab a screenshot of that message. Yeah. Fox on the run. Very nice. Very nice.
>> Um uh am I still sharing my screen? Do we see like a graph on the screen?
>> Yeah. Yeah.
>> Okay. Nice. So look, the one thing about this is we don't just talk about this, right? We're in the arena. We're experimenting. We're putting stuff in Python packages. We're optimizing the database over at Oracle to make sure we're creating a system that our developers and our customers can rely on to build more production-ready, performant agents.
Now, on the screen right here, and for those who are technical, all of the code that we're seeing is on the Oracle AI developer hub that Matt shared earlier in the chat. So you will see the code and you can experiment with this yourself. On the red line, this is a naive agent.
What I mean by naive agent is it doesn't have any engineered memory. It's basically just appending each interaction to the next for every iteration, right?
So we're not engineering memory. We're not consolidating memory. We're not summarizing. We're not doing any of the fun stuff. But on the green line, we have engineered memory using the Oracle AI agent memory package. And as you can see, I ran this for 100 turns.
This was a back-and-forth interaction. We went back and forth 100 times. You can see the token consumption go up for the non-engineered agent, but the token consumption for the engineered agent stays relatively stable.
So in a sense you could run this for, I don't know, maybe a thousand iterations, and you would have infinite context, in this rough way of using the words infinite context. I've not done it for a thousand iterations, but maybe I'll do that. This is just 100, and you can see it's stable. Now people will ask me: hey, yes, you have lower token utilization within each iteration, but is it accurate? Can the agent actually remember? Well, in the notebook, one thing I did was implement an LLM as a judge, where we use an LLM to gauge each of the responses to questions and see which one it prefers.
Right? And here we can see the LLM actually preferred the engineered memory over the naive one, and there were cases where it was a tie. This is just using an LLM. So now we are keeping token consumption efficient and still maintaining a sense of accuracy.
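The shape of the two curves can be reproduced with simple arithmetic. The sketch below is not the notebook from the stream and does not use its data. The per-turn token count, summary size, and window length are made-up numbers chosen only to show why an append-only prompt grows linearly while a summary-plus-recent-turns prompt plateaus.

```python
def naive_prompt_tokens(turns, tokens_per_turn):
    """Append-only history: the prompt at turn t carries every earlier turn."""
    return [t * tokens_per_turn for t in range(1, turns + 1)]

def engineered_prompt_tokens(turns, tokens_per_turn, summary_tokens, keep_recent):
    """One fixed-size summary plus a sliding window of the most recent turns."""
    sizes = []
    for t in range(1, turns + 1):
        recent = min(t, keep_recent) * tokens_per_turn
        summary = summary_tokens if t > keep_recent else 0
        sizes.append(recent + summary)
    return sizes

# Illustrative numbers only (assumptions, not measurements).
naive = naive_prompt_tokens(turns=100, tokens_per_turn=200)
engineered = engineered_prompt_tokens(turns=100, tokens_per_turn=200, summary_tokens=300, keep_recent=5)

print("naive prompt at turn 100:", naive[-1])
print("engineered prompt at turn 100:", engineered[-1])
print("total tokens sent, naive vs engineered:", sum(naive), sum(engineered))
```

Under these assumptions the engineered prompt never exceeds 1,300 tokens, while the naive one reaches 20,000 by turn 100. The simulation says nothing about answer quality, which is why a separate check such as the LLM-as-judge comparison is needed.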
Let's put a pause there. Any question?
Let's let's digest this.
>> Not a ton of questions. We have a lot of comments, though. Okay, I'll show a few. By the way, chat, if you have comments, if you have questions, drop them. We'll talk about it right now. Huge reaction. Those that are not technical, just give it to Codex.
Yeah, basically. Um and by the way, uh we'll put the GitHub link and everything in there. Um let's see.
Gareth Hood: Your brain runs on seven networks. Default mode, executive, salience, two attention ones, sensorimotor, and visual.
>> Yeah.
>> Okay. Um, yeah, that's about it. Let's keep going.
>> Yeah, let's keep going. Those are some good comments. I think you have a good variety of folks in your audience, and this is very good. We'll keep cooking. So here we can see we're maintaining accuracy and we're maintaining operational cost. And there are other things, right? Some of the more technical folks would probably think: okay, this is great, you're maintaining operational cost through token consumption, you're also maintaining accuracy, but how about the KV cache? Let me actually just scroll down to this graph. How about the KV cache, right? That is this mechanism that the frontier labs provide to make inference a bit faster. OpenAI has automatic prompt caching. If you maintain the structure of the prompts, or whatever is in your context window, that you're sending to your inference provider, and you don't change it, they will store the computed attention for it and not have to recompute it, which saves on latency and saves on operational costs on their end as well, right?
That's the KV cache. I see some people in your audience know what I'm talking about. And yeah, it does seem like overnight everyone started talking about the KV cache, to be fair. The folks at Cognition and Manus were early to talk about it, and they shared the context engineering solutions they were using to maintain the KV cache. So I did an experiment in the same notebook, where we have a naive end-to-end...
>> Where can people find this?
>> This notebook or the...
>> The notebook.
>> This notebook, yeah, for sure. The Oracle AI developer hub is where the team is always cooking, and we're serving everything there. So let me put the link again to this particular notebook that I'm showing.
>> Yeah, drop it in our chat and I'll put it in the main chat.
>> Drop it in.
So we are always cooking. So yeah, exactly. The KV cache was something that came up and
>> Of course.
>> Sorry, go on.
>> Brian, giving his great commentary in chat. Thanks, Brian.
>> So what we're seeing here is this, right? A naive end-to-end memory implementation, where you're just appending information to the context window and then serving it up, is actually going to maintain the KV cache on your frontier lab or model provider side, right? So you're going to see low end-to-end latency, in terms of what's going on from retrieval to inference to getting your response back. That's what you see on the red, right? The KV cache has been maintained. Then on the green you have this optimized AI agent with all of the memory engineering techniques, and you can see it's a bit higher in terms of latency. But on the blue is this happy medium, where you're using some of the memory engineering implementations that we talk about and you're still able to maintain an end-to-end latency that's comparable to when you're maintaining the KV cache. I'm going to do a blog on this very soon, when I come back in a few weeks, and we can talk more about this.
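One way to see why append-only prompts keep the provider-side cache warm while rewritten prompts do not is to measure how much of each prompt is an unchanged prefix of the previous one. The sketch below assumes, as a simplification, that cache reuse is limited to the longest exact shared prefix. The token lists are synthetic and the function names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reuse_ratio(prompts):
    """For each prompt after the first, the fraction already seen as a prefix of the previous one."""
    return [
        shared_prefix_len(prev, cur) / len(cur)
        for prev, cur in zip(prompts, prompts[1:])
    ]

system = ["sys"] * 4                          # fixed system prompt tokens
turns = [[f"t{i}"] * 3 for i in range(6)]     # three synthetic tokens per turn

# Append-only: each prompt extends the previous one, so the old prompt is a pure prefix.
append_only = [system + sum(turns[: i + 1], []) for i in range(6)]

# Rewritten every turn: a new summary token right after the system prompt changes each time.
rewritten = [system + [f"summary{i}"] + turns[i] for i in range(6)]

print([round(r, 2) for r in reuse_ratio(append_only)])
print([round(r, 2) for r in reuse_ratio(rewritten)])
```

In this toy setup the append-only prompts reuse a growing share of each request, while the rewritten prompts never reuse more than the system prompt. A middle ground, offered here only as an assumption about what a "happy medium" could look like, is to keep a stable prefix and compact only occasionally, so most consecutive requests still share a long prefix.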
Right? This is where you go into memory engineering. This is where you start thinking about operational cost, where you start thinking about how you can do something beyond what the frontier labs are doing and build your own system. Because everyone is building their own harness today, right? Everyone started building their own knowledge bases after the tweet from Andrej, right? And I think everyone is going to have their own bespoke harness.
>> Really? I mean, when you say when you say everybody, you mean like everybody but within like a very narrow group of people, right?
>> So, this is what I would say, and this is what we're seeing, right?
The barrier to entry to software engineering, and I use that term very loosely today, is lower than ever.
>> Yeah. Right.
>> I mean, we literally had a comment earlier that was like, for those who don't understand, just give it to Codex.
>> Just give it to Codex. Exactly. So you can give Codex a blog post and just say: can you reimplement everything that is being said in the blog post, and use it to do a certain task, or for my specific workflow? And this is why it's important, because the implementation of memory can differ depending on the workflow, or what I call the application mode, right? There are three main types of application mode we're seeing: deep research, assistant, and workflow, right?
So, deep research is when you're going deep on a particular topic, you're scouring the internet for different sources of information, and the end result is an extended report back to the end user. Then an assistant is a back-and-forth conversational agent. But I think you have a question.
>> Yeah. So, let's do this, Richmond. If there are any more questions, drop them in chat right now and let's try to answer them. I think Guy I5X has a really good one. I'm going to read that off, you can answer it, and then we can close out this section. Okay, so, Guy I5X: does prefilling the KV cache with consolidated memory embeddings at session start meaningfully reduce retrieval misses compared to on-demand loading? Now, before you answer this, Richmond, for those of us, including myself, who maybe understand three of the words in this question, break it down. What is he asking? And then please answer.
>> Does prefilling the KV cache with consolidated memory embeddings at session start meaningfully reduce retrieval misses compared to on-demand loading? This is reminding me...
Okay. So let me break it down. So, prefilling the KV cache with consolidated memory embeddings. What that is saying is, when you've consolidated memory... let me put it in terms of what we can actually understand today. Okay. So you know the way Anthropic has this dreaming feature implemented now? The model doesn't have to actually determine which information it needs to prioritize within its own reasoning logic, because this consolidation has solved it. So you generate embeddings of it and prefill that into the context window, and so you avoid the precomputation of that. And there is a paper called CAG, C-A-G, that came out last year that spoke about actually computing the KV cache and putting that into the context window so you don't actually have to redo that, and I'll get the paper up and I can share this. But the answer to the question is: I don't know. I don't know.
I'm going to need to experiment with this. Let me do that. I'm just going to get the paper for CAG. It was called Cache-Augmented Generation, and I think this is what this person is referring to.
And yes, I've got the paper, and that paper came out in 2024.
And I'm just going to share that with you Matt and you can share that with the audience.
Okay, I'm putting it up.
>> Right. So, that paper was basically: you precompute the KV cache. You'd have to do the precomputation on the end of the inference provider, and then you save on all of the operational costs, and you can avoid doing RAG as well. That was this paper's argument. But I haven't seen an uptake of this. I've not seen this being implemented. So maybe I can revisit it and do an implementation of this.
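Editor's sketch of the cost argument just described: encode the knowledge once, keep the resulting state, and only process the new question tokens on each request. This is a toy accounting model, not the paper's implementation. `ToyModel`, its `encode` method, and the token counts are hypothetical stand-ins, and a real system would store actual key/value tensors rather than tuples.

```python
from typing import List, Tuple


class ToyModel:
    """Hypothetical stand-in for an LLM that counts the tokens it encodes."""

    def __init__(self) -> None:
        self.tokens_encoded = 0

    def encode(self, tokens: List[str]) -> Tuple[str, ...]:
        # A real model would return key/value tensors; a tuple stands in here.
        self.tokens_encoded += len(tokens)
        return tuple(tokens)


def run_without_cache(knowledge: List[str], questions: List[List[str]]) -> int:
    """Re-encode the full knowledge text together with every question."""
    model = ToyModel()
    for q in questions:
        model.encode(knowledge + q)
    return model.tokens_encoded


def run_with_cache(knowledge: List[str], questions: List[List[str]]) -> int:
    """Encode the knowledge once, then encode only each question's tokens."""
    model = ToyModel()
    cache = model.encode(knowledge)  # precomputation cost is paid once
    for q in questions:
        _state = cache + model.encode(q)  # reuse cached state, add new tokens
    return model.tokens_encoded


knowledge = ["k"] * 1000
questions = [["q"] * 10 for _ in range(3)]

cost_no_cache = run_without_cache(knowledge, questions)  # 3 * 1010 = 3030
cost_with_cache = run_with_cache(knowledge, questions)   # 1000 + 30 = 1030
```

The saving grows with the number of questions asked against the same knowledge. Whether this reduces retrieval misses, which was the original question, is something the sketch does not address.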
>> Um, all right. So final thoughts, Richmond.
>> Okay, let's go. Final thoughts. I'm going to stop sharing. This is my final thoughts.
Everyone is so heads down in 2026 focusing on the harness, right? And I think it's important, and this is an important thing, which is: everyone can now participate.
Everyone can participate in what's going on in the space today, and agent memory gives you that entry point into having your own way of doing things in terms of your creative technique, or at the very least understanding what's going on. So my final thoughts are: there are several materials that Matt and I have provided you in this conversation. Check them out. Hit me up on LinkedIn, hit me up on X, and we are just going to be in the arena. Check out a memory package and implement it in your system, and then give us feedback, give me feedback, and we will reflect that either in the Oracle data itself or in the package as well. So, we're listening to... And I think there's an echo.
>> Uh, is there? Oh, there's an echo all of a sudden. Okay. I see. Boom. I got it. Okay. Sorry about that, everybody. I'm still figuring out how to use things. I actually just dropped that DeepLearning.AI link once again, featuring Richmond and Andrew Ng. So, go check that out. It is free. If you want a really great deep dive on agent memory, that's where to get it. I just dropped it in chat, so go check it out.
And yeah, sorry about the echo, everybody.
>> Richmond, I really appreciate you coming on and sharing all your knowledge with us. Please come back again. We'll have to bring you back on soon.
>> All right. Thanks a lot for having me, and thanks a lot for the good questions, guys. All right, see you all.
All right, I think we're going to call it there for today. I meant to play around with the GPT voice, the new one.
I just don't know much about it yet, so I have to actually go test it. Maybe we'll stream again tomorrow, once I've had a chance to really check it out. So, until then, see y'all later. Go check out forwardfuture.ai.
Follow us. Follow me. All the channels.
You know, we're everywhere: Twitter, YouTube, Instagram, TikTok, Twitch, all of them. Hope you enjoyed. See you later.