
Make Your Mastra Agent Cheaper and Faster with Prompt Caching
Added: 2026-05-12

142 views · 176:23 · mastra-ai · Originally released: 2026-05-12

Prompt caching can reduce LLM token costs by up to 90% and latency by up to 80%, but it only works when the beginning of the prompt matches exactly. To maximize cache hits, developers should structure prompts with the most stable content (system instructions, few-shot examples) at the top, followed by stable user-specific information, and place dynamic or session-specific content (working memory, last n messages) at the bottom. A change near the start of the prompt invalidates the entire cache, while a change later in the prompt invalidates only the portion after the change point.

[00:00:00]Prompt caching can reduce token costs by 90%, but here's the thing: most people get it wrong. They assume that because prompt caching works automatically with OpenAI, you don't have to think about it. But what if I told you that you can optimize your agents to hit the cache more often, save money, and reduce latency by up to 80%?

[00:00:24]I'm going to show you how with Mastra but first we need to understand the fundamentals of prompt caching.

[00:00:30]Providers like OpenAI charge based on the number of tokens they process. With 5.5, you pay five bucks per million in and 30 bucks per million out. It makes sense to pay for output: that obviously takes compute and creates the value we're willing to pay for. But why charge for inputs separately? It turns out input tokens also use compute, because every token has to be preprocessed through the model to produce the KV representations used by the model's attention mechanism.

[00:01:03]The good news is that if OpenAI has already seen and processed the exact same prompt prefix in the last 10 minutes, it may be able to skip doing that expensive work again, because the result of processing those tokens has already been computed.

[00:01:19]And to their credit, OpenAI passes those savings back to us and charges 90% less for cached input tokens. This also reduces latency, especially for longer prompts.
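To make the discount concrete, here is a rough sketch of the arithmetic using the figures quoted above: five dollars per million input tokens and 90% off for cached input tokens. The `inputCostUsd` helper and the 10,000-token prompt size are illustrative assumptions, not part of any provider SDK, and real bills depend on the provider's current price list.

```typescript
// Price quoted in the video, in USD per million input tokens.
const INPUT_PRICE_PER_MILLION = 5;
// Cached input tokens are charged 90% less, as stated in the video.
const CACHED_DISCOUNT = 0.9;

// Hypothetical helper: input cost of one request, given how many of its
// input tokens were served from the cache.
function inputCostUsd(totalInputTokens: number, cachedTokens: number): number {
  const uncachedTokens = totalInputTokens - cachedTokens;
  const fullRate = INPUT_PRICE_PER_MILLION / 1_000_000;
  const cachedRate = fullRate * (1 - CACHED_DISCOUNT);
  return uncachedTokens * fullRate + cachedTokens * cachedRate;
}

// Assumed prompt size for illustration: 10,000 input tokens.
const coldStart = inputCostUsd(10_000, 0); // nothing cached yet
const fullyCached = inputCostUsd(10_000, 10_000); // entire prompt cached

console.log(`cold start:   $${coldStart.toFixed(4)}`);
console.log(`fully cached: $${fullyCached.toFixed(4)}`);
```

Under these assumptions, the fully cached request costs about a tenth of the cold start on the input side. Output tokens are billed separately and are not affected by caching.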

[00:01:32]So how does prompt caching work, and what does it look like in practice? It depends on your provider. With OpenAI, caching is enabled automatically for prompts that are 1024 tokens or longer.

[00:01:45]Here for example, I have a Mastra agent with a long knowledge base system instruction. It's about 10,000 tokens.

[00:01:52]When I run the agent from a cold start, I pay for all 10,000 input tokens.

[00:01:57]OpenAI hasn't seen this exact prefix recently, so there's nothing in the cache to reuse.

[00:02:03]That said, if I send a follow-up message, I only pay full price for the new input tokens in that last message.

[00:02:12]The roughly 10,000 tokens of context from earlier are cached, and they are reused at a much lower cost. Now, here's the most important part, and really my whole motivation for this video.
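One way to check whether a follow-up actually hit the cache is to look at the token usage returned with the response. The sketch below assumes the `usage` shape from OpenAI's Chat Completions API, where `prompt_tokens_details.cached_tokens` reports how many input tokens were served from the cache. How Mastra surfaces this value may differ, and the sample numbers are made up for illustration.

```typescript
// Shape assumed from the `usage` object in OpenAI's Chat Completions API.
interface Usage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Hypothetical helper: fraction of input tokens served from the cache.
function cacheHitRate(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}

// Illustrative values only: a cold start, then a follow-up message.
const coldStartUsage: Usage = {
  prompt_tokens: 10_000,
  prompt_tokens_details: { cached_tokens: 0 },
};
const followUpUsage: Usage = {
  prompt_tokens: 10_050,
  prompt_tokens_details: { cached_tokens: 9_984 },
};

console.log(cacheHitRate(coldStartUsage)); // no cached tokens on a cold start
console.log(cacheHitRate(followUpUsage).toFixed(2)); // most of the prompt was cached
```

Logging this ratio per request is a cheap way to notice when a prompt change has quietly broken caching.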

[00:02:25]Prompt caching only works when the beginning of the prompt matches exactly.

[00:02:31]That is to say, the cache only applies up to the first changed token. So, if you change something right at the beginning of the prefix, you invalidate the cache entirely.

[00:02:42]If you change something later in the prefix, you may still get a partial cache hit for everything before that point, while everything after it still has to be processed as new. ProjectDiscovery, the team behind Neo, an autonomous security testing platform, were only getting a 7% cache hit rate because their working memory changed on nearly every step, which kept invalidating the bulk of their cache. Once they fixed that, their cache hit rate and savings increased dramatically.
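The "first changed token" rule can be modeled as a longest-common-prefix check between two token sequences. This is a simplified sketch: the `sharedPrefixLength` helper is hypothetical, the word-level "tokens" stand in for real tokenizer output, and real providers typically match the prefix in larger chunks rather than token by token.

```typescript
// Hypothetical helper: number of leading tokens two prompts have in common.
// In this simplified model, only those tokens are eligible for a cache hit.
function sharedPrefixLength(previous: string[], next: string[]): number {
  const limit = Math.min(previous.length, next.length);
  let i = 0;
  while (i < limit && previous[i] === next[i]) i++;
  return i;
}

// Word-level stand-ins for real tokens.
const base = ["You", "are", "a", "support", "agent", ".", "Answer", "briefly", "."];
const changedEarly = ["We", ...base.slice(1)];
const changedLate = [...base.slice(0, 6), "Answer", "politely", "."];

console.log(sharedPrefixLength(base, changedEarly)); // 0: the first token differs
console.log(sharedPrefixLength(base, changedLate)); // 7: everything before "politely"
```

A change at position 0 leaves nothing to reuse, while a change near the end keeps most of the prefix eligible.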

[00:03:13]So, how do you get this right the first time and avoid wasting money? My first tip is to avoid putting dynamic values like time or the request ID at the top of the system prompt. In this case, as soon as the clock ticks just 1 minute, the beginning of the prompt changes, and that effectively invalidates the cache every 60 seconds. Additionally, I recommend you structure your prompts so that the most static context goes at the top, and the most variable context goes at the bottom, maximizing cache reuse.
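A small sketch of why a timestamp at the top is so costly: build the same prompt one minute apart, once with the time first and once with the time last, and compare how much of the text stays identical. The `timeFirst`, `timeLast`, and `sharedPrefixChars` helpers and the placeholder knowledge base are hypothetical, and characters are used here as a stand-in for tokens.

```typescript
// Placeholder for a large, stable block of context.
const KNOWLEDGE_BASE = "Stable knowledge base text. ".repeat(50);

// Anti-pattern: dynamic value at the very top of the prompt.
function timeFirst(now: Date): string {
  return `Current time: ${now.toISOString()}\n${KNOWLEDGE_BASE}`;
}

// Preferred: stable content first, dynamic value at the end.
function timeLast(now: Date): string {
  return `${KNOWLEDGE_BASE}\nCurrent time: ${now.toISOString()}`;
}

// Number of leading characters two prompts share (a stand-in for tokens).
function sharedPrefixChars(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const t1 = new Date("2026-05-12T10:00:00Z");
const t2 = new Date("2026-05-12T10:01:00Z"); // one minute later

const reusableWhenFirst = sharedPrefixChars(timeFirst(t1), timeFirst(t2));
const reusableWhenLast = sharedPrefixChars(timeLast(t1), timeLast(t2));

console.log({ reusableWhenFirst, reusableWhenLast, promptLength: timeLast(t1).length });
```

With the time first, the two prompts diverge inside the timestamp, before the knowledge base is reached. With the time last, the entire knowledge base remains an identical prefix.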

[00:03:46]Suppose you insert dynamic data that changes occasionally, like working memory, near the top of the prompt. When that memory changes, everything after it is no longer cached. So, in my case, the large stable knowledge base context comes after the working memory, which means that when the working memory changes, I invalidate the knowledge base context and with it the bulk of my cache.

[00:04:12]That's no good.

[00:04:14]The takeaway here is to always put the most stable content first and the more variable content towards the end.

[00:04:24]By making this small change, when the working memory updates, only the cache from that point onward is invalidated and we preserve the bulky parts of the cache we really care about.

[00:04:35]Zooming out a little bit here, it is helpful to think about your prompt as a series of blocks.

[00:04:41]At the top, you have ultra-stable context, like system instructions or few-shot examples that never change. In theory, this block should fill the cache and stay warm during business hours as users keep hitting it, so you only need to pay for it once. In practice, cache hits are not guaranteed. There are lots of reasons you might miss the cache that are out of your control, but we are going to control what we can. So, next, add stable user-specific information in the next block. These values change from user to user, but they don't change over time, so they can be cached across user sessions.

[00:05:20]Towards the bottom, add session or task-specific context, like memories, retrieval data, and intermediate state that changes frequently.
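The block layout described here can be sketched as a small prompt assembler that orders blocks from most stable to most volatile. The `PromptBlock` type, the `volatility` score, and the sample block texts are assumptions made for illustration; this is not how Mastra builds prompts internally.

```typescript
// Hypothetical block type: a lower `volatility` means the block changes less often.
interface PromptBlock {
  name: string;
  volatility: number;
  text: string;
}

// Order blocks from most stable to most volatile, then join them.
// A copy is sorted so the caller's array is left untouched.
function assemblePrompt(blocks: PromptBlock[]): string {
  return [...blocks]
    .sort((a, b) => a.volatility - b.volatility)
    .map((block) => block.text)
    .join("\n\n");
}

// Blocks deliberately listed out of order.
const blocks: PromptBlock[] = [
  { name: "workingMemory", volatility: 2, text: "Working memory: user is debugging a login issue." },
  { name: "system", volatility: 0, text: "System: You are a support agent for Acme." },
  { name: "userProfile", volatility: 1, text: "User profile: name Sam, plan Pro." },
];

const prompt = assemblePrompt(blocks);
console.log(prompt);
```

Because the working memory lands last, updating it changes only the tail of the prompt, and the system and user blocks stay an identical prefix.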

[00:05:29]Of course, you'll likely want to include the last n messages as well, maybe the last 20 or something, but make sure that does go at the bottom because it is a sliding window. Once you pass that n message threshold, maybe 20, this block gets invalidated with every new message, so it's a bit volatile. I hope you enjoyed this quick video and now have a solid understanding of prompt caching and the most important optimizations to consider.
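The volatility of the last-n-messages block can be seen in a small sketch. The `lastN` and `sharedPrefixLength` helpers are hypothetical, and n is set to 3 rather than 20 to keep the example short. Before the threshold is reached, each new message only appends to the window. Once the threshold is passed, the oldest message drops off and the window no longer starts with the same content.

```typescript
// Hypothetical helper: keep only the most recent n messages.
function lastN<T>(messages: T[], n: number): T[] {
  return n <= 0 ? [] : messages.slice(-n);
}

// Number of leading items two windows have in common.
function sharedPrefixLength<T>(a: T[], b: T[]): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const N = 3; // window size (the video suggests something like 20)

const windowAt2 = lastN(["m1", "m2"], N); // ["m1", "m2"]
const windowAt3 = lastN(["m1", "m2", "m3"], N); // ["m1", "m2", "m3"]
const windowAt4 = lastN(["m1", "m2", "m3", "m4"], N); // ["m2", "m3", "m4"]

// Below the threshold the window only grows, so the old window is a prefix of the new one.
const belowThreshold = sharedPrefixLength(windowAt2, windowAt3);
// Past the threshold the window slides, so even its first message differs.
const pastThreshold = sharedPrefixLength(windowAt3, windowAt4);

console.log({ belowThreshold, pastThreshold });
```

This is why the window belongs at the very bottom: when it slides, nothing after it is left to invalidate.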

[00:05:55]Prompt caching is not unique to OpenAI, okay? Anthropic, for example, supports it, too, although each provider has its own mechanism, its own pricing structure, and its own nuances as well.

[00:06:07]The good thing is, now you understand the fundamentals, you can refer to their documentation or interrogate an LLM about the nitty-gritty.

[00:06:15]I've been Alex Booker at Mastra. Thank you for watching.

#mastra #mastra ai #prompt caching #openai prompt caching #anthropic prompt caching
