Agentic search is an autonomous loop where AI models use search tools (such as hybrid vector search, grep, and document retrieval) to dynamically find and store relevant information, addressing the critical challenge of context rot where language models become unreliable beyond 40,000-100,000 tokens despite marketing claims of million-token context windows. This approach enables agents to handle both reading (what information to reference) and writing (where to store new knowledge) paths, effectively managing the limited context budget through intelligent curation rather than relying on raw token capacity.
Deep Dive
AI Dev 26 x SF | Jeff Huber: Everything You Need to Know About Agentic Search
My name is Jeff Huber. I'm the co-founder and CEO of Chroma, and today we'll be talking about agentic search.
I always like to start with a quick show of hands. How many of you have built an application with retrieval-augmented generation, or RAG, before?
Okay, most of the room.
How many people have either heard of Chroma or have used Chroma? All right, amazing. It's like most of you.
That's great. Okay, well, between the options of giving a thinly veiled sales pitch for Chroma and doing something more interesting and more educational, maybe even more risky, I'm choosing to go for the latter.
And so, if I totally bomb, you can let me know later. But I think it'll be more fun than just, again, a thinly veiled sales pitch.
So, by way of quick background, you know, we at Chroma are obsessed with context. Um, we are the makers of Chroma DB, the most popular open-source solution for search and retrieval. Uh, we have over 27,000 GitHub stars, over 140 million downloads, and you may know Chroma for its local kind of build on your own computer experience. Uh, there's also a distributed and serverless cloud offering now that serves hobby projects all the way through some of the largest enterprise workloads in the world.
Chroma is also now known for a model that we train. We'll be talking about it a little bit in the slides to come.
It's an open-source model, 20 billion parameters, called Context 1. It is state-of-the-art at the task of agentic search. Again, finding the right context.
And then lastly, Chroma is well known for our research. Many of you may have seen our report on context rot that we published in the summer of last year, really bringing to light the limitations of long context for tasks that require reasoning.
Now, context rot is cited routinely by Anthropic and OpenAI in their new model releases.
Also, we were sort of key to spreading the term context engineering. So, we are very obsessed with context.
So, I first want to zoom out for a second and try to define a few terms.
At Chroma, we do not believe that AI is a techno machine god which is going to come and solve all of humanity's problems. We believe that AI is a tool and that AI is useful.
This is actually one of our stickers. We have a few sticker packs at the booth. I can see it from here. So, run over there and grab it if you like this sticker.
And to reason about this, I think it's always helpful to start with what is a computer. We've all used computers for a long time.
A traditional computer is a universal structured information processor that is capable of simulating any effective procedure or algorithm given sufficient resources and a proper program. Again, kind of an academic definition here. I told you I was going to be reaching a bit. So, what is AI? I think it is fair to say that AI is a universal unstructured information processor and one that can simulate any intuitive procedure or reasoning given sufficient resources and the proper context.
And so, it follows that AI in some sense is just this: AI is just context and reasoning.
And my contention would be reasoning has gotten a lot of focus and a lot of investment and context is still very much underrated.
Now, if you look at the data, this is a report that OpenAI put out in the fall of last year.
The title was How People Are Using ChatGPT.
You can see that actually over 45% of ChatGPT queries are asking-centric.
There's a question at the heart of the user's session and the user's query. And of course, if you consider tasks like doing, whether it be seeking information or coding or other, you may also intuit that referencing grounded information is useful even in the cases where you're doing something.
Now, I think all of us sort of see the writing on the wall that AI is coming for a lot of information work.
You know, we will no longer be sort of in the fields ourselves, you know, manning the shovels and the trowels.
It's going to change.
Labor is going to look a lot different.
And so, you know, the average human information worker spends something like 30% of their time seeking and finding the right information to do their jobs correctly. And of course, agents are going to need the same set of capabilities.
So, agents both read context. I think that's been discussed a lot in the context of agentic search. Agents also write context. And we believe that agentic search is useful on both sides of this puzzle. So, to call these out specifically, the agent at runtime needs to ask: what information should I reference? They need to retrieve the right information. Again, this is the read path.
But also on the write path, if they learn something, if they derive something, they need to ask the question: where should I store this? I have a huge corpus of data, potentially it's markdown files in a file system or other, and they need to locate all the relevant places to update their prior knowledge in order to have a universal and consistent picture of what they know.
And so, storing and learning the right information is also incredibly important to the problem of solving ongoing context.
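A minimal sketch of the write path described above, assuming the agent's knowledge lives in an in-memory dictionary of markdown notes and that plain substring matching stands in for real search. The function name `write_path` and the note names are hypothetical, chosen only for illustration.

```python
def write_path(notes, topic, new_fact):
    """Find every note that mentions `topic` and append `new_fact` to each one."""
    touched = [name for name, text in notes.items() if topic.lower() in text.lower()]
    if not touched:
        # Nothing relevant exists yet, so start a new note for this topic.
        name = topic.lower().replace(" ", "_") + ".md"
        notes[name] = "# " + topic + "\n"
        touched = [name]
    for name in touched:
        notes[name] = notes[name].rstrip("\n") + "\n- " + new_fact + "\n"
    return touched


notes = {
    "pricing.md": "# Pricing\nEnterprise tier is quoted per seat.",
    "roadmap.md": "# Roadmap\nEnterprise SSO is planned.",
    "team.md": "# Team\nFive engineers.",
}
touched = write_path(notes, "enterprise", "Enterprise SSO shipped.")
```

The point of the sketch is that the new fact lands in every place that already talks about the topic, so the notes stay consistent with each other.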
One other analogy that I'm going to reach for here is biology. So, what can we learn from our own brains and how they operate?
Our brains are incredibly evolved and sophisticated systems um, that have arguably some level of a bifurcated workload. If some of you have read the book by Daniel Kahneman, Thinking Fast and Slow, this idea of a system one and system two thinking will probably be intuitive to you.
But in short, our prefrontal cortex is slow and expensive. It has high reasoning, but it has narrow attention.
In contrast, the rest of the brain is very fast and very cheap.
It's relatively low on reasoning, you know, it's not the most analytical part of our brain, but it has the ability to pay attention to broad amounts of information.
And it has the ability, importantly, to bring to the attention of your prefrontal cortex what you should be paying attention to. And so, in any sufficiently complex system that has a convex cost curve, you often see a bifurcation of workload. And again, I think reasoning and context play very directly into this natural biological analogy.
So, uh, not all is perfect though. One would hope, well, we just throw it all into context and we let the agent figure it out. Surely the agent is very smart.
We don't need to overly curate the context window. And, you know, hopefully I'm not repeating too much here for those of you who are familiar with context rot. If you'd like to go in depth, there's a 40-page or so report at the URL on the screen. But again, last summer Chroma demonstrated in this long report that language model performance is not invariant to the amount of context that you use, across either reasoning or attention.
So, said another way, the language model labs may market to you that, "Hey, we've got a million-token context window, or two million or 10 million." And yet, when you talk to actual builders in the field who are very much attuned to either their vibe-based evals or their actual evals, they'll tell you, "We don't trust it past 40,000 tokens or 100,000 tokens before it enters the dumb zone." And, you know, once it enters the dumb zone, it's a coin flip whether it's going to work or not. And of course, nobody wants to build software where it's a coin flip whether it's going to work or not. And so, context rot is incredibly important to know as a builder, because it changes how you build your system.
Again, even if you're marketed a million tokens, you may only want to use 100,000.
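One way to act on that advice is to pack context against a self-imposed budget rather than the advertised window. The sketch below is an illustration only: `pack_context` is a hypothetical helper, the chunks are assumed to arrive already ranked by relevance, and the token count is a rough estimate of about four characters per token rather than a real tokenizer.

```python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token. Use a real tokenizer in practice.
    return len(text) // 4 + 1


def pack_context(ranked_chunks, budget_tokens):
    """Greedily keep the highest-ranked chunks that still fit inside the budget."""
    picked, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too big for what is left; a smaller chunk later may still fit
        picked.append(chunk)
        used += cost
    return picked, used


chunks = ["a" * 40, "b" * 400, "c" * 20]
picked, used = pack_context(chunks, budget_tokens=20)
```

Here the 400-character chunk is skipped because it would blow the budget, while the two smaller chunks fit.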
This particular graph, in this particular benchmark, is a repeated-words benchmark. We need to rerun these numbers for the latest generation of models, which is pretty expensive to do, but you can see that for many of the state-of-the-art models, performance starts to fall apart as early as 1,000 or 10,000 tokens, which is not that much context.
Additionally, this motivates context engineering. Many of you may have heard this term. So, if context rot is true, then, well, we can't just throw it all into context and hope and pray it's going to work. We actually have to curate this information. And, you know, this is really the task of context engineering.
You may have also heard a term called harness engineering, and harness engineering can be thought of as the equivalent to context engineering. Maybe context engineering is like you, the human, do the steering, versus harness engineering is like you give the model the tools to do its own context engineering in the harness. Again, all these terms get kind of confusing, I agree. But you can see there's a lot of information actually contending for a fairly limited budget. You know, think about like 120,000 tokens, it's something like a third of a floppy disk.
It's just really not that much information if you think about it.
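The floppy disk comparison can be checked with quick arithmetic. This is a back-of-the-envelope sketch under two assumptions that are not from the talk: roughly 4 bytes of English text per token, and a standard 3.5-inch "1.44 MB" floppy, which holds 1,440 KiB.

```python
BYTES_PER_TOKEN = 4              # assumption: ~4 bytes of English text per token
FLOPPY_BYTES = 1440 * 1024       # a "1.44 MB" floppy holds 1,474,560 bytes


def tokens_to_bytes(tokens):
    return tokens * BYTES_PER_TOKEN


budget_bytes = tokens_to_bytes(120_000)          # 480,000 bytes
fraction_of_floppy = budget_bytes / FLOPPY_BYTES  # roughly 0.33
```

Under these assumptions, 120,000 tokens is about 480 KB of text, which is indeed close to a third of one floppy disk.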
And there's so much information that's contending. I'm sure all of you are Codex and Copilot users, and, you know, when those compactions come along, you really cross your fingers and grit your teeth that it's not going to wipe away all of that hard-earned knowledge that you've built up over the session. So, context engineering is incredibly important. You know, I would assert that today most AI failures are no longer reasoning failures, but are now context failures.
And I think we really see this ever since Opus 4.5 came out over Thanksgiving. The models now are good enough for most long-running agentic tasks, but still the context challenge remains high.
If we had our way, we would like to be able to design a system that can achieve all of the quadrants of this 2x2. We want to create a system that can handle simple queries with a little bit of information, simple queries with a lot of information, but also complex queries over a little bit of information, and complex queries over a lot of information.
And to ground this, what is a little bit of information? Well, a little bit of information might be a single 10-page PDF or a single Git repository, for example. A lot of information would be all of your personal information, all of your company's information, or even all of the world's information. And if we look to build agents that are more and more advanced, that can do more and more useful economic work autonomously, they're going to need access to more skills, more tools, more information, and again, this really motivates the need to curate the context incredibly well.
On the query dimension, so simple queries might be what is the capital of Kyrgyzstan or what are the features of Chroma sync?
Um, I won't answer that now, but again, happy to answer that at our booth.
Whereas complex queries might be: what supplements should I take? That requires a lot of personal context about you and your health to answer well. Or a question like this in a company context: how has our positioning on enterprise or self-serve shifted across the last four all-hands board decks and the CEO's recent podcast? And where does the current roadmap doc contradict the latest direction? Again, we think this is an incredibly valid question to ask of data. And in fact, agents in many ways want to ask questions that are this hard, and we humans want to ask questions of agents that are this complex. And yet, of course, it seems to pose a lot of challenges, at least to conventional retrieval technology, like classic RAG for example.
So, if the goal is to build agents that just work, we need to handle the full spectrum of these queries. And I think we all desire to build tools for ourselves and build tools for our teams, and build tools for our users and customers, where we can hand them the reins and they don't run into the rough edges, and it just works. And so, I actually think that before, it was probably a fool's errand to try to like handle all these quadrants out of the gate. Um now, it might be a fool's errand to not consider all of these quadrants out of the gate, cuz it really changes how you build your system.
So, how can we solve context? Well, as it turns out, it's probably just more tokens.
Which, you know, Anthropic would love me saying on stage: always be token maxing.
Um and uh you know, maybe that's true to some degree.
In the loop of the read path, so what should I reference, agentic search is very useful. We'll talk about what agentic search is in the next slide. Similarly, answering the question of where should I store something is also agentic search. And again, if you're Claude Code users or Codex users, your models are already doing this today. You watch it, or maybe you don't watch it, but if you do watch it, you see it running these sub-agents and gathering context and understanding where to write information.
So, okay, at long last, what is agentic search? It's quite simple. Agentic search is a loop where a model, we refer to this as a search agent in this case, has access to a set of tools. These are generally search tools. The model also has the ability to stop its own loop, so the model can decide when to stop.
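The loop just described can be sketched in a few lines. This is an illustration under stated assumptions, not Chroma's implementation: `agentic_search`, `keyword_search`, and `toy_policy` are hypothetical names, and a hard-coded policy function stands in for the model that would normally choose the next tool call.

```python
DOCS = [
    "Bishkek is the capital of Kyrgyzstan.",
    "Chroma supports regex search.",
    "Kyrgyzstan borders Kazakhstan.",
]


def keyword_search(term):
    # Toy search tool: case-insensitive substring match over DOCS.
    return [d for d in DOCS if term.lower() in d.lower()]


def toy_policy(question, trace, found):
    # Stand-in for the model: decide the next action from what has been seen so far.
    if any("capital" in f for f in found):
        return ("stop", "")
    queries = ["capital city of Kyrgyzstan", "Kyrgyzstan"]
    if len(trace) < len(queries):
        return ("search", queries[len(trace)])
    return ("stop", "")


def agentic_search(question, tools, policy, max_steps=8):
    """Call tools chosen by `policy` until it returns 'stop' or steps run out."""
    found, trace = [], []
    for _ in range(max_steps):
        tool, arg = policy(question, trace, found)
        if tool == "stop":
            break
        results = tools[tool](arg)
        trace.append((tool, arg))
        found.extend(r for r in results if r not in found)
    return found, trace


found, trace = agentic_search(
    "What is the capital of Kyrgyzstan?", {"search": keyword_search}, toy_policy
)
```

In this run the first query returns nothing, the policy broadens the query, finds the answer, and then stops itself on the next turn. The `max_steps` cap is a safety net for a policy that never decides to stop.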
And then, you may give this model access to various search tools. So, for example, for the model Chroma Context 1, we gave it access to, and it was trained on, hybrid search, which is dense vector search plus sparse vector search, trading off semantic and lexical pros and cons.
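The talk does not say how the dense and sparse results are combined. One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below as a stand-in technique; it should not be read as Chroma's confirmed method. The document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["d2", "d1", "d3"]    # hypothetical semantic (dense vector) results
sparse_ranking = ["d1", "d4", "d2"]   # hypothetical lexical (sparse vector) results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that both retrievers agree on (`d1`, `d2`) rise to the top, while documents found by only one retriever still appear lower in the list.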
We also support grep search inside of Chroma, so you can run regex queries on top of the full-text search engine inside of Chroma. And so, you can grep against your data in Chroma. And then lastly, we gave it a get-document tool.
So, it could get all the chunks and put them back together and just get the document again. I think it just makes a lot of sense. It's very intuitive. It's exactly how we all use Google.
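The grep and get-document tools can be illustrated over a toy chunk store. This sketch uses Python's standard `re` module and an in-memory list; it does not use or imitate the Chroma API, and the chunk layout (`doc`, `pos`, `text`) is an assumption made for the example.

```python
import re

CHUNKS = [
    {"doc": "readme", "pos": 1, "text": "Error code E42 means timeout."},
    {"doc": "readme", "pos": 0, "text": "Install with pip."},
    {"doc": "faq", "pos": 0, "text": "E17 is a quota error."},
]


def grep(pattern):
    """Return every chunk whose text matches the regular expression."""
    rx = re.compile(pattern)
    return [c for c in CHUNKS if rx.search(c["text"])]


def get_document(doc_id):
    """Reassemble a document by joining its chunks in position order."""
    parts = sorted((c for c in CHUNKS if c["doc"] == doc_id), key=lambda c: c["pos"])
    return " ".join(c["text"] for c in parts)


matches = grep(r"E\d+")
full_readme = get_document("readme")
```

A search agent would typically grep for an exact token such as an error code, then call the get-document tool on a promising hit to read the surrounding text.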
We type in a query, we see some blue links, we see some previews of those links, we open up the ones we're curious about, and this is how we manage our own context engineering in our own minds. And I think it makes sense.
This is how, you know, agents are going to solve context engineering as well.
Now, this problem is not purely solved in the sense that there's all kinds of interesting questions of how should the agent balance exploring information and exploiting information. This is kind of a classic information retrieval trade-off. You know, how should the agent understand when a certain retrieval strategy is not producing valuable results? In fact, we see often frontier models today being very bad at this.
They'll do a search, they'll kind of get back crap, and they'll just keep trying that search over and over again with very subtle modifications, but they're not diverging sufficiently to get to other parts of the search space. Other questions are, well, how do you know when to stop?
If you stop too early, you might miss relevant results, but if you stop too late, you're just wasting time and money. And so, there's a lot of interesting complexity in agentic search, and I think it's important to know about this as a builder, but I also think these capabilities are, practically speaking, going to be folded back into the models themselves.
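Both failure modes above can be approximated with simple heuristics. The sketch below is an assumption-laden illustration, not how any particular model decides: `is_stuck` flags a new query that is a near-duplicate of an earlier one using word-level Jaccard similarity, and `should_stop` ends the loop after a run of steps that returned nothing new. The thresholds are arbitrary.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two query strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def is_stuck(queries, threshold=0.6):
    """True if the newest query is a near-duplicate of any earlier query."""
    if len(queries) < 2:
        return False
    latest = queries[-1]
    return any(jaccard(latest, q) >= threshold for q in queries[:-1])


def should_stop(new_results_per_step, patience=2):
    """True once the last `patience` steps each produced zero new results."""
    recent = new_results_per_step[-patience:]
    return len(recent) == patience and all(n == 0 for n in recent)
```

A harness could use `is_stuck` to nudge the agent toward a different strategy, for example switching from semantic search to grep, and `should_stop` to cap wasted time and money.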
So, over like a 6-month or maybe 1-year time horizon, having an intuition for this is probably important as a builder, but actually hand-curating it is probably not important. So, what does this look like?
How do we solve context? Well, we use more tokens, we use sub-agents. And sub-agents are very effective at encapsulating context, which also protects the context of a reasoner model. We don't want to overwhelm or distract the reasoner model, and so sub-agents are very useful for encapsulating context, and models can be trained to do this, like Chroma Context 1.
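A rough sketch of how a sub-agent encapsulates context: it can read as much as it likes in its own scratch space, but only a small, bounded summary crosses back into the reasoner's context. The names `search_subagent` and `reasoner_context` are hypothetical, substring matching stands in for real search, and truncation stands in for real summarization.

```python
def search_subagent(query, corpus, max_return_chars=200):
    """Search privately, then return only a compact finding to the caller."""
    scratch = [doc for doc in corpus if query.lower() in doc.lower()]  # may be large
    summary = " | ".join(scratch)[:max_return_chars]  # crude stand-in for summarizing
    return {"query": query, "hits": len(scratch), "summary": summary}


def reasoner_context(question, findings):
    """Build the reasoner's prompt from compact findings, never from raw documents."""
    lines = ["Question: " + question]
    lines += ["[" + f["query"] + "] " + f["summary"] for f in findings]
    return "\n".join(lines)


corpus = [
    "Enterprise plan costs more. " * 50,
    "Self-serve plan is free.",
    "Unrelated note.",
]
finding = search_subagent("plan", corpus)
prompt = reasoner_context("How do the plans differ?", [finding])
```

The raw corpus here is well over a thousand characters, while the prompt handed to the reasoner stays at a couple of hundred.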
Um, so this is a model we released Apache 2 open source about a month ago.
Um, it is a 20 billion parameter model.
So it's something like 50 times smaller than most frontier models. It also runs at 400 tokens per second on Blackwells.
It runs at 3,000 tokens per second on Cerebras. And that's in contrast to the 40 tokens per second on average that you'll see from Opus. So it's very fast and it's also very cheap.
It's an MoE. It's 20 billion parameters, and so I think it's like a dollar per million output tokens.
In contrast, Opus is $25 per million output tokens for 4.6. And I lost track of what, you know, 4.7, 5.5 cost. They're all even more expensive. And, you know, we've demonstrated that you actually can train small language models to be very good at these tasks. So this is a plot of F1 score, or kind of accuracy, you can reason about it that way, versus the average latency. There's also a graph on our research page that shows F1 score versus cost. And you can see Chroma Context 1 actually defines what we would call the Pareto frontier of accuracy relative to latency and cost.
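To make the speed and price figures concrete, here is a small worked calculation using the numbers quoted in the talk (40, 400, and 3,000 tokens per second; roughly $1 versus $25 per million output tokens). The 6,000-token search trajectory is an assumption picked for illustration, and the figures ignore input tokens, tool latency, and network time.

```python
def generation_seconds(tokens, tokens_per_second):
    return tokens / tokens_per_second


def output_cost_usd(tokens, usd_per_million_tokens):
    return tokens / 1_000_000 * usd_per_million_tokens


TRAJECTORY_TOKENS = 6_000  # assumption: output tokens for one multi-step search

opus_seconds = generation_seconds(TRAJECTORY_TOKENS, 40)          # 150 seconds
blackwell_seconds = generation_seconds(TRAJECTORY_TOKENS, 400)    # 15 seconds
cerebras_seconds = generation_seconds(TRAJECTORY_TOKENS, 3_000)   # 2 seconds

opus_cost = output_cost_usd(TRAJECTORY_TOKENS, 25)       # about $0.15
context1_cost = output_cost_usd(TRAJECTORY_TOKENS, 1)    # about $0.006
```

Under these assumptions the same trajectory takes minutes on the slow model and seconds on the fast one, at roughly one twenty-fifth of the output cost.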
Um, and again, I think this wasn't known to be possible um, before we tried it.
We had a strong thesis this would work and it did. The name Chroma Context 1 sort of implies the existence of Context 2 and 3. Definitely watch for those from us in the near future. There's some pretty exciting stuff coming down the pipe.
The last thing I want to leave you with, before jumping to some predictions about where this might go, is this idea of what happens if the tokens per second goes up. So if the model is no longer, you know, 40 tokens per second, where the language model is really dominating the latency of your application. If the models become much faster. So again, Context 1 today runs at 3,000 tokens per second on dedicated ASICs. Context 1 can run at 15,000, 17,000, 20,000 tokens per second.
Then you change the way you think about your architecture, and you start to want to push down your compute and run it next to your data. For those of you that have done, you know, OLAP stuff before, this idea of doing push-down is not going to be unfamiliar.
But the reason for the intuition here is that you don't want to be paying a lot of network costs. If you're going back and forth across a long network pipe over and over again, you're just wasting money and wasting time. And so this idea of pushing down the search into the data layer is something that we are actively working on and very, very interested in, for obvious reasons.
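The push-down argument can be seen in a simple latency model: total loop time is the number of tool-calling steps times the sum of generation time and network round trip. All the specific inputs below (10 steps, 300 tokens per step, 50 ms remote and 1 ms co-located round trips) are assumptions chosen for illustration; only the 40 and 3,000 tokens-per-second figures come from the talk.

```python
def loop_seconds(steps, tokens_per_step, tokens_per_second, round_trip_seconds):
    """Total time for a search loop: generation plus one round trip per step."""
    return steps * (tokens_per_step / tokens_per_second + round_trip_seconds)


def network_share(steps, tokens_per_step, tokens_per_second, round_trip_seconds):
    """Fraction of the total loop time spent on network round trips."""
    total = loop_seconds(steps, tokens_per_step, tokens_per_second, round_trip_seconds)
    return steps * round_trip_seconds / total


slow_model_share = network_share(10, 300, 40, 0.05)      # under 1% of 75.5 s
fast_model_share = network_share(10, 300, 3_000, 0.05)   # about a third of 1.5 s
fast_remote = loop_seconds(10, 300, 3_000, 0.05)         # 1.5 s
fast_colocated = loop_seconds(10, 300, 3_000, 0.001)     # about 1.01 s
```

At 40 tokens per second the network is noise. At 3,000 tokens per second it is about a third of the wall-clock time in this model, which is why running the search next to the data starts to matter.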
Okay, so we're going to do a quick demo and a quick race here. Last year when I spoke at this conference, I did a demo of teaching Chroma how to play Doom, which unfortunately was a very high bar to set, I think, in my first talk at this conference. So I don't know if we're going to really beat that today, but I do think it's really important to feel this. Again, I've been telling you a bunch of numbers and I've been showing you a bunch of graphs, and we love that stuff. But I think the ability to feel it really changes, and has the chance at least to rewire, how you think about both what is the state of the art today and also where all this stuff is going in the future. So we're going to compare two systems. We're going to run them head-to-head. It's system X versus system Z, the blood of your mediator thing. In the case of system X, this is approximately Opus 4.whatever version you want.
And then system Z is Context 1 running on Cerebras. And so hopefully this video works with the Wi-Fi.
Um so I'm going to pause it and explain what this is. So um on your uh left-hand side, you can see this is the system X.
This is Opus. And then on the right-hand side, this is system Z. This is Context 1 running on Cerebras. The goal of this exercise is to fill up the memory bank. So if you see that top right grid, that is what I'm calling the memory bank. And in order to be successful and move forward to the next step of this system, you need to fill it up. And so the top layer, you'll see pulsing. That's the intelligence running. That's the model itself. The middle layer is the map. Think about it as the index over the raw data. And then the bottom, that's the raw data.
So, the way to think about this is that the middle layer is like the hybrid search that gets chunks. And the bottom layer is like the get document.
Okay. So, uh I'm going to run this. Uh watch closely.
It's going to go by fast. Um all right. We're going to click play.
And here we go.
Okay.
Uh it's going to loop now. I'll let it keep looping uh while I explain what just happened.
So, what just happened is that on the right-hand side, Context 1 on Cerebras with fast index-local search completes the entire task before Opus completes a single round trip.
And again, this is only running at 3,000 tokens per second. Within a few months, you'll be able to run Context 1 at 15,000 to 17,000 tokens per second. Even faster. And I think this speed thing, to me, feels like the largest, call it secular, trend which is not priced in and is not popular on Twitter, but should be important to how you think about building applications. Today, we're very cautious about when we use language models. Oh, it's going to be slow. It's going to take a while, especially in iterative tool calling and loops. It feels risky. And so we try to avoid it. All of that's going to go away. And in the future, we're going to fully embrace the bitter lesson. And the bitter lesson applied to search is just spend more tokens and let the agent figure it out.
So, I'm going to close with three predictions for the future of context.
The first prediction is that context will become continuous. What I mean by this is today, we do a full generation, and then we take a step, we take a pause, maybe we go retrieve some more context. I think in the future there's going to be a steering layer which is continually guiding or steering the reasoning model, both in a pull sense, where the reasoning model can go to this layer and say, "Hey, give me information about this," but also in a push sense, where this layer can say, "Hey, interrupt. You kind of forgot about this. Please pay attention to this, you're missing something."
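The pull and push interactions can be sketched as two methods on one object. This is a speculative illustration of the idea, not an existing system: `SteeringLayer` is a hypothetical class, topics are assumed to be lowercase keywords, and keyword matching stands in for whatever a real steering layer would use to notice that the reasoner is missing something.

```python
class SteeringLayer:
    def __init__(self, facts):
        # facts maps a lowercase topic keyword to a piece of stored context.
        self.facts = facts

    def pull(self, topic):
        """Pull: the reasoning model asks for information about a topic."""
        return self.facts.get(topic)

    def push(self, partial_output, already_used):
        """Push: interrupt with facts the output touches on but has not been given."""
        text = partial_output.lower()
        return [
            (topic, fact)
            for topic, fact in self.facts.items()
            if topic in text and topic not in already_used
        ]


layer = SteeringLayer({
    "pricing": "Enterprise pricing changed in March.",
    "roadmap": "The roadmap doc predates the latest board deck.",
})
pulled = layer.pull("roadmap")
interrupts = layer.push("Drafting the pricing summary now", already_used=set())
```

In the pull case the reasoner initiates the lookup. In the push case the layer watches the partial output and volunteers the pricing fact that the reasoner never asked for.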
The second prediction is that it's going to be extremely fast. We've already been demonstrating this in the visualizations in the prior slide.
Small language models are going to be an incredibly important part of the stack and will be the dominant tool that builders use to build systems that have powerful context abilities, both, again, on the read path but also on the write path, figuring out where should I write this information that I just learned.
And then lastly is continual learning.
Everyone loves to talk about continual learning. There's incredibly cool and advanced stuff that the labs are working on in this regard. But I think in practicality, over the next, call it one, two, or three years, as a builder you should view continual learning as living at the context layer.
So you won't be updating the weights of your reasoning model, you won't be fine-tuning GPT 5.8. But you very well will be adding additional knowledge to your context system that teaches the system about how to do things and where to look for things. And additionally, one of the things that's really promising to us about Context 1 is that it costs merely dollars and merely minutes to hours to fine-tune.
It's incredibly cheap to fine-tune. And so, you know, when we think about how systems learn over time and how they integrate that knowledge, both their knowledge on the document storage side but also their intuition on the weights side, I think that will be solved at the context layer. So, in closing, we believe that context will be solved this year. Thank you for your time, and say hello at our booth.