AI agents require three critical infrastructure components to succeed: high-quality inference (the same model can perform differently across providers because of differences in the serving software), robust tool calling (in the example shown for Anthropic models on OpenRouter, about 55% of requests include tool definitions, and the model uses those tools about 83% of the time), and reliable tool-call success rates (malformed tool calls can stall agents, and success rates vary significantly by provider). OpenRouter's data shows that agentic token usage has already overtaken human usage, and that agents burn far more tokens than anticipated because of heavy context loads: tool definitions, MCP gateway definitions, skill front matter, and reasoning steps. The key insight is that the hard part of running agents in production is no longer just picking a good model; inference quality, tool-call success, and failover are core architectural considerations.
Deep Dive
28 Trillion Tokens a Week: What OpenRouter's COO Sees About AI Agents
I'm the COO and co-founder of OpenRouter. We are the world's largest AI gateway. We work with about 70 different model providers and hundreds of different models, and we help make sure that your agents succeed and that it's easy and delightful to build on top of AI. We do things like automatic failover, rate limit management, unified billing, and unified APIs. Basically, you can integrate once with OpenRouter and know that, come what may in the world of AI, whether it's this lab or that cloud, this modality or that modality, it'll work. So we preserve that optionality for you through time and give you a single pane of glass for observability. I think we'll process about 28 trillion tokens this week, which is quite a few more than I think Salesforce has processed in the lifetime of the entire company. So there's some real scale here. We are probably doing something like 1% of total global inference. As a result, we have a great viewpoint all over the world on what models are being used for what. A whole lot of institutions use us, and maybe some people here have looked at our data as well. We publish rankings data for all the different models and all the different use cases that you can go and look at.
It's totally free, to help customers and users of AI, and even model labs, make better software, build better models, and help the AI ecosystem move forward. About 50% of the business is in the US and 50% is in the rest of the world, so we feel really good that a lot of the trends reflected on OpenRouter are in fact true global trends.
Um and they're coming from a lot of models.
This is all of the models we launched in 2025.
It's totally insane. There are a lot more than this that existed. These are the ones we actually launched. So we figure these out, launch them, set up the API. And a lot of these labs and a lot of these folks are building towards what's next, right? That's why all of the labs continue researching and continue making new models: they're trying to actually make AI a reality. We spent a couple of years in the post-ChatGPT moment with all these companies, maybe some of the people in this room, maybe bosses of people in this room, pounding the table and saying, "Go make AI work for your company."
And it's kind of hard, right? We were chatting with things, maybe had some custom data in a project in ChatGPT. But suddenly, in the last couple of months, it started working. All of a sudden, people are like, "Oh my god, I can actually ask an agent to do something and it's actually going to work." And we see agentic token usage taking off and all of a sudden overtaking human usage. We can see this in our data, and this is a real trend. Even if we didn't have the data, everybody can feel this anyway, right? There's something else happening as agentic usage takes off, which is that it's way more expensive and burns way more tokens than I think many people anticipated. This is an average turn from an agentic request, which often has a whole lot of context. It's not just somebody chatting with a little extra data. It might have a ton of tool call definitions, MCP gateway definitions, or skill front matter. So you end up with a lot of tokens. You have reasoning and tool calls going back and forth, versus what was happening when humans were interacting largely in a chat modality.
So, what do agents need to be successful? We're going to answer that question in a second. I'm doing a really complicated live demo at the same time.
So, I'm going to start a demo in the background that we're going to return to later.
I'd say there's like a 30% chance this works. Okay.
So, agents are chugging ahead using all these tokens, and they need a bunch of stuff to be successful. One of the things they need is high-quality inference. So, here is a benchmark, GPQA x16, run against GPT-OSS 120B, which is a popular open-weight model. This was done by the fine folks at Artificial Analysis, good friends of ours. I'm making them throw our providers under the bus. We're not doing it. It's not OpenRouter. It's all Artificial Analysis.
They benchmarked the same model, GPT-OSS 120B, across all of these different providers. And it shows that performance on the same benchmark from the same model varies by provider. So inference quality can vary depending on where you are sourcing those tokens from. And this is not quantization. That could be a factor, but it's generally not what's actually happening. Think about a model as a pile of math and a bunch of numbers running on a GPU: the model weights.
There's a lot of software between the math, the model weights, and actually serving a valid API. There are a lot of ways to misconfigure things, introduce bugs, parse tool calls incorrectly, or do all kinds of wacky stuff that creates variability in the actual performance of that model, even when the weights are identical. We don't view this as adversarial. There's a lot of chatter like, "Oh, people are trying to monkey with the weights so the models don't perform as well." We don't really see that. We see a lot of clouds and a lot of providers trying to serve high-quality inference, and sometimes it doesn't work out because there are issues in the implementation. But for agents to succeed, they need high-quality inference.
And they also need to call tools, right?
This is the trend in the percentage of requests. The chart ends in, I think, March, but the trend continues through April. Just looking at Anthropic models on OpenRouter, what percentage of those requests contain tool definitions?
Or sorry, this is tool-call finish reasons: the model actually used the tool. And you can see this number is inexorably marching upwards. It really maps to how agents are using these things. They need tool call definitions to work successfully. So, we see in this case like 55% of requests ask for tools.
The model used those 83% of the time.
And so, 46% of the time, the reason that the LLM returned a result was a tool call. So, models need high-quality inference and they need to call tools a whole heck of a lot. All right. Now, we're doing my really complicated demo.
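The 46% figure is the product of the two numbers quoted just before it. A quick check, assuming the rounded percentages from the talk (the variable names are illustrative, not from the talk):

```python
# Rounded figures quoted in the talk (Anthropic models on OpenRouter).
share_with_tool_definitions = 0.55  # requests that include tool definitions
tool_use_rate_when_offered = 0.83   # of those, how often the model calls a tool

# Share of all requests whose response ends in a tool call.
share_ending_in_tool_call = share_with_tool_definitions * tool_use_rate_when_offered

# Prints the product as a whole-number percentage.
print(f"{share_ending_in_tool_call:.0%}")
```

Because the inputs are rounded, the product only needs to agree with the quoted 46% to the nearest percentage point.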
I'm going to pause this. We'll go back.
This will all make sense later.
All right.
And the tool calls need to succeed. So, this is actual data from us. This is for GPT-OSS 20B. It's sort of an obscure model I picked on purpose because it's going to demo a little bit easier. I'm not trying to throw anybody under the bus, but let's look at how many tool calls were successful. A tool call is a JSON result from an LLM that says, "Please call the tool with these parameters."
If the JSON's not valid, if it made up a tool call name, if the parameters weren't right, then the tool call failed. And you see there's quite a bit of variability in how successful tool calls were. The different providers are also asking for the tools a different share of the time, which generally is due to, again, weird bugs in infrastructure. So, we detect all this and help agents route around it. If you're using GPT-OSS 20B, it should simply succeed and you kind of don't have to worry about it, because we're going to solve a lot of these problems for you under the covers.
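The three failure modes named here (invalid JSON, a made-up tool name, wrong parameters) can be sketched as a small validator. This is a minimal illustration, not OpenRouter's detection logic: the `TOOLS` registry, the `get_weather` tool, and `check_tool_call` are hypothetical, and a production check would more likely validate against a full JSON Schema.

```python
import json

# Hypothetical tool registry: tool name -> expected parameter names and types.
TOOLS = {
    "get_weather": {"city": str, "days": int},
}

def check_tool_call(name, raw_arguments, tools=TOOLS):
    """Return None if the tool call is usable, else a short failure reason."""
    # The model made up a tool name.
    if name not in tools:
        return "unknown tool name"
    # The arguments are not valid JSON.
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(args, dict):
        return "arguments are not an object"
    # The parameters are missing, mistyped, or unexpected.
    spec = tools[name]
    for param, expected_type in spec.items():
        if param not in args:
            return f"missing parameter: {param}"
        if not isinstance(args[param], expected_type):
            return f"wrong type for parameter: {param}"
    unexpected = set(args) - set(spec)
    if unexpected:
        return f"unexpected parameter: {sorted(unexpected)[0]}"
    return None
```

Counting how often `check_tool_call` returns a reason, per provider, gives the kind of success rate the chart compares.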
All right. So, where does that leave us? It leaves us with the very risky live demo, which everybody's waiting for with bated breath.
All right. So, what I've done here is we have made a ton of tool calls to GPT-OSS 20B, the one we just talked about. We made 213 tool calls. We got one error.
This is basically a demo environment we created, but it's making real, live LLM calls.
It's giving a prompt and some fairly complex tool call schemas and seeing how frequently these calls succeed and don't succeed. We managed to get one error. I was hoping to get a couple of errors, but one of the problems working with LLMs is they are not deterministic, so I can't make them cough up errors on demand. All right. So, OpenRouter.
This is a console application that is running through OpenRouter.
And it is running against a preconfigured preset that I have that says only ever send these requests to Amazon Bedrock and always use GPT-OSS-20B.
Basically, you can use this as a model slug, as I do in the demo. So, all of my code is pointing to this model slug and saying, "Hey, use this provider and use this model." And I can edit this, right? You can store your configurations here, and I can say, "Hey, instead of using Bedrock, let's use Fireworks."
I can save this.
And I can rerun the exact same code, but this time it'll route all the requests to Fireworks.
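The idea in this step is indirection: the application refers to a preset by name, and the provider behind that name can change without touching the code. A minimal local sketch of that pattern follows. It is not OpenRouter's API; the `Preset` class, the `presets` dictionary, and the preset name `tool-call-demo` are hypothetical stand-ins for configuration stored on the gateway side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    provider: str
    model: str

# Hypothetical stand-in for configuration stored on the gateway side.
presets = {"tool-call-demo": Preset(provider="Amazon Bedrock", model="gpt-oss-20b")}

def resolve(preset_name):
    """Look up which provider and model a preset currently points to."""
    return presets[preset_name]

def run_demo(preset_name="tool-call-demo"):
    # Application code only knows the preset name, never the provider.
    target = resolve(preset_name)
    return f"routing {target.model} via {target.provider}"

before = run_demo()

# Edit the stored configuration; run_demo itself is unchanged.
presets["tool-call-demo"] = Preset(provider="Fireworks", model="gpt-oss-20b")
after = run_demo()

print(before)
print(after)
```

The design choice is that switching providers becomes a configuration edit rather than a code change, which is what lets the same demo code be rerun against a different provider.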
And in an ideal world, we won't get any errors because Fireworks has an extremely low error rate. So, you can see the provider is now Fireworks.
And we're going to start jamming tool calls over towards Fireworks, and we'll see how many of them are valid and how many times the model actually chooses to use tools or not.
In a normal environment, you would simply be sending requests to OpenRouter for this particular model and we would figure all this out for you. I'm trying to induce a couple of errors and then induce fewer errors the next time around. But when you use this live, we monitor all of this in real time across thousands of API endpoints. If you have tool calls, or different parameters that your agents are using, we will help that agent actually succeed by not sending it to providers that are having problems, whether that's uptime or all sorts of other issues. There's a variety of things that we do, like guardrails, accuracy tracking, throughput tracking, cost management, and normalizing tool calling across different API shapes, that allow your agents to be successful on OpenRouter versus going directly to the underlying inference. So the clock has now hit zero. We've had zero errors.
I'm going to stop it quickly before something changes, but we saw a couple of errors from a provider that often coughs up errors, and then when we switched to a different provider, exact same model, exact same API, the errors sort of disappear because of the underlying implementation. So, OpenRouter will abstract this away for you and your agents will be more successful.
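Routing around providers that are having problems can be sketched as tracking an error rate per provider and preferring the ones under a threshold. This is only an illustration of the idea under simple assumptions; it is not OpenRouter's routing logic, and `ProviderStats`, `pick_provider`, the 5% threshold, and the provider names and counts are all hypothetical.

```python
from collections import defaultdict

class ProviderStats:
    """Running counts of calls and failed calls per provider."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, provider, ok):
        self.calls[provider] += 1
        if not ok:
            self.errors[provider] += 1

    def error_rate(self, provider):
        calls = self.calls[provider]
        # Providers with no history are treated as error-free.
        return self.errors[provider] / calls if calls else 0.0

def pick_provider(stats, providers, max_error_rate=0.05):
    """Prefer providers under the threshold; fall back to the least bad one."""
    healthy = [p for p in providers if stats.error_rate(p) <= max_error_rate]
    candidates = healthy or providers
    return min(candidates, key=stats.error_rate)

# Made-up observations: provider-a fails one call in ten, provider-b never fails.
stats = ProviderStats()
for i in range(100):
    stats.record("provider-a", ok=(i % 10 != 0))
for _ in range(100):
    stats.record("provider-b", ok=True)

chosen = pick_provider(stats, ["provider-a", "provider-b"])
print(chosen)
```

A real gateway would also weigh uptime, throughput, and cost, as the talk mentions, and would likely use a sliding window rather than lifetime counts.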
Thanks. That's it. We have a booth. Go say hi. Happy to answer questions, etc. Great to see everybody.