Patrick correctly identifies that while GPUs grab the headlines, the actual "agency" of AI relies on the brute-force density of server CPUs to handle logic and orchestration. It is a pragmatic reminder that even the most advanced autonomous systems are ultimately limited by the physical constraints of data center plumbing.
Deep Dive
Get Smart and Make Money Learning About Server CPUs for Agentic AI
Server CPUs have gone from the doghouse to being ultra important in the world of Agentic AI. Whether you're excited or frightened by the prospects of Agentic AI, you have to get into the CPU game now. Here's why. On June 3rd, 2026, Matthew Prince, the CEO of Cloudflare, said that over half of web traffic was now being generated by AI bots. At the end of the day, you can pretend that it's not happening, but this is impacting everybody who runs servers.
So, let's get to it. Hey guys, this is Patrick from STH. If you've been anywhere around data center computing over the last few months, you've probably heard that there's a lot of emphasis now on server CPUs. At STH, we've been building Agent STH, a new benchmark suite to really test CPU performance in the world of Agentic AI.
We've also been working with companies to understand how Agentic AI is really impacting their workloads. In addition, I've been doing this for 17-plus years, guys. So I thought it was important to give a level set, because so many armchair experts have sprung up over the last few weeks and months. It's insane some of the things that folks are saying out there. That's why I wanted to get a video out and explain what's going on in the world of Agentic AI, specifically in relation to CPUs, because there's a lot of really weird information out there. This is an industry-wide thing, but we're going to say that this is sponsored by AMD because we're specifically going to be using some CPUs that AMD sent. And if you get nothing else from watching this video as an STH reader or viewer, I want you to really think about how you can use this new workload, and all the new load that it's generating, to advance your career and make money off it, because that's really what it's about. If you know the drivers behind it, you're going to be able to prepare your organization or other organizations for this giant infrastructure buildout and make lots of money in the process.
So that's really the point. Let's get into why server CPUs have become such a hot topic and how they're used in Agentic AI, because a lot of folks are missing half of the equation or more. In a modern data center, CPUs are absolutely everywhere.
Sometimes they're attached to GPUs, and everybody knows this as the GPU attach. The idea of the CPU there is to ensure that data is prepared, and sometimes to use the CPU's memory as expansion capacity, to keep data flowing to the GPUs so the GPUs aren't just sitting there waiting for stuff. One of the craziest things, by the way, is that folks are saying the CPU to GPU attach rate is like 8 to 1. I've heard this number so many times now from experts. That has definitely not been the predominant scale-out GPU to CPU ratio, even in just the AI servers.
So I have no idea where some of these numbers are coming from. But in data centers, there are a lot more CPUs than just the ones sitting in the GPU clusters, or specifically in the GPU nodes themselves. For example, you can have CPU nodes that are running your storage. You can also have virtualization, Kubernetes, or other container nodes that are just application hosts. And by the way, there are even server CPUs, especially embedded ones, in things like network switches and sometimes even network adapters. So there are server CPUs absolutely everywhere in the data center.
But let's start with what's really happening in an agentic workflow, and let's use OpenClaw. Of course, we use OpenClaw, Hermes, all kinds of different things. First off, that agentic framework has to run somewhere, and it's not a GPU, it's a CPU. So you install OpenClaw. For example, here are the simple commands to do a basic install and just get it running. The next step is to set up your security profiles, of course, but then you're also going to want to set up access to different resources for that agent. How are you going to provide permissions for that agent? One of the ways of thinking that has really helped us is treating these AI agents as though they're kind of like contractors. You're not going to give a contractor coming into your business unfettered access to every single system. That would be crazy. Instead, you're going to limit access and tailor which resources that agent can reach to its specific role.
Now, after you've gotten all this set up, the next step is to connect an LLM. This is the part that everybody's been super excited about when it comes to AI and Agentic AI: how fast the LLMs are, what all the new models are, all that kind of stuff.
But those are generally running on GPUs, and they're accessed via an API. Those API calls from the agentic framework are, of course, being run on CPUs. That's something to keep in mind, because when we talk about CPU load, there's a bunch going on here. The heavy AI compute is always going to sit on a GPU or accelerator. It's not really going to sit on the CPU as much anymore, but the CPU is always going to be a big part of it.
One of the big reasons for this is that you really want to run as much of your workload deterministically as you can. When folks get started with agentic workflows, they'll oftentimes just say, "Hey, go set up a system," and let the LLM make all the commands. Whenever you do that, one of the challenges is that the LLM makes little mistakes here and there. Those mistakes happen the first time, and then when it comes back later to do something similar, guess what? We forgot a quotation mark or whatever, and it's happening again. So something you will very quickly find is that you want the LLM to write scripts that deterministically fire off a portion of whatever the workflow is. You may end up inserting variables with the LLM later, but the idea should be that you're using the LLM to make scripts that do a lot of the work. That way you can trust the results a lot better than if you have the LLM generate these commands or requests willy-nilly every single time.
If you don't believe me, just use frontier models and say, "Hey, let's go set up a couple of servers," give them sudo access, and have them develop the commands themselves and spit them back out instead of using scripts. You will definitely notice that if you set up 100 systems like that, a huge portion of them will fail. Now, I will say the LLMs are getting better. It used to be that we would see like 40% of the calls fail. Now we're down to maybe 25-ish percent. So it's getting better, but it's certainly nowhere near zero. Even if you were at 5%, that's way too much failure on these things. You should always have the LLM spit out the command and then fill in the variables whenever it needs to.
And let's be clear, modern LLMs will 100% see that they made an error. They'll go back, set something up again, and usually fix it the second time. But that's always one failure plus one more setup. You lose time, you're burning tokens, and you just don't want that in your workflow, because what happens the day that something really important gets messed up in there and you lose access to a machine or something like that?
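The "script with variables" idea described here can be sketched roughly as follows. This is a minimal illustration in Python, not how OpenClaw or any other framework actually does it: the `useradd` template, the username rule, and both function names are hypothetical. The command structure is reviewed once and fixed, and the only thing the model supplies is a value that gets validated before it is substituted in.

```python
import re
import shlex
from string import Template

# Hypothetical, pre-reviewed command template. The structure is fixed;
# only the named variable may change between runs.
ADD_USER_TEMPLATE = Template("useradd --create-home --shell /bin/bash $username")

# Hypothetical validation rule for the one value the model may supply.
USERNAME_PATTERN = re.compile(r"[a-z][a-z0-9_-]{0,31}")


def render_add_user(username: str) -> list[str]:
    """Validate the model-supplied value, then fill in the fixed template."""
    if not USERNAME_PATTERN.fullmatch(username):
        raise ValueError(f"rejected username: {username!r}")
    # Return an argv list so the command never has to pass through a shell.
    return shlex.split(ADD_USER_TEMPLATE.substitute(username=username))


def expected_first_try_failures(systems: int, failure_rate: float) -> float:
    """Expected number of setups that fail on the first attempt."""
    return systems * failure_rate


if __name__ == "__main__":
    print(render_add_user("agent_runner"))
    # With the roughly 25% failure rate quoted above, 100 improvised setups
    # would be expected to need about 25 retries.
    print(expected_first_try_failures(100, 0.25))
```

The sketch only builds the argument list; running it, logging it, and deciding which templates an agent may use are left out. The point is that a malformed value is rejected up front instead of turning into a half-applied command.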
Letting the model improvise every command is the wrong way to go. So with Agentic AI, one of the most important concepts, and something you hear a lot of folks who work in the industry talk about, is how CPUs are often doing tool calling. One of the reasons for that is to get a deterministic, highly reliable way to call functions that you need to work every time. Now, of course, there are a number of ways that you can host these tools, depending on which people and which agents you want to have access to them, right?
You could host them on a bare metal system, which is probably not the way that most folks would do it these days, but you could also run them in containers or even virtual machines. A lot of times that's what a VPS is. And something we're seeing now is a lot more short-lived sandboxes being built and then destroyed right away. That's really changing the way these things are being used.
So when people talk about agentic AI performance, especially on the CPU side, there are a couple of things that you really need to keep in mind. Remember that when we hit the LLM that's running on a GPU somewhere, there's a whole bunch of things you can do on that side, but on the CPU side, the CPU performance really matters. There are a ton of folks in the industry who think this agentic stuff is some kind of magic. But the more workloads we've actually profiled, the less I feel like it's magic, and the more it looks like something you can profile and understand. Now, all workloads are different, so that's of course a challenge, but let me give you some high-level thoughts on how this whole thing works.
The first thing to understand is something called IPC, or instructions per clock. This is a simple metric for how much work a CPU can do in a given clock cycle. Of course, a lot of folks talk about clock speed and gigahertz and all that kind of stuff. The idea there is that you have a certain number of instructions that you can run and retire each clock, and then you run those clocks. So you can do a clock cycle so many times per second, right?
That's kind of the basics of how CPU performance works. When we talk about instructions per clock, a lot of times we're talking about a theoretical value, because what you really need is for the data plus the instruction to go into a core, run, and then have the output go to the next stage. Of course, guys, this is super high level, but we're going to just keep it there.
Because you need both the data and what to do with the data in the CPU core, you can think of it like this: you have stuff in memory, and then you have stuff that's sitting in cache. Big caches tend to help because they hide that memory access. And then finally, there's how fast the actual execution core is. By the way, the ideal state is always that every time a new clock cycle starts, the CPU has the data and instruction ready to be loaded and executed. That way you don't lose any theoretical performance because the data isn't there to compute on.
Engineers have spent decades trying to figure out how to keep these fast execution cores fed efficiently. The next important thing, of course, is how many cores you have.
If you just put this mental model in place, performance comes down to having a core that's just churning. You have to get data into that core, but as long as that thing can churn, it's going to put out a whole bunch of stuff. Your clock speed says how many times per second you can churn on stuff. And the number of cores you have is how many times you get to replicate that across a single socket. So guys, this whole Agentic AI CPU thing that everybody thinks is black magic essentially comes down to keeping your cores fed, so they can maximize actual versus theoretical IPC, and having as many cores as you possibly can, so you can do as much work per socket and per server as possible.
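That mental model can be written down as a back-of-the-envelope calculation: cores times clock times the IPC you actually achieve. The sketch below is only an illustration of the model. The `SocketModel` name, the `fed_fraction` knob, and every number in the example are assumptions made up for this sketch, not measurements or specs of any real CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketModel:
    cores: int           # physical cores in the socket
    clock_ghz: float     # sustained clock speed in GHz
    peak_ipc: float      # theoretical instructions per clock, per core
    fed_fraction: float  # share of cycles where data and instructions are ready (0 to 1)

    def effective_ipc(self) -> float:
        """Actual IPC once stalls from waiting on memory are accounted for."""
        return self.peak_ipc * self.fed_fraction

    def giga_instructions_per_second(self) -> float:
        """Rough socket throughput: cores x clock x effective IPC."""
        return self.cores * self.clock_ghz * self.effective_ipc()


if __name__ == "__main__":
    # Illustrative numbers only: a dense socket versus a higher-clocked one.
    dense = SocketModel(cores=192, clock_ghz=2.0, peak_ipc=4.0, fed_fraction=0.5)
    fast = SocketModel(cores=64, clock_ghz=4.0, peak_ipc=4.0, fed_fraction=0.5)
    print(dense.giga_instructions_per_second())  # 768.0
    print(fast.giga_instructions_per_second())   # 512.0
```

Real cores do not stall by a single fixed fraction, so treat this as a way to reason about which lever matters (more cores, more clock, or better feeding through cache and memory), not as a performance predictor.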
Once your organization really starts getting into Agentic AI workflows, something you'll notice very quickly is that the more experience you and your organization get, the better you become at identifying new things for agents to do. That is really what is driving this whole Agentic AI CPU thing. People are saying, hey, this is a net new workload. It's not necessarily some kind of crazy thing, but it is something that is typically done by CPUs. So if you have more workloads and more agents doing things, that is a net new workload, and that's creating this demand for CPUs. It's also creating a lot of the challenges on the supply side that we have in the data center. Let's call it what it is. Now, of course, the next thing that's going to happen is the same thing that happened when people went from phone orders and faxes to EDI, or from people standing in pits doing stock trading to electronic systems. At some point, these agents are going to talk to one another. And that's where a lot of these interactions and commerce exchanges are going to happen.
When you start getting agents talking to one another, speed is going to matter, just like it does in all of the other domains where we've gone from human to human, or human to machine, to machine to machine. It's the same thing that's happened in every one of these domains previously. But that's really just running the agents. What people don't talk about that much is the other side of the equation. So let's go back to that June 3rd, 2026 Matthew Prince comment, because the other side is that those agents are also hitting other systems. Web servers have been around for decades, and largely a web server is a solved problem. For many years, people have just been figuring out how to serve web pages at the absolute lowest power and lowest cost. But this is a new problem, because the growth on those web servers is no longer a normal growth curve. If there are these new AI agents going out and hitting your web servers, and hitting everybody's web servers, then all of a sudden the challenge of a web server is: do I have enough capacity anymore, especially looking forward a couple of months? And it's not just web servers. It could be ERP, database servers, or many other internal applications. There are so many domains where these can hit.
What we're starting to see is that companies are getting hit by Agentic AI in their internal applications, and that's causing all kinds of crazy things. What happens if this is our license cost to service humans, but now our traffic goes up by 4x or something like that? What are we going to do? This is a great opportunity for all the STHers out there. It's also another reason that companies heavily in the AI space are rushing to secure server CPUs right now. They know that not only do you have Agentic AI as the net new workload that's ramping like crazy, but there's this secondary effect on everybody's existing systems, and on anything the agents are creating as new systems. These things are all growing like crazy, and when they grow like crazy, you need more server CPUs. That application that's been running at 15% utilization on an 8-core VM for eons, and maybe spikes to 70% at the end of the quarter, is now at like 100%, and you have to figure out what you're going to grow into. That's why people need more server CPUs. This is a huge opportunity. Luckily, if you've been following STH for the last 17-plus years, you probably have the skills to help either your organization or other organizations navigate this. So I think this is a huge opportunity for anyone watching this video. Now that we've talked about the background and how to conceptualize this stuff, let's talk about what good looks like today.
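The 8-core VM example can be turned into a quick capacity check. This is a rough sketch under one big assumption, that CPU demand scales linearly with traffic, which real applications rarely do exactly. The function names and the 70% target utilization are hypothetical choices for illustration.

```python
import math


def projected_utilization_pct(current_util_pct: float, traffic_multiplier: float) -> float:
    """Utilization on the SAME hardware if load scales linearly with traffic."""
    return current_util_pct * traffic_multiplier


def cores_needed(current_cores: int, current_util_pct: float,
                 traffic_multiplier: float, target_util_pct: float = 70) -> int:
    """Cores required to absorb the new traffic at the target utilization."""
    demand = current_cores * current_util_pct * traffic_multiplier
    return math.ceil(demand / target_util_pct)


if __name__ == "__main__":
    # The VM from the example: 8 cores, 15% typical, 70% at quarter end, 4x traffic.
    print(projected_utilization_pct(15, 4))  # 60: typical load still fits
    print(projected_utilization_pct(70, 4))  # 280: the quarter-end peak does not
    print(cores_needed(8, 70, 4))            # 32 cores to hold that peak at 70%
```

In this simplified view the everyday load still fits on the existing VM, but the quarter-end peak needs roughly four times the cores, which is the kind of growth the capacity discussion above is pointing at.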
I was at Dell Tech World recently.
Dell flew me out, so full disclosure. One of the servers that we saw there was from the PowerEdge 7000 series, which of course is their 2U servers. In there, you can have two AMD EPYC SP5 processors. The reason that matters is that each SP5 socket can run up to 192 cores with the AMD EPYC 9965, and that's currently the highest density CPU that you can get. That 9965 has 192 cores but 384 threads. When you have two of them in a system, that's 384 cores or 768 threads. Frankly, a 2U two-socket server is not really the type of thing that you deploy if you just want maximum density. But I'm using it here for a very important reason. Folks are saying that around 8,000 cores per rack is a super exciting number. But you can actually do that today with Zen 5c.
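To check that rack-level claim, here is a small sketch of the arithmetic. The per-socket figures come from the EPYC 9965 numbers above, and the rack layout (a 48U rack with 4U set aside for networking, filled with 2U two-socket servers) is the one used in this video's walkthrough. The function name and parameter names are made up for this example.

```python
def rack_density(rack_u: int = 48, reserved_u: int = 4, server_u: int = 2,
                 sockets_per_server: int = 2, cores_per_socket: int = 192,
                 threads_per_core: int = 2) -> dict[str, int]:
    """Count servers, cores, and threads that fit in one rack."""
    servers = (rack_u - reserved_u) // server_u
    cores = servers * sockets_per_server * cores_per_socket
    return {"servers": servers, "cores": cores, "threads": cores * threads_per_core}


if __name__ == "__main__":
    print(rack_density())
    # {'servers': 22, 'cores': 8448, 'threads': 16896}
```

So 22 servers at 384 cores each works out to 8,448 cores and 16,896 threads, which clears the roughly 8,000 cores per rack figure with today's parts. Power and cooling limits per rack are ignored here and may be the real constraint in practice.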
You can set up these servers. Let's say you have a 48U rack, you use 4U for networking and what have you, and you use 44U for compute. That means you can have 22 systems. Each one of those 22 systems has two CPUs, and each one of those CPUs has 192 cores. That gives you over 8,400 cores and almost 17,000 threads in that rack. And that's using today's technology. Guys, there's another generation that AMD has already talked about, which is their Venice generation, and those numbers will go up significantly. This is really just showing how dense we can get server CPU cores and threads for running AI agents. At the same time, you have to remember that this can also run the second-order systems, all the existing things, whether those are your web servers or your ERP systems. You may also want to mix it up. Maybe you don't want all of them to be Zen 5c cores. Maybe you want high-frequency Zen 5 cores because you want the highest performance per core possible. That's why we covered this so extensively in our 2026 server buyer's guide, which you'll find linked below, by the way.
One of the things that's happening this year is that memory and NAND prices continue to go up, so the later in the year you buy, the more you pay. Making purchases earlier in the year is saving folks a lot of money. We've seen companies actually blow their entire server budget in Q1 this year just to protect themselves against rising prices.
So I thought I should give a little bit of an addendum on what you should be thinking about this year when you hear all this stuff about server CPUs for Agentic AI, and what that really even means. First off, let's talk about per-core performance, because that was actually a challenge with Agent STH when we were developing that benchmark. What is a core?
I know folks know what a core is, but what happens when you have an SMT core with two threads in it? If you say a core does not include that SMT, then you're missing a big feature of that core. On the other hand, what do you do when you're benchmarking and you have one core and two threads? If you run two threads on it, what a lot of folks will do is say that's the performance of that single core. One core, two threads.
But what you would do, of course, if you don't have SMT is say that your per-thread performance is so much better, because an SMT core is only maybe 20% or 30%, whatever it is, better than that same core running a single thread. So as you're evaluating performance claims in this new era, just remember that for a core that has SMT, you should be looking at the performance of the core with its two SMT threads. Looking at just one thread on that core, or at the per-thread performance, will get you weird numbers that are usually not right.
Something else that is super important is performance per socket, because it tends to dominate a lot of the cost discussion. Think about it: you have sheet metal, power supplies, motherboards, boot drives, all kinds of things in a server that you can't share and would have to replicate if you had a whole bunch of systems. Per-socket performance is a huge number in the industry, or at least it should be. Now, memory bandwidth bound workloads are actually pretty common in the industry.
Let me give you an example. If you do web hosting, or run a lot of applications where you're using Redis, for example, that is a lot of times dominated by memory performance. HPC applications are also famous for being memory bandwidth bound. In the industry, you'll commonly see the STREAM benchmark used because it's become the benchmark for memory bandwidth. That's one of the reasons, by the way, that you'll see STREAM in almost every vendor's marketing material for new generations of processors: it's a super easy way to show more gain out of a CPU versus the previous generation, especially compared to just making cores a little bit faster.
Let's get to why. A server CPU's memory bandwidth is largely dominated by two things: one, how fast the memory and the memory channels are, and two, how many memory channels, or how many pipes, you have going into a server CPU socket. There are some cases we've seen recently where there have been some weird things around that, but on the vast majority of server CPUs, that's the number one thing that impacts your STREAM performance.
So let's stick with AMD examples. A really good one was when we went from the AMD EPYC 7003 series Milan to the AMD EPYC 9004 series Genoa. On the Milan generation, we had eight channels of DDR4-3200. Then when we went to Genoa, we got 12 channels of DDR5-4800. Of course, there are some other things that can impact that, but going from 3200 to 4800 is a difference of 1,600, which means we got essentially 50% more bandwidth per channel. Then our channel count went up by 50%, because we went from 8 to 12 channels. The impact is about two and a quarter times more memory bandwidth in Genoa versus Milan.
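The Milan to Genoa comparison works out as follows. This sketch computes theoretical peak bandwidth only, assuming the standard 64-bit (8-byte) data path per memory channel. Measured STREAM results come in lower than these peaks, and the function and constant names are made up for this example.

```python
BYTES_PER_TRANSFER = 8  # assumes a 64-bit data path per memory channel


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak bandwidth per socket in GB/s (not a measured STREAM result)."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000


def generational_ratio(old_channels: int, old_mts: int,
                       new_channels: int, new_mts: int) -> float:
    """How many times more peak bandwidth the newer platform has."""
    return (new_channels * new_mts) / (old_channels * old_mts)


if __name__ == "__main__":
    print(peak_bandwidth_gbs(8, 3200))            # Milan: about 204.8 GB/s
    print(peak_bandwidth_gbs(12, 4800))           # Genoa: about 460.8 GB/s
    print(generational_ratio(8, 3200, 12, 4800))  # 2.25
```

The 2.25x figure is just 1.5x from the faster memory multiplied by 1.5x from the extra channels, which is why a memory-bound benchmark shows a bigger generational jump than core speed alone would.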
Getting a single CPU core to go up by 2x in performance on just raw computation in a single generation is something that we really haven't seen. I don't know if it's impossible, but it's really, really hard to do. That's a really good reason that folks use memory bandwidth and STREAM in the industry, because that is how you show that you have outsized gains. Every vendor does this, by the way. And the more STREAM-like, memory-bandwidth-bound workloads you put into a benchmark set for a new generation of processors, the higher the performance generally looks in a geomean, because there's new, faster memory coming out all the time that you're taking advantage of.
That brings me to key lessons learned: not just what you can do to profit from this, but how to think about it. Many companies have been doing the exact same thing for the last year or two, maybe even longer. They're saying, hey, we largely have a server infrastructure that we don't need to upgrade, because the old servers run web hosts no problem, so why would we need to upgrade that? That was an okay way to think. But now, with Agentic AI really taking off, those applications are being hit. That's exactly the Matthew Prince quote we pulled up earlier. The first thing in this video was really just to show that web hosts are getting hit. Even for those applications that everybody thought were solved, we've already seen a crossover point where Agentic AI is hitting them harder than humans are. So that's a really good use case where you're going to run into scaling issues at some point in the not-too-distant future, especially if you've been using old web hosts and just haven't been upgrading. At the same time, it's not just legacy applications that are being hit.
It's also this new Agentic AI workflow. Of course, you have your LLMs, and you need a strategy for generating tokens at a relatively low cost. But the other side is that you need CPUs, because at the end of the day, so many of those applications go back and use CPUs. This is a net new workload because it's a net new type of work that's being done, and a lot of folks just don't have the capacity to do all this. That's a super important thing. If you want high density today, we've shown you a Dell option. If you use the AMD EPYC 9965, you can get over 8,400 cores per rack, which is just insane. And by the way, in the second half of 2026 and early 2027, you're going to see that number go up even more. Whether you run large infrastructure at work, you do consulting today, or maybe you've just been playing around with a home lab and you're wondering how to turn that home lab into a marketable skill, guys, this is the way to do it. I'd love to hear what you guys think. Let me know down in the comments and in the forums. To me, this is just so much opportunity. By the way, if you did like this video, why don't you share it with your friends and colleagues? Also, give it a like, click subscribe, and turn on those notifications so you can see whenever we come out with great new videos. As always, thanks for watching. Have an awesome day.