Stanford MS&E435 Economics of the AI Supercycle | Spring 2026 | Infrastructure, Capstone Case
Added: 2026-05-28

217 views | 1246:06 | stanfordonline | Original Release: 2026-05-27

The economics of AI infrastructure are driven by compute capacity as the primary constraint, with revenue serving as a lagging indicator of compute utilization. As AI models evolve from simple chatbots to agentic systems requiring complex compute graphs, the infrastructure landscape is shifting toward heterogeneous compute architectures that combine GPUs, CPUs, and specialized accelerators like Cerebras. The transition from training-heavy to inference-heavy workloads (projected at 80%+ of future compute) is reshaping infrastructure economics, with concentrated gigawatt-scale data centers favored over distributed edge compute due to economies of scale and the current latency requirements of agentic workloads. This infrastructure evolution is creating significant supply chain challenges, with ASML machines serving as a critical choke point, and is expected to drive long-term value toward foundational infrastructure layers rather than application wrappers.

[00:00:09]Hello everybody. It's good to see you again. Another round of the Economics of the AI Supercycle, this time with Professor Katti. Yes.

[00:00:20]>> Welcome back. Welcome back to Stanford.

[00:00:22]>> Thank you. Thank you. You know, as I thought about introducing Professor Katti, I could not think of a better person, someone who has seen it all soup to nuts, from electrons and the entire substrate all the way to agents. You started a networking startup. You were the CTO and head of AI at Intel. You now run compute, industrial compute, at OpenAI.

[00:00:49]Thank you. Thank you for joining us.

[00:00:51]>> Thank you. It's coming back home for me.

[00:00:53]So, >> welcome.

[00:00:54]>> Welcome.

[00:00:55]>> You know, I thought we'd start with a fun segment. Intel spent about a decade trying to convince everybody that they're an AI company.

[00:01:02]>> It finally happened. They finally got there in the last like two weeks. What happened?

[00:01:08]>> I told you so. That was my job. As April was mentioning, I was Intel's CTO and was also running its AI business until I left for OpenAI in November. So yeah, a little bit of a lag, but that's the bane of people who have to forecast. But no, I think Intel's story is turning. I'd say there are two factors that are big tailwinds for Intel.

[00:01:38]>> One is the world is heavily supply-constrained.

[00:01:42]>> Mhm. And so any company that has serious manufacturing chops in the space and can build, >> not just design, is going to have tailwinds, right? And Intel obviously is pretty much the only leading-edge American company left that can still manufacture.

[00:02:01]>> The other, of course, is that CPUs are making a comeback with how we are beginning to use AI with agents, and we can get into that a little bit later. So both of those are obviously very good things for Intel. A lot of execution still to be done. I mean, the market is always ahead of >> the story, but fingers crossed. Lip-Bu is a great CEO. Loved working with him. So I think there are good things to look forward to.

[00:02:30]>> Amazing. I'm sure your departure had nothing to do with the stock chart.

[00:02:33]They're not correlated.

[00:02:34]>> I still kept my stock. So don't worry.

[00:02:37]>> Well done. Well done. Well, we'll get more into the role of CPUs and all the different parts of the compute supply chain. You know, the second thing I thought we'd spend some time on is this chart that OpenAI put out at the start of the year. Sarah Friar, the OpenAI CFO, wrote this article about OpenAI's compute ambitions. And on the left you'll see OpenAI's compute capacity over the last three years.

[00:03:00]This is OpenAI's target by the end of the decade. And, almost magically, the compute capacity seems to be hypercorrelated with our revenue.

[00:03:13]No prizes for guessing what this might be if this gets there.

[00:03:16]>> Yeah, talk about this chart for a second. What's going on here, and how should we process this?

[00:03:23]>> Yeah, so at OpenAI I lead industrial compute. Just some context before I answer the question. My job and my team's job is delivering the compute that OpenAI needs across everything, across training and inference. So this chart is what I live and breathe every day, making the numbers go up. >> And yes, my job is to make it go up and to the right. That's the job description. But kidding aside, as you pointed out, revenue is basically a lagging indicator for frontier lab companies. >> And what I mean by that is it's basically a very simple calculation of how much compute we have and how well utilized that compute is.

[00:04:10]>> Mh.

[00:04:10]>> Right. And so the last three years have borne that out. Every year we have tripled compute year-over-year.

[00:04:16]>> Right.

[00:04:16]>> And revenue has tripled.
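As a rough illustration of the point above (not something shown in the talk), revenue as "how much compute we have and how well utilized it is" can be written as a toy model. The utilization and revenue-per-gigawatt values are made-up placeholders, not OpenAI figures; only the year-over-year tripling comes from the conversation.

```python
def revenue(compute_gw: float, utilization: float, revenue_per_utilized_gw: float) -> float:
    """Toy model: revenue = available compute x utilization x revenue per utilized GW."""
    return compute_gw * utilization * revenue_per_utilized_gw


# Hypothetical placeholders, chosen only to make the arithmetic visible.
UTILIZATION = 0.8
REVENUE_PER_UTILIZED_GW = 10.0  # arbitrary units per utilized gigawatt

compute_gw = 1.0  # arbitrary starting capacity
history = []
for year in range(3):
    history.append((year, compute_gw, revenue(compute_gw, UTILIZATION, REVENUE_PER_UTILIZED_GW)))
    compute_gw *= 3  # compute triples year-over-year, as stated in the talk

for year, gw, rev in history:
    print(f"year {year}: compute {gw:.0f} GW, revenue {rev:.1f}")
```

With utilization and pricing held fixed, revenue in this sketch triples whenever compute triples, which is the sense in which revenue lags compute.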

[00:04:18]>> Mhm. We don't see any end in sight to the correlation yet. With 5.5 just coming out and the uptake, I mean, Codex has probably seen meaningful double-digit growth in just the two weeks >> since 5.5 came out. People are not using it just for coding anymore. Codex is being used for general-purpose knowledge work. So token usage is growing, with more and more complex tasks being consumed. We essentially track how much compute we have available, and the number of users, the number of tokens, and therefore the revenue basically track that. >> As we think about the future, OpenAI is still a research lab. >> And the reason I say that is it's very much not just about how much revenue we can maximize. It's much more about how we make the maximum amount of compute available for research, so that researchers are unconstrained in exploring new ideas, new models, and new ways of pushing the frontier of intelligence. >> And so the 30 gigawatt number here, which is an aspirational goal, is split across research and products, but we definitely don't see a world where we don't utilize it, given the current trends that you're seeing. >> And maybe just a quick follow-up before we move on. What's the rough split between training and inference? How has that trended over time, and how do you expect it to go over time?

[00:06:01]>> Think about the scaling laws. Initially everyone assumed scaling laws applied to pre-training only. What has shifted is that scaling laws have evolved to cover the entire life cycle of compute, and what I mean by that is pre-training, then post-training with RL, >> which is primarily an inference workload.

[00:06:21]>> Mhm.

[00:06:22]>> Synthetic data because we have run out of real world data to train models on.

[00:06:27]So we are generating data to train models on, and that is primarily an inference workload.

[00:06:32]Mhm.

[00:06:32]>> And then of course the actual products themselves, everyone using ChatGPT and Codex, and that is an inference workload.

[00:06:39]>> So more and more it is shifting to inference. I mean inference is already the majority just to be clear.

[00:06:46]>> But even inference should not be taken to mean just products. A big chunk of research a big chunk of training the next level of intelligence is also inference. Mhm.

[00:06:58]>> And so that's why our prediction is that a supermajority, like 80% plus, will essentially be inference compute >> right >> in the future. >> Building on your response, Sachin, is it also true that if the share of compute used for inference goes up over time, the dollar density, meaning dollars per gigawatt, might also go up, because inference is basically what you can monetize? >> Well, we are hoping it goes down, dollars per gigawatt, because this stuff is expensive. Every gigawatt is roughly >> I meant monetization, sorry. Monetization. >> Yes. Yes. I think yes, for sure. As more tokens get consumed, that should lead to a corresponding increase in revenue. At the same time, our mission is to make tokens cheaper, and it's along a few different dimensions: make every token cheaper, make every token more intelligent, and make every task require fewer tokens to perform. So we push on three dimensions. We keep improving hardware and software to generate tokens more cheaply.

[00:08:12]>> We keep pushing on the capabilities of models to make sure every token is more intelligent.

[00:08:18]>> Mhm.

[00:08:18]>> Right. And we keep pushing the harness, like Codex, so that we need fewer tokens to perform any given task. Right.
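The three levers described here (cheaper tokens, more intelligent tokens, fewer tokens per task) can be combined into one cost-per-task figure. The sketch below is only an illustration: using a task success rate as a stand-in for "intelligence per token" is an assumption made for this example, and every number is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskEconomics:
    cost_per_token: float   # dollars per token (hardware and software efficiency)
    tokens_per_task: int    # tokens the harness spends on one attempt at a task
    success_rate: float     # assumed proxy for "intelligence per token", in (0, 1]

    def cost_per_completed_task(self) -> float:
        # Failed attempts still burn tokens, so the expected spend per
        # successful task is the per-attempt cost divided by the success rate.
        return self.cost_per_token * self.tokens_per_task / self.success_rate


# Hypothetical before/after values, one improvement per lever.
baseline = TaskEconomics(cost_per_token=2e-6, tokens_per_task=100_000, success_rate=0.5)
improved = TaskEconomics(cost_per_token=1e-6, tokens_per_task=60_000, success_rate=0.8)

print(f"baseline: ${baseline.cost_per_completed_task():.3f} per completed task")
print(f"improved: ${improved.cost_per_completed_task():.3f} per completed task")
```

Because the three factors multiply, a modest gain on each lever compounds into a much larger drop in cost per completed task.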

[00:08:28]>> And that is a very fundamental principle of how the company operates. And the reason, of course, is making sure that all of this intelligence is as widely accessible as possible.

[00:08:43]>> Now, your job, as you said, is to get the numbers to go up and to the right. Honestly, I don't envy anybody who's forecasting tripling year-over-year at that scale. That seems like a hard job. >> Forecasting is an easier job than actually making it happen.

[00:09:01]What?

[00:09:02]>> You think so?

[00:09:04]>> Tell us about your job a little bit. What is the hardest part of it? Is it sourcing the compute?

[00:09:11]Is it securing it? And how are you securing compute right now? It seems like a fist fight. And where's the bottleneck? Is it power? Is it energy? Is it chips? Is it land?

[00:09:22]Is it >> All of the above. If you think about the life cycle of compute, one piece is obviously sourcing compute, and compute is a very broad term. When you think about compute for AI, you really have to think chips, memory, networking, power, cooling, data center buildings, power generation, power distribution, and of course land.

[00:09:56]All of that is equal to compute, right?

[00:09:59]All of that needs to come together to build compute at a gigawatt scale, right?

[00:10:04]>> And so when we think about sourcing, we are not sourcing compute. We are literally sourcing that entire supply chain.

[00:10:11]>> Mhm.

[00:10:11]>> And making sure, at this scale, that we have visibility into where each component of that supply chain will come from.

[00:10:20]>> Mhm.

[00:10:20]>> So that is one big piece. The second piece is how we orchestrate that supply chain so that everything lands and aligns at the same time to make this compute operational. A gigawatt is roughly half a million GPUs.

[00:10:36]>> Right? And so when we're talking about whatever number it is, six or 10 gigawatts, you're talking about quite a large number of chips being networked together, being powered, being cooled, being kept up and alive, with everything else that needs to come together in place. So a big chunk of the work really starts after you sign the contracts. How do we make sure that our suppliers are actually going to deliver what they said they will? How do we make sure that we engineer these systems so that it all works together at this scale?
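To make the scale concrete, the rule of thumb from the talk (a gigawatt is roughly half a million GPUs) can be turned into quick arithmetic. This is a back-of-the-envelope sketch, not a sizing method; the "all-in" watts per GPU here simply divides site power by GPU count, so it lumps cooling, networking, and other overheads in with the chip itself.

```python
GIGAWATT_W = 1_000_000_000
GPUS_PER_GIGAWATT = 500_000  # "roughly half a million GPUs" per gigawatt, from the talk


def all_in_watts_per_gpu(site_watts: int, gpu_count: int) -> float:
    """Site power divided by GPU count, overheads included."""
    return site_watts / gpu_count


def gpus_for(gigawatts: float) -> int:
    """Approximate GPU count for a given site capacity, using the rule of thumb."""
    return round(gigawatts * GPUS_PER_GIGAWATT)


print(f"all-in power per GPU: {all_in_watts_per_gpu(GIGAWATT_W, GPUS_PER_GIGAWATT):.0f} W")
for gw in (6, 10, 30):
    print(f"{gw} GW -> about {gpus_for(gw):,} GPUs")
```

Under this rule of thumb, the 30 gigawatt aspiration mentioned earlier in the conversation corresponds to roughly 15 million GPUs.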

[00:11:13]>> And how do we make sure that it is operationally usable like it is up and running and runs at the highest performance uh we can run these chips at. And these chips are very brittle today.

[00:11:26]>> Very sensitive to cooling and power fluctuations. And they can quickly throttle back how much compute, how many FLOPS, you have.

[00:11:35]>> So that's really the job. It's uh the fun part is the contract signing.

[00:11:40]>> The hard part is everything after.

[00:11:42]>> Yeah. Yeah. Yeah. Getting the autographs.

[00:11:44]>> Yes.

[00:11:45]You know, it's a very consequential time right now, and I imagine a lot of the decisions you're making will impact us and the rest of compute users, which is, you know, billions of people, for years if not decades to come. What are some of the biggest trade-offs you're making? What are some of the biggest decisions you're making, ones that will make case studies at some point in the future, that you can talk about?

[00:12:12]I think there are a lot of societal-level implications of these decisions. To pick an example, if you put a gigawatt data center in a place like Georgia or Michigan, >> it's a pretty big consumer of the grid >> in terms of that amount of power. And when you run a big training job, these things are synchronized jobs, right? They go up and down in intensity in sync.

[00:12:43]>> Mhm.

[00:12:43]>> So you can see energy fluctuations on the grid that can be hundreds of megawatts very quickly.

[00:12:49]>> Mhm.

[00:12:50]>> And our infrastructure was never designed for it.

[00:12:53]>> Mhm.

[00:12:53]>> A grid could basically fall apart and an entire state could have a blackout >> depending on how these data centers behave. Right. So a lot of time we spend thinking about how do we make sure we can design these systems to not have all this collateral damage on the rest of the country's infrastructure.

[00:13:14]>> Mhm.

[00:13:15]>> Uh so that's an example of the kinds of things that are being redesigned.

[00:13:19]>> Right.

[00:13:20]>> Uh we obviously are spending a lot of time thinking about how to derisk supply chains, >> right? So how do we move fabs? How do we move memory factories to other parts of the world?

[00:13:35]>> Uh how do we decouple from grid energy and use natural gas and increasingly nuclear in the future?

[00:13:42]>> Mhm.

[00:13:43]>> So I think this is going to lead to infrastructure investments and innovations >> that the rest of society will benefit from beyond AI, >> because these are things that otherwise did not have an impetus to happen. Mhm.

[00:13:56]>> Then I'd say obviously all the implications of AI itself and compute at this scale.

[00:14:02]>> Uh I mean 30 GW is a lot, >> right?

[00:14:05]>> But I'd say our vision is, and Sam has been talking about this for a while: we've all taken it for granted that every one of us should have a mobile phone, and we upgrade it every year or every two years. It's not that crazy to think every one of us should have a GPU.

[00:14:25]>> Mhm.

[00:14:26]>> Right. And a GPU is what, a kilowatt to 2 kilowatts now?

[00:14:30]>> 7 billion humans out there.

[00:14:32]>> That's 7 terawatts of compute, >> right? And so that is two orders of magnitude more than what we are talking about here. So if you really believe in that world, then we still have a long way to go.
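The arithmetic behind this claim is easy to check. The sketch below uses the figures quoted in the conversation (7 billion people, about a kilowatt per GPU at the low end of the quoted range, and the 30 gigawatt target); the variable names are illustrative only.

```python
import math

PEOPLE = 7_000_000_000
WATTS_PER_GPU = 1_000           # low end of the "kilowatt to 2 kilowatts" range
TARGET_W = 30_000_000_000       # the 30 gigawatt aspirational goal

total_w = PEOPLE * WATTS_PER_GPU        # one GPU per person
ratio = total_w / TARGET_W              # how many times larger than the target
orders_of_magnitude = math.log10(ratio)

print(f"one GPU per person: {total_w / 1e12:.0f} TW")
print(f"ratio to 30 GW: {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```

At one kilowatt per GPU this gives 7 terawatts, a little over 230 times the 30 gigawatt figure, which matches the "two orders of magnitude" description.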

[00:14:48]>> Right. And maybe just to put this in perspective, Sachin, how much energy does America consume compared to 30 gigawatts?

[00:14:56]>> I don't have the number off the top of my head, but I think the US, if you add up all the hyperscalers, is planning to build around 100 gigawatts of compute.

[00:15:06]>> Got it.

[00:15:08]>> Beyond OpenAI. So 30 gigawatts from us, plus whatever everyone else builds.

[00:15:12]>> You've seen Google's numbers, Amazon's numbers.

[00:15:15]100 gigawatts is probably already a fifth or more of the grid.

[00:15:20]>> So this will be consuming a double-digit percentage of US capacity.

[00:15:25]>> Wow.

[00:15:27]>> Making the market >> It will change the market, right? The way we think about energy as purely for human consumption is no longer true.

[00:15:37]>> Yeah. You know, one of the rumors that's been going around is that OpenAI has a significant compute advantage compared to the other labs. The class here loves both OpenAI and Anthropic equally, sort of.

[00:15:53]>> Are we polling?

[00:15:55]>> We did that, and we'll save you the answer. I'll fill you in after.

[00:16:01]But talk about that compute advantage, to the extent that you can share with us. Assuming forecasting was perfect, what does that afford us, and what does it let us deliver to consumers of OpenAI? >> I mean, you're seeing it, right? 5.5 is a big model, and it's expensive to serve, >> but there are no limits, >> right? Everyone's able to go and use it without token limits. We are much more generous on how many tokens you get for your subscription.

[00:16:37]>> Mhm.

[00:16:38]>> Almost every day or every week we reset the limits so that people can play with it a lot more.

[00:16:45]>> Yeah.

[00:16:46]>> And that's the compute advantage showing up in day-to-day usage. Right. And so that comes back to that earlier point which is making sure that we have enough compute to distribute this intelligence at scale. Not just build the intelligence.

[00:17:01]>> It's no good if you build the intelligence but you can't really deliver it at scale.

[00:17:05]>> Yeah.

[00:17:05]>> So really we spend a lot of time in making sure that it's not just about training.

[00:17:10]>> Yeah.

[00:17:11]>> It's actually usable compute that we can uh deliver uh to everyone at scale without putting >> artificial limits.

[00:17:19]>> 100%. One of the impacts that the class has already felt: we asked two labs for a Codex and an unnamed product subscription. The Codex team gave us that pretty quickly, so I now understand why that was the case. We'll switch it up a little bit, Sachin. Codex and Codex-like instruments have a lot of different things that need to come together: the GPU and all sorts of ASICs, the CPU, the memory, the networking, all the things that you outlined for us. Maybe start with the workload in question. What does the modern agentic workload look like? How has that evolved over time? >> Maybe just to frame the answer: ChatGPT was obviously a big inflection moment. But if you think about ChatGPT when it started, it really was one-shot inference, >> right? You ask a question, it gives you an instant answer, and you're done and you go to the next thing.

[00:18:23]>> I think the big innovation and the breakthrough in 2024 was reasoning, right?

[00:18:28]>> And not just for inference but also for training. So the model is able to introspect and think, and therefore generate better answers, >> and that again increased the intensity of inference, so there's more and more inference happening. >> But these are still passive things, right? You're asking a question, they're giving you an answer, and they don't take any action for you. >> So the word agent kind of encodes what we mean: it has agency, agency to do things. What I mean by that is, when we think about agents, it's really about closing the loop: not just thinking and suggesting, but also trying it, looking at the output, iterating, and then trying a refined answer to do a task, whether it's coding or any other form of knowledge work. >> So it's really closing the loop, and it's actually delivering the full value of what we expect AI to deliver to you: not just being an assistant, but actually being an agent that can close the loop and do work for you.

[00:19:35]>> Right?

[00:19:36]>> And so implicit in that statement is obviously inference and thinking, but also, as I said, trying. It's going to go look for relevant data. It's going to go search. It's going to spin up a VM to run a test if it has generated some code. It's going to spin up Excel or PowerPoint to try out some slides and see how they look, >> right? And it's going to look at the output, reason about it, and iterate on this. And so the compute graph is a lot more complex now, >> right? Putting my computer science hat back on, if I think about the chatbot world, it's a very simple compute graph. There's a user, there's one node, which is the inference call, and there's an answer.

[00:20:22]Reasoning was multiple nodes of inference calls, >> and now we have much more of a directed acyclic graph, to use the more precise technical term. You have an inference call, you might have a tool call, you might have a database or a search query, you might have an RL VM environment spun up, then back to an inference call, and so on. >> So the compute graph that you're executing is now a lot more complex. >> And so that naturally leads to a much more sophisticated compute infrastructure that's needed to execute that compute graph. A lot more intelligence is needed in how you distribute that compute graph and where you run each part of the graph.
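The contrast between the one-node chatbot graph and an agentic compute graph can be sketched with Python's standard-library `graphlib`. The step names and dependencies below are invented for illustration and do not describe any real product's internals; a real agent iterates, so the graph here is best read as one unrolled pass through the loop.

```python
from graphlib import TopologicalSorter

# Chatbot world: a single inference call between the user and the answer.
chatbot_graph = {"answer (inference)": set()}

# Agentic world: each key is a step, each value is the set of steps it depends on.
agent_graph = {
    "plan (inference)": set(),
    "search query": {"plan (inference)"},
    "tool call": {"plan (inference)"},
    "run tests in VM": {"tool call"},
    "review (inference)": {"search query", "run tests in VM"},
}

# A topological order is one valid schedule that respects every dependency.
order = list(TopologicalSorter(agent_graph).static_order())
print(" -> ".join(order))
print(f"chatbot nodes: {len(chatbot_graph)}, agent nodes: {len(agent_graph)}")
```

Steps with no dependency on each other, such as the search query and the tool call here, could be dispatched in parallel and to different kinds of machines, which is where the scheduling intelligence described above comes in.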

[00:21:05]>> And so both the compute but more importantly the workload evolving in this direction is going to change the shape of how we think about compute infrastructure.

[00:21:17]>> Fascinating. You outlined a bunch of different steps along the way. I can imagine some parts of that being more relevant for different machines, like a GPU, and other parts for CPUs and ASICs.

[00:21:31]Are there emerging clusters of workloads that are particularly suited to a certain kind of chip? You might say, hey, the Nvidia GPU is best for that.

[00:21:40]You might say the Cerebras chips are best for this kind of workload, because, you know, agents come in all different shapes and sizes. You've got customer service chatbots where latency is a prime requirement, as opposed to a deep research query where it's not latency but accuracy and broad search that matter.

[00:21:57]>> Yeah.

[00:21:57]>> Um are there clusters forming in in your view?

[00:22:01]>> Definitely. And maybe to use a slide.

[00:22:04]Yeah. This is what I was talking about earlier, right? So this is kind of a way to visualize what's happening right >> in a typical agent call.

[00:22:12]>> Mhm. Uh I guess this was a tongue-in-cheek slide that I had made.

[00:22:15]Uh today if you look at agents right you give it a task it goes off tries to do it uh thinks for a while tries a bunch of tools and then you have context switched.

[00:22:28]>> Mhm.

[00:22:29]>> You're going off doing something else because it's taking minutes, maybe even hours, to do it. >> You spaced out. >> You spaced out, and then when it comes back and asks you for a steer or a decision, you have to page all that context back in, and then you do whatever you do. >> And so our vision is we want to get to a world where the human is the bottleneck. >> Today the AI is the bottleneck, given how long it takes to execute all this. We will really have succeeded from a compute perspective >> when we have built the systems and the infrastructure such that the human becomes the bottleneck, when the AI is finishing these things so quickly.

[00:23:05]>> Mhm.

[00:23:06]>> That you are constantly being asked for what's the next step.

[00:23:10]>> Mhm.

[00:23:10]>> And that is a tongue-in-cheek point, but the better way to say it is: how do we make sure the human is in flow >> when they're doing this work with AI? There's this feeling of flow when everything's so quick and interactive, and it's like it knows exactly what you need and does it quickly.

[00:23:27]>> That's a user experience we'd love to deliver. And so as we think about this future, we do need heterogeneous compute. You can't actually >> deliver this kind of experience economically on pure GPU-based compute.

[00:23:44]>> Okay.

[00:23:44]>> So you need a much more heterogeneous infrastructure that's not just GPUs and CPUs but also different kinds of accelerators.

[00:23:51]>> So Cerebras is an example; that is for very fast inference. Mhm.

[00:23:55]>> Uh you might have other accelerators that are built for very long context like they hold a lot of state in memory.

[00:24:02]>> So they can remember your entire task and don't have to page it back in and out.

[00:24:07]>> So >> for example, that could be useful >> for coding for sure, right? They have to hold your entire GitHub project in context and be able to pull it up very quickly. So you are going to see a lot more heterogeneity in the underlying infrastructure, because the user experience is going to push us >> towards optimizing every part of this agentic graph.

[00:24:30]>> And what we, as people who have to build compute, have to do is make sure we can match the right part of the workload >> to the right kind of compute >> to optimize for both efficiency and performance.
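Matching each part of the workload to the right kind of compute can be pictured as a small routing function. Everything in this sketch is hypothetical: the pool names, the step kinds, the 200,000-token threshold, and the choice to send tool calls, VMs, and searches to CPUs are assumptions made for illustration, loosely following the categories mentioned in the conversation (GPUs, CPUs, fast-inference accelerators, long-context accelerators).

```python
from enum import Enum


class Pool(Enum):
    GPU = "general-purpose GPU"
    CPU = "CPU"
    FAST_INFERENCE = "fast-inference accelerator"
    LONG_CONTEXT = "long-context accelerator"


LONG_CONTEXT_THRESHOLD = 200_000  # hypothetical cutoff, in tokens


def pick_pool(kind: str, context_tokens: int = 0, latency_sensitive: bool = False) -> Pool:
    """Choose a compute pool for one step of an agentic graph (illustrative rules only)."""
    if kind in {"tool_call", "vm", "search"}:
        return Pool.CPU  # assumption: non-model steps run on CPUs
    if kind == "inference":
        if context_tokens > LONG_CONTEXT_THRESHOLD:
            return Pool.LONG_CONTEXT
        if latency_sensitive:
            return Pool.FAST_INFERENCE
        return Pool.GPU
    raise ValueError(f"unknown step kind: {kind}")


print(pick_pool("inference", context_tokens=400_000).value)
print(pick_pool("inference", latency_sensitive=True).value)
print(pick_pool("vm").value)
```

A production scheduler would weigh cost, availability, and queue depth as well; the point of the sketch is only that different nodes of the same graph can land on different hardware.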

[00:24:42]>> Right. Fascinating. So this is going off script for a second, off-roading. You know, yesterday was a big day for earnings calls, with a lot of hyperscalers talking about their accelerator programs. Amazon notably is at roughly $50 billion of run-rate revenue on their Trainium chips.

[00:25:02]I forget the Alphabet number, but that's a bigger number. Yes.

[00:25:05]>> And then obviously you've got the big guy, Nvidia. If you were to draw a market share chart, it looks heavily in favor of Nvidia right now. I'm sure there are all sorts of other ASICs that have not even seen the light of day yet. Obviously, the guidance from Nvidia is we're going to do everything. The guidance from the others is similar. How do you expect this to trend? Is there one or two that you're a particular fan of?

[00:25:31]Outside of, obviously, the main workhorse. >> You know I'm not going to answer that, right? But kidding aside, I think the world needs a much more resilient compute supply chain.

[00:25:48]>> Mhm.

[00:25:49]>> I think it is dangerous for the world to be single-threaded on any one component.

[00:25:55]>> Right. Um and so I think that is what the market is reflecting. Right. So we are going to see quite a bit of uh choices.

[00:26:05]>> Mhm. And the workload is also going to push it there because the workload is getting a lot more complex than a pure inference or training job on a GPU, right? And so that is going to lead to flexibility.

[00:26:16]>> I'd say the other underappreciated part, which I don't know whether everyone will appreciate, >> Mhm.

[00:26:26]>> the way TSMC allocates wafers.

[00:26:28]>> Mhm.

[00:26:29]>> will mean that there have to be multiple GPUs and accelerators. >> Say more about that.

[00:26:35]>> I think TSMC has been extremely successful because they try to make sure that multiple customers are successful. Okay.

[00:26:45]And it is in their business interest to do so, because they don't want to be single-threaded on any one big customer. Right.

[00:26:51]>> Right. And that is a single choke point in the supply chain, >> and so the way those wafers get allocated, there will be multiple companies that get wafers there, and by definition, therefore, there'll be multiple varieties of chips.

[00:27:07]>> Mhm. And so for the scale we are talking about, for the scale any one of us is talking about, Google, Amazon, us, whoever, by definition we have to learn how to use all of these chips, because we don't have a choice. >> Right. >> And so that's why I think the world will look a lot richer in the future. >> Fascinating. Maybe the other dimension, Sachin, is training. The shape of the training workload, as you said, is fairly synchronous. It's typically coordinated, you need a coherent cluster, and it goes up all at the same time.

[00:27:41]Inference on the other hand does not seem that way. It's likely much more spiky, a lot harder to forecast maybe.

[00:27:47]And as that changes, you might even want more compute closer to the edge to minimize latency for inference. Talk about that for a second.

[00:27:56]How do you manage the shape of your compute capacity, knowing that you're moving towards an inference-heavy workload?

[00:28:03]Does that mean more distributed, almost Cloudflare-like mini clusters closer to the edge, or is a giant one in Texas or Virginia good enough? >> It will get there, but it's not there yet, for two reasons. One is there are still significant benefits to scale in building this compute. Building 50 megawatts of compute is far more expensive per megawatt than building a gigawatt of compute at one location.

[00:28:34]>> Fascinating.

[00:28:35]>> Uh and >> on a per unit basis >> on a per megawatt basis. Got it.

[00:28:39]>> Right. And that's for many reasons.

[00:28:40]Right. So labor is a big bottleneck around the world today, especially in the US. We just don't have enough people to build these things. So given the kind of critical human mass you need to build, >> you would much rather do it at a bigger scale than for little bits of 50 megawatts spread around the country.

[00:28:59]>> Mhm.

[00:29:00]>> So that, I think, is going to drive the economics. The other, technical reason is the way these models work, especially for agentic workloads. The time to first token is still on the order of 400 to 500 milliseconds, because they have to page all of this context in before they generate the first token.

[00:29:20]>> And so 400 to 500 milliseconds is far larger than any latency benefit you get by putting compute closer to the user.
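A quick calculation shows why edge placement buys little while time to first token dominates. The 450 ms figure is the midpoint of the 400 to 500 ms range quoted in the talk; the two network round-trip times are hypothetical values picked for illustration.

```python
TTFT_MS = 450.0  # midpoint of the 400 to 500 ms time to first token from the talk


def time_until_first_token_ms(network_rtt_ms: float, ttft_ms: float = TTFT_MS) -> float:
    """User-perceived wait: network round trip plus model time to first token."""
    return network_rtt_ms + ttft_ms


central = time_until_first_token_ms(60.0)  # hypothetical RTT to a distant large site
edge = time_until_first_token_ms(10.0)     # hypothetical RTT to a nearby edge site
improvement = (central - edge) / central

print(f"central: {central:.0f} ms, edge: {edge:.0f} ms")
print(f"edge placement saves {improvement:.1%} of the wait")
```

With these assumed round-trip times, moving compute to the edge trims the wait by under 10 percent, which is small next to the cost advantage of building at gigawatt scale described just before.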

[00:29:28]>> Got it?

[00:29:28]>> Right. And so to me that also means that this will push us towards more concentrated clusters of compute for inference, >> still for some time. This will change as we figure out how to distill very intelligent models to be small >> and potentially run closer to you.

[00:29:47]>> Got it.

[00:29:48]>> Uh but at this point the economics don't favor it.

[00:29:52]>> Got it. A follow-up on that. Could you break down the 500 milliseconds into the different components of that call, from the time that we press the enter button on ChatGPT? If you were to allocate that 500 milliseconds, who's using it up? How much budget is each part of the stack allocated? >> I'd say the 500 milliseconds didn't even include some of the other components that you were talking about. But for example, >> you ask a query on Codex, and it's running off a project, right? It is going to take that prompt and combine it with your codebase. That's the entire context for that model.

[00:30:33]>> Mhm.

[00:30:34]>> To get technical for a minute, there's the prefill phase of running the inference. It basically has to run that entire context, which could be >> hundreds of megabytes.

[00:30:46]>> Mhm.

[00:30:46]>> Like, our Codex models now are 400k context, right? So there's 400k tokens.

[00:30:51]>> Mhm. 400k tokens have to be computed through the attention mechanism before the first output token is generated.

[00:30:59]>> Right?

[00:31:00]>> And so that is basically the model paging in all the context relevant to that task before it spits out even the first output token.
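To give a feel for why prefill over a long context is expensive, the sketch below counts token-to-token attention scores under a textbook assumption: dense causal self-attention, where each token attends to itself and every earlier token. The talk does not describe the actual model architecture, so this is an order-of-magnitude illustration only, with "400k" taken to mean 400,000 tokens.

```python
def causal_attention_pairs(context_tokens: int) -> int:
    """Number of (query, key) pairs in dense causal self-attention.

    Token i attends to tokens 1..i, so the total is 1 + 2 + ... + n = n(n + 1) / 2.
    This is per attention head and per layer.
    """
    return context_tokens * (context_tokens + 1) // 2


short_context = causal_attention_pairs(4_000)
long_context = causal_attention_pairs(400_000)

print(f"4k tokens:   {short_context:,} pairs")
print(f"400k tokens: {long_context:,} pairs")
print(f"growth: about {long_context / short_context:,.0f}x for 100x more tokens")
```

Under this assumption, a context 100 times longer requires roughly 10,000 times as many attention scores, all of which are computed before the first output token appears.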

[00:31:09]>> Right?

[00:31:10]>> And so that's the several hundred milliseconds of latency.

[00:31:13]>> After that you can add other stuff, right?

[00:31:15]>> Prefill, the first part is >> prefill. This is prefill. And so after that you can add the other sources of latency. It usually could just be your app >> turning your prompt into tokens that are sent to the cloud >> and load balanced onto the appropriate GPU to run the model. All of that is going to add maybe tens of milliseconds of latency.

[00:31:36]>> So that's why I was saying that the first-token generation latency is higher than all the other sources of latency.

[00:31:43]But an interesting side effect um when we brought Cerebras in and we rolled out Cerebras earlier this year uh it started generating tokens so much faster >> that all of these other latencies that we had in the system in the app in the way our API works started to become prominent.

[00:32:02]>> Mhm.

[00:32:02]>> And so when we improved one layer of the stack it forced us it actually showed up all the inefficiencies that we had in the rest of the stack. And so we had to do a lot of engineering to fix those latencies. We literally published a blog post on this yesterday.

[00:32:18]>> So we had to change OpenAI's API infrastructure >> to actually keep pace with Cerebras. And so, if someone's interested, there's a lot of very neat software engineering that has gone into how we shave off latency at every layer of the stack.
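
The effect described here, where speeding up one layer makes the remaining latencies prominent, can be illustrated with simple arithmetic. This is a sketch with hypothetical numbers (response length, token rates, and fixed overhead are all assumptions), not a description of OpenAI's actual stack.

```python
def overhead_share(output_tokens: int, tokens_per_s: float, fixed_overhead_ms: float) -> float:
    """Fraction of end-to-end response time spent outside token generation."""
    generation_ms = 1000.0 * output_tokens / tokens_per_s
    return fixed_overhead_ms / (fixed_overhead_ms + generation_ms)


if __name__ == "__main__":
    # Hypothetical: a 500-token response and 100 ms of app and API overhead.
    for rate in (50.0, 200.0, 1000.0):
        share = overhead_share(output_tokens=500, tokens_per_s=rate, fixed_overhead_ms=100.0)
        print(f"{rate:>6.0f} tokens/s -> overhead is {share:.1%} of total time")
```

The fixed overhead never changes, but its share of the total grows as generation gets faster, which is why it only becomes worth engineering away after the accelerator upgrade.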

[00:32:33]>> What's the name of this blog?

[00:32:35]>> The OpenAI blog.

[00:32:36]>> OpenAI blog. Great. Great. Great.

[00:32:37]Great. So it's like a whack-a-mole problem, you know, similar to how folks were optimizing page load times. Yes. On the internet.

[00:32:44]>> Yes. I mean, I think latency is going to be a very important dimension we will focus on, >> right?

[00:32:49]>> Uh I think the trope is true that every 30 or 50 milliseconds of latency you can shave >> leads to higher engagement, leads to higher revenue, leads to higher retention. For sure that is true, right?

[00:33:00]And I think that is going to be a dimension on which all of us are going to compete.

[00:33:04]>> Fascinating. That makes a lot of sense.

[00:33:06]And particularly given the attention uh of the human brain is only going one way.

[00:33:11]>> Yes.

[00:33:12]>> Not not expanding.

[00:33:13]>> Yes.

[00:33:13]>> Here's a fun question for you. You know, every guest we've had so far has mentioned that compute is the biggest bottleneck as an ingredient for their business. Probably true. Um what is the consensus that the AI community has maybe wrong, or not right enough, that you have reason to believe um is misunderstood? What about AI infrastructure is most misunderstood?

[00:33:36]Right now. >> I guess the biggest shift that is happening that is underappreciated is we have very simplistic systems today, right? And what I mean by that is >> we have these big compute units attached to one layer of memory, which is high-bandwidth memory.

[00:34:00]>> Mhm.

[00:34:00]>> Right. And I think we went through this in general purpose computing. CPUs started similarly, right? And then they added multiple layers of caching.

[00:34:11]>> They added flash, they added hard drive storage, all kinds of stuff, right?

[00:34:15]>> And so I think we are very early days in how systems infrastructure is going to evolve >> for AI compute.
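
To illustrate why CPUs grew extra memory tiers, here is a minimal sketch of expected access latency across a hierarchy. It uses a simplified model in which a request pays only the latency of the tier that serves it; the hit rates and latencies are hypothetical and are not figures for any real CPU or AI accelerator.

```python
def effective_access_ns(tiers: list[tuple[float, float]]) -> float:
    """Expected latency for tiers given as (hit_rate, latency_ns), fastest first.

    The last tier should have hit_rate 1.0 so every request is served somewhere.
    """
    expected = 0.0
    reach = 1.0  # probability that a request gets as far as this tier
    for hit_rate, latency_ns in tiers:
        expected += reach * hit_rate * latency_ns
        reach *= 1.0 - hit_rate
    return expected


if __name__ == "__main__":
    # Hypothetical: one flat memory tier versus the same tier behind a small fast cache.
    flat = effective_access_ns([(1.0, 100.0)])
    cached = effective_access_ns([(0.9, 10.0), (1.0, 100.0)])
    print(f"single tier: {flat:.1f} ns, with a fast first tier: {cached:.1f} ns")
```

Under these assumptions a fast tier that serves most requests cuts the average latency sharply, which is the general argument for layering memory around a compute unit.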

[00:34:23]>> Uh we've gone from very simplistic ways of programming these things to more sophisticated ways. I'd say the other big shift that's happening underneath is AI is generating the next generation AI infrastructure.

[00:34:39]>> And so what I mean by that is we are increasingly using our latest models to design the next chip.

[00:34:46]>> Mhm.

[00:34:47]>> And the next set of low-level software needed to run the next model.

[00:34:51]>> Mhm.

[00:34:52]>> So recursion, if you will, right? So how can AI basically figure out what is the right kind of chip, system and software it needs to run most efficiently, >> rather than this decoupled world today where we train a model, someone else is designing a chip independently and delivering it to us, and we figure out how to make it work. So how do we uh quicken that pace, where basically the next model, while it is being trained, is also figuring out what should be the chip and system design it wants to run most efficiently? We are not that far from that world.

[00:35:28]>> Fascinating. That's recursion. Recursive algorithms are some of the most powerful algorithms. So this seems like a brave future. >> It is, I think, but it is also probably the only feasible way to bend the curve on the compute cycle time, right? Like, how quickly can we get the right kind of compute designed and operational for the next generation? >> And so, because otherwise we won't be able to keep pace if humans are going to try and interpret and then design and then do it. >> A typical chip design cycle is 3 years. >> Like, from inception of the idea, ideating on what a chip should be, to actually getting it in production is 3 years, and that's too long given how quickly things are changing.

[00:36:15]>> Yeah. 3 years is right around when ChatGPT was launched.

[00:36:18]>> Yes.

[00:36:19]>> So yes, >> that feels like forever ago.

[00:36:21]>> It's an eternity.

[00:36:22]>> You know, one of the questions we ask a lot of um our speakers uh is this chart here. Um you know, we talk about the five-layer cake of AI as Jensen describes it: energy, chips, infra, models, apps. You play across all five of them. We're waiting for chips to show up soon from Broadcom and others. Um, if you were to uh guide us based on everything you know, which part of the stack is most likely to accrue value in the long term, what would you point to?

[00:36:54]Obviously, all of the money right now is in the bottom half of this layer cake.

[00:36:58]>> It changes, right? So, I mean, I think uh history rhymes. So if you look at the mobile revolution, initially a lot of the money was made by the telcos and the people building the infrastructure.

[00:37:12]>> Yeah.

[00:37:13]>> Uh then it moved up uh into the application layer, the people building the apps.

[00:37:19]>> And then it moved up into the cloud services layer.

[00:37:24]>> Uh I don't see any reason why this cycle will be different. We are right now in the world where the infra layer is where the profits are. Mhm.

[00:37:31]>> But over time it'll move to the platforms and the apps.

[00:37:35]>> Mhm.

[00:37:35]>> Uh and so that is uh that is I guess the inevitable cycle, >> right?

[00:37:40]>> Oh, we hope so.

[00:37:41]>> It seems that every app is getting engulfed by uh openthropic.

[00:37:45]>> Uh certainly right this second. So, so, so we were eagerly waiting for that.

[00:37:50]>> Rapid fire question for you Sachin before we uh open it up. Long, short.

[00:37:54]Pick a business. Pick a startup that you're very excited about that you'd go long. And the same on the other side, a counterfactual, a business idea or startup that you're uh bearish about.

[00:38:05]>> I'm long OpenAI. I'm voting with my feet. But kidding aside, uh no, I think uh I'd say that the thing, maybe for this audience, that is underappreciated, uh I would go long on the lowest layer of the stack.

[00:38:24]Uh because at least in the US we have forgotten how to build very foundational infrastructure.

[00:38:34]>> Mhm.

[00:38:34]>> Right. And that's from everything like how do we build transformers at scale?

[00:38:39]How do we build batteries at scale?

[00:38:41]>> How do we build generation and distribution? How do we build cooling?

[00:38:46]How do we build components that go into all of these systems?

[00:38:50]>> Uh that is an underserved layer of the infrastructure.

[00:38:56]>> Fascinating.

[00:38:57]>> Uh that is also one where differentiation is sustainable because it's both technical as well as scale. If you build it, it's very hard for other people to replicate it.

[00:39:10]>> Uh so I'd say kind of a corollary bet on if AI is going to have that transformation that we think it will.

[00:39:18]>> Really, all of this layer has to change from how it's done. Mhm.

[00:39:24]>> And so for people in this audience, especially early in their careers, >> I think, I was a faculty here, as some of you know, for 15 years, both in EE and CS, and I saw dwindling enrollments in EE.

[00:39:38]>> Uh especially on the lower layers of the stack, right? Especially around how do you do transistors, how do you do materials, how do you do that kind of stuff.

[00:39:45]>> That stuff is what will move the needle here.

[00:39:48]>> And so I'd strongly encourage going long that layer of the stack. Great. Great.

[00:39:54]We have some EE students in the class. Uh what are you short, Sachin?

[00:39:59]What are you skeptical about? What are you cautious about? We'll lower the stakes.

[00:40:03]>> In general, obviously, I'm short anything that is a model wrapper.

[00:40:10]>> Mhm.

[00:40:11]>> Uh and that's a bit of an easy answer, but it is also true because the pace at which this thing is changing.

[00:40:18]>> Mhm. uh and how quickly these models are able to introspect and figure out how to deliver an outcome.

[00:40:28]>> Uh I'd say that it's very very very hard >> to to just be a wrapper on top.

[00:40:35]>> Um so that is not a statement that OpenAI, or even Anthropic for that matter, I would say the same, >> that we just want to build all the apps.

[00:40:43]I think this whole notion of apps >> probably to me is the one that I'd be short of. >> Like, is that going to be the user interface of the future? Unclear, right? Like, is it really going to be apps if we are going to >> interact with computing in the form of outcomes? This is the outcome I want, go figure it out. >> Today apps are a crutch >> Right. >> to get to an outcome. >> Right, right. >> And so that would be the notion I'd be short of. >> Fascinating. That makes a ton of sense. Um first company to 10 trillion in market cap, if you were to pick one?

[00:41:17]>> The easy answer is Nvidia, right?

[00:41:18]>> Yeah.

[00:41:19]>> Yeah. Yeah. Okay. I thought for a second you were going to say open. Yeah.

[00:41:22]>> The first one you said. I hope I will get there for sure. But I think if we are getting there for sure, Nvidia is getting there.

[00:41:29]>> Good, good, good, good, good, good. Um, biggest unsolved problem in infrastructure right now.

[00:41:35]>> Oh, you you name it. Right. So I think uh so many uh >> I'd say the single structural issue is enough fab capacity across logic and memory.

[00:41:48]>> Uh >> this is at the TSMC layer.

[00:41:50]>> TSMC, Samsung, Intel and then Micron, SK Hynix, Samsung. Uh it's a very, very concentrated market.

[00:41:58]>> Mhm. >> Uh this whole thing is kind of single-threaded on a very small number of companies, >> and probably if you dig down even deeper, it's ASML. >> Right. >> Right. And for all of these, right, you need ASML machines. >> So to me that is the single choke point >> Right. >> of the whole supply chain. >> Right. Makes a ton of sense. You already answered my last question, which was advice for students. If you have anything to add, we'll take it. Otherwise, we'll open it up for questions for a couple of minutes. Go ahead. >> Thank you for being here. Um my question is, as we go through, and you mentioned compute is a very broad term, where do you think the next is going to come from? Is it going to be the hardware, like memory, networking, software?

[00:42:45]>> Short to medium term it's probably in the orchestration software the harness and the models getting more token efficient.

Uh medium to long term, I'd say new memory architectures, uh because I think the compute unit, unless the transformer gets reinvented, like something replaces the transformer, you know what the compute unit shape is, right? It's really kind of what is the memory architecture around the compute unit that's changing all the time. So that would be my medium to long-term answer. >> Go ahead. >> that you walked away from Stargate EK. goes through some of those positions through the process for >> Uh I think for us, Stargate is basically, Stargate is my job, right? So it's how do we deliver all of this compute. Uh and the way we look at it is, given the size we're talking about, uh like a gigawatt is $70 billion in spend, so these are massive numbers. And it's also operationally a big challenge, right? As I said, a gigawatt is half a million GPUs to manage, build up, staff and all that. So I think fundamentally the way I look at this is, how do I make sure that it's not just the absolute number, it also lands on time, as quickly as possible? And so a big part of our approach is now time to compute rather than amount of compute. And so that's dictating kind of where we double down and invest. And that's why, on the earlier question too, we prefer bigger chunks of concentrated compute, for that reason, because otherwise operationally it's very hard for us to get that compute online if it's lots of little chunks spread everywhere. >> Go ahead. >> I want to ask about open weight models. models are proving first part of the question is how you think that that's second part is key Obviously open weight models usually have fewer parameters, require less compute.

[00:45:03]>> Yeah, I mean I think uh obviously open source models have a role to play uh in the ecosystem. Uh frontier model intelligence is going to require orders of magnitude more compute, right? Uh I think that we don't see that changing, so the scaling laws continuing to hold. >> So we will continue to invest on that frontier model intelligence. Obviously open weight models will play catch-up and try and distill that intelligence to deliver it in more uh compact form factors, and we don't see that as an issue. But a six-month lead in intelligence is an enormous lead. >> Right. >> Uh and so we don't see any reason to back off on continuing to invest on frontier intelligence.
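
The two gigawatt figures quoted in the Stargate answer above ($70 billion in spend and half a million GPUs per gigawatt) imply some simple per-GPU numbers. This is a back-of-the-envelope sketch: the results are all-in facility averages derived only from those two quoted figures, not a chip price or a chip power rating, and the function name is hypothetical.

```python
GIGAWATT_W = 1_000_000_000
SPEND_PER_GW_USD = 70_000_000_000  # "a gigawatt is $70 billion in spend"
GPUS_PER_GW = 500_000              # "a gigawatt is half a million GPUs"


def per_gpu_figures(spend_usd: int, gpus: int, power_w: int) -> tuple[float, float]:
    """Return (all-in dollars per GPU, all-in watts per GPU)."""
    return spend_usd / gpus, power_w / gpus


if __name__ == "__main__":
    usd_per_gpu, watts_per_gpu = per_gpu_figures(SPEND_PER_GW_USD, GPUS_PER_GW, GIGAWATT_W)
    print(f"all-in spend per GPU: ${usd_per_gpu:,.0f}")
    print(f"all-in power per GPU: {watts_per_gpu:,.0f} W")
```

These averages fold in everything around the GPU (buildings, power, cooling, networking), which is consistent with the speaker framing a gigawatt as an operational challenge and not only a purchase.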

[00:46:00]>> Awesome folks. We'll wrap it here.

#Stanford #Stanford Online #Artificial Intelligence #AI
