Cheema masterfully turns the memory-bound bottleneck into a playground for heterogeneous computing, proving that local AI can thrive on a "Frankenstein" cluster of consumer hardware. It’s a brilliant technical workaround that democratizes frontier models for those who prefer clever engineering over paying the Nvidia tax.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Frontier AI at Home — Alex Cheema, EXO LabsAdded:
I guess that's thought simple.
Uh who who's familiar with an LLM?
Yes, everyone. Um who's who's familiar with the concept of prefill phase and a decode phase in uh LLM inference?
Yeah. Um who's used who's run a model locally?
Okay, wow.
That's good.
Uh I I always say like we're very early, but this makes me think maybe I mean, literally all of you raised your hand, so maybe we're further along the adoption curve than I think. Um and uh yeah, this is a a scary one. Who's used Exo before?
Okay. Couple.
Uh cool.
Um so yeah, I guess um just a little bit of background on myself. So, I'm the uh I so I work on Exo. Um we're a uh lab focused on running frontier AI on local hardware.
So, what we're doing is we're looking at uh full stack, what's involved in running inference locally, and uh working across the whole stack. So, on the models, software, the hardware side as well. And um our mission is to drive down the cost of running frontier AI systems locally.
Um so current state of things is like most AI runs in the cloud.
Um typically if, you know, I guess this group uh a lot of you have used local AI before, but typically if you're going to use a model uh most people are using, you know, uh models that are running in data centers.
Um so why is that a problem and why why should we even care?
Um well uh I guess the name Exa actually comes from exocortex. So, this is the idea that, you know, AI will go beyond just being this tool that you use for a chat interface, which it already is, and it's more of kind of extension of yourself.
And if you think of it like that, it's almost like a part of you and a part of your brain.
And then you ask the question, well do I actually want to rent my brain? And I think Andre Karpathy uh has like a nice one-liner that really summarizes well, which is like, not your weights, not your brain.
Um and you know, there's there's something uh quite deep about that statement, I think, and think we're now starting to realize, you know, as things like Open Claw, you know, and agentic systems are getting more popular and it's more than just a chat interface that the um there's a lot of concern about you know, where is this thing running?
Where is my data going?
Um you know, what if it gets cut off, right? Like you know, I had I had a friend that works in cybersecurity and uh he was doing some, you know, penetration testing, which is uh uh pretty innocent uh you know uh doing it for securing a system and basically got locked out of like three of the um API providers uh like Cloud, Gemini and uh and ChatGPT.
Um you know, there's this is becoming more than just, you know, this this uh this chat, right? It's like, you know, you need it to basically be competitive in any field now and there's uh you know, with centralized systems, you have uh you're relying on a few organizations basically for this.
And um I think, you know, there's kind of this there's two realities, right? One is where we have this closed-source world and in that world, there's going to be this massive power law where there's a few companies that basically have the most capable models. And what they'll do is, I mean, they will rent seek on on that because you know, that's how they make the most money. And uh you know, what we believe is there needs to be a kind of a competing force of, you know, keeping things open and actually being able to run these models without needing massive data center, massive amount of compute.
Um so I just want to talk a little bit about sort of, you know, the actual technical side of what we're doing. So um I guess the first uh the first thing to realize here is like a lot of the discussion is kind of around training or has been around training and um you know, that's um been a lot of the focus of, for example, like open-source efforts has been in like, okay, we need to have like um you know transparency around training and uh what I'm talking about here is like specifically inference. So like if you already have the model, you know, how do you run that model? And if the only option is to buy a million dollars of hardware to run like frontier model, then that's like a massive barrier and um that's kind of what we're focused on here.
And um like in terms of the hardware, because the focus has mainly been on training, um what uh you tend to see is that there's this idea of the hardware lottery, which is, you know, the research that's out there isn't actually um you know, the best possible thing that you can do.
Uh there's a lot of research that's being done on the current hardware stack, which is primarily been built around training. And uh what that looks like is basically these Nvidia GPUs uh that you stack up in the data center.
Um but that's not necessarily the you know, the the best thing to do, especially then when you look at inference. So um you know, the the current uh the current hardware is very much focused on like, you know, flops and flops unit economics. And uh you know, there's this I really recommend like giving this a read. Uh it's from Sarah Hooker, who was um she was at Google Brain and and then Cohere. Now I think she's doing her own thing. But um you know, basically the idea is like there's all these ideas out there that haven't really been explored because we just have a lot of inertia behind the current way of doing things with the current hardware. So our thesis is, you know, there's all these things that haven't really been explored uh with different hardware and in the context of inference that, you know, can really um be a lot better than the current way of doing things.
And uh this is really an area that's super uh there's a lot of like low-hanging fruit out there. So, like, you know, with a little bit of um like I'll give you an example. So, like, the other week we were just looking at quant 3.5 and running that locally on Apple silicon and we found that like if you look at the theoretical uh speed that you would get running this, it's like way off what you get in practice.
And uh it was off by like 50%. So, like 50% slower than what we thought it should be. So, we looked into like what it's doing and we found like basically there's like bunch of overheads that is introduced by inefficient kernels and specifically like having a lot of unnecessary kernels that kind of get launched separately, which leads to a lot of uh overheads when you're running inference. And, you know, each one of these um kernels adds, you know, quite a significant amount of overhead if you're thinking about running a model like uh quant 3.5 locally, you know, theoretically you might be able to get like 150 tokens a second. Um which is, you know, a token every less than 10 milliseconds. So, if you've got like a few milliseconds delay here and there, it's like it adds up to a lot. And so, what we did is we just did a little bit of work on sort of looking at okay, what's going on and we realized, you know, there's all these separate um kernels being launched that are unnecessary. So, we just did some pretty basic work to fuse that all together and increase the inference performance by 30%.
So, um it's just an example of like you know, there's a lot of stuff out there that maybe you would think is optimized and you'd think is already um pretty close to you know, getting the best utilization out of the hardware, but it's just not the case.
And um you know, this this exists across the whole stack. So, that's on the kernel level, but there's also a lot of stuff you know, in the orchestration in terms of, you know, how you connect different piece of hardware, you know, communication overhead, even stuff like the harness, right? So, like I think what people are starting to realize is actually there's a lot of value in the harness and you know, for example, if you use closed code with a certain model, and then use open code with exactly the same model, you get completely different performance.
And you know, this is specifically interesting in the context of local because you're super resource constrained.
So, anything you can do on the harness layer that kind of is aware of the hardware, you'll be able to get a lot of gains there. So, it's just across the whole stack.
So, broadly speaking, like I said, training is about flops, right? So, that's basically everything is compute bound.
And what matters then at scale is basically how cheaply can you get flops and, you know, how much energy is it going to use?
But inference is mostly about memory.
So, um you know, basically everything all the operations you're doing, depending on the model, is like most of those operations are memory bound.
And specifically like when you're running stuff locally, if you're running so- something for yourself, you don't have the kind of um you you don't have the ability to batch together, you know, multiple users' requests. So, everything you do is kind of uh low batch size, which is memory bound. Why is that? Well, yeah, basically this is kind of how uh inference looks broadly speaking with most models today. So, there's kind of like a pre-fill stage and a decode stage and my argument is prefill doesn't actually matter that much especially when you're running the stuff locally and the reason is basically okay so just explain like prefill stage is loading your contacts with your prompt so you know if you're loading in a PDF or something it's the part that actually generates all your KV caches and then you have the decode stage which is auto aggressive and that just generates token one by one that part's memory bound prefill part is compute bound with the with the prefill part what you're seeing actually is like and again this goes back to this idea of like you know the harness matters a lot so a good harness what it will do is it would get a lot of cash hits so it would keep the prompt mainly the same and if you look at if you type in a slash context when you run code code you can actually just see this so you can see that um you know it would basically show you like uh all the parts of the prompt and you can see there's like a big part of it that is just the system prompt and system tools and that stuff doesn't really change so maybe if Claude you know maybe if they push an update this will change but broadly speaking you can you know keep most of the prompt the same when you're running you know actual workloads end to end you know maybe in the benchmarks you'll see people that are doing stuff at like you know really long contexts really long prompt sizes and you know my my argument here is it doesn't matter actually as much as people think so you know really it's about decode and what matters for decode well it's three things so like I said it's memory bound but you know the first thing is you have to actually fit it into memory so um if you want to run a model at a good speed, then if it doesn't fit into memory, you're going to be loading it from disk, which is super slow.
Uh, so that's like the first like hard requirement, basically.
Uh, second is memory bandwidth. So, this tells you how fast it's going to run.
So, how fast can you actually load, you know, your model weights, load your KV KV caches into GPU? Um, and the third thing, and this is something that um, is particularly like uh, important locally is, uh, the energy. So, like, you know, um, here I I I'm talking about energy per byte, so I'm talking about like in these memory bound in this memory bound decode phase, like how much energy does it cost to move one byte one gigabyte?
Um, and that tells you basically, if I'm going to run an inference, like how much power is it going to consume?
Um, yeah, this matters a lot. I mean, I don't know, like, um, there's a lot of these kind of demos of doing stuff on phones, uh, that you see on Twitter.
And, um, we did a lot of stuff with phones, you know, a while ago, maybe 18 months ago. Um, but we found like there's a big issue, which is the energy. Uh, and, you know, the batteries on on phones are quite limited. So, you know, maybe you're talking about 10 to 15 watt hours, uh, on like an iPhone. And, uh, inference, you know, it might be consuming something like 10 to 15 watts, right? So, then you're talking about like 1 hour of battery life. Which is, uh, not really usable. So, of course, um, you know, there's a lot of work being done to improve that. And I think phones eventually will be able to run, you know, better models on phones. But right now, it's just like, you know, it consumes a lot of power, and it also gets really hot. So, actually, when we were running benchmarks, um, like 18 months ago or so, like, the phone was getting so hot, um, that I couldn't hold it. It was actually like too hot to uh to hold.
Um So, yeah. These are the the the three This is what these like actually map to in practice. Um and uh on the So, like there's this uh paper that's pretty recent um and uh it's from a group in Stanford uh hazy research maybe uh some of you know them from their other work like uh ThunderKittens and stuff.
Um But, like now they actually have a group that's focused on this concept of intelligence per watt. And the idea is this kind of encompasses these three things and it's uh a way of like tracking like how are these things improving over time?
And the metric is um you basically look at like how good is a model at a specific task. You divide that by the energy that it uses.
And if you track this metric, uh you see that you know, this is actually improving um exponentially. So, over the past 2 years uh it's been about 5x.
Um the correct term for this actually shouldn't be intelligence per watt, should be intelligence per joule really because you're not really um concerned so much about the time here.
You're concerned about okay, for a given task, you know, how much energy is that going to consume?
Um Yeah.
So, yeah. This is uh This is now looking at uh sort of memory improvement. So, this is actually like I mean, the fact that you can buy a commodity piece of hardware like this that has I mean, Apple got rid of the 512 GB option recently, so you can't buy five These are 512s. Uh but uh you know, you can buy a 256 GB and it's like, you know, you can buy off-the-shelf basically. Um So, you know, this this is like kind of a very new thing. Like you wouldn't you weren't able to buy consumer hardware that had this much memory in it. Um so, this is like improving dramatically and you know, uh yeah, give an example as well like the new MacBook M5 um Max has 120 like you can get it up to 128 GB and the memory bandwidth is pretty pretty damn good. Um 614.
So, yeah, this is kind of uh looking now at the intelligence per joule.
Um so, uh do do do yeah.
So, it's actually So, the 5x from the hardware improvements then you've got another 3x from uh model improvements and obviously these things compound.
So, what you're seeing is that um a lot of the uh improvements here are coming from, you know, hardware, right? Um and also uh the model layer. So, this again goes back to the idea that, you know, it's it's kind of about looking at the whole stack, right? There's a lot of gains across the whole stack and uh there's different things you can do.
Uh Guys ask where we are on the demo.
screen Okay, maybe we can ask.
Yeah, cuz I want to show you this as well, right? Uh I don't I don't just want to talk. Also, if anyone has any questions or anything, yeah, you have a question? Yeah, so just in terms of the current gen Mac side of things, like where do you see this going in real life? Like I I'm trying to tell my friends like think of this like an appliance in your house you're going to go spend money on a fridge.
I want to be this. And they look at me and they're like telling me to spend five grand on a box, right?
Um Yeah.
>> Where's this headed in in of like the consumer appetite? And do you know, is it purely just upstream supply chain issues or is Apple seeing this as a new profit center and maybe re-positioning pricing strategically? Like Is Apple seeing as what, sorry? Like a new a new blue ocean segment that they can go sell into. Because I think Yeah.
>> I'm just what I heard about the supply chain is that when these are built, they're already funded. So like both of those essentially get all their commercial contracts in for RAM and then they'll price the different tiers.
And I think what's interesting is the pricing changed to bump the storage and have kind of reasonable RAM.
Mhm. But more of more of the funny theory is that I see you as the a buyer of, you know, a lot of this hardware.
Yeah. Curious what your thoughts are around like the future of pricing and just like consumption as a consumer.
Yeah. Um maybe I can just repeat the the question for the for the mic. Uh the but the question is uh where do you see where do you see like, you know, the hardware going and, you know, are people actually going to be spending $5,000 on some kind of inference machine?
Um So my thesis is this, you know, like I said, um you can look at certain metrics like intelligence per what, right? Um and what you're seeing is that progress is exponential also for local, right? And um you know, right now today, you need something like this, right? If you're going to run frontier models, then basically your only option is, okay, first of all, the bar is always moving because, you know, maybe now the frontier open model, you know, we have GLM 5.1, which I want to show you as well, uh that came out yesterday and that's probably the frontier model now for for open source.
And if you want to run that, it's a trillion it's a trillion parameters.
Um and natively it's uh it's FP16 as well. So, like um you're talking about like 1.5 terabytes or something, right? So, you need to like fit all of that into memory, right? Going back to the first thing.
Um and for that, you're talking about, you know, let's assume that you still have the 512 GB Mac Studios like these, then talking about like $40,000 of hardware, right?
Um and even then, it's not going to run that fast, right? So, like it's kind of, you know, maybe going to run at something like 20 tokens per second if you're doing this kind of setup.
Um that's not acceptable for a lot of people like, you know, I think people are used to a bit more than that now with uh the cloud, so maybe something like 50 tokens per second is kind of what people are accustomed to.
Um so, that that's like the the current state of things, right? But, um you know, our our thesis is this stuff is going to um there's a 100x in there. So, like, you know, if you look at like all these parts of the stack because they compound, um if you're making like changes to the harness, if you're making changes to the the models, if you're making changes to like the kernels, all of this stuff together, um there's still, you know, like um a 100x in terms of like price to performance in there. So, where I think things are going is like quite soon you will like I think within uh within let's say 18 months, you'll be able to spend $5,000 or $5,000 and have close to frontier level performance running quite fast.
Um but it's going to take kind of this co-design, right? Of like looking at things across the whole stack. You already see that happening in a data center.
Uh, so you know, um the Grok acquisition by Nvidia, you know, you're seeing like more specialization, different chips for different things.
And this idea of extreme co-design.
So, looking at everything together and seeing like, okay, how can we build this these things in a way that uh you know, gets the best performance for like the end use case, right? And I think things have consolidated in like even though things are moving quite fast, like some things that don't seem like they would probably change like, you know, having like uh agents seems to be you know, the the paradigm and it will be the paradigm for a while.
Um I don't really see that changing.
Uh you know, you see like massive MOEs, right? So, like that also seems like that's kind of consolidating and that's why Nvidia can actually now go and say, okay, we're going to build like specific hardware for this architecture because you know, they know kind of more about how these end use cases are going to work. And uh yeah, so I would say like maybe now I wouldn't actually recommend to like friends to say like, oh, go spend $5,000 unless they want to experiment. Um but I would say within, you know, definitely within 2 years, you you will actually have some products on the market that you can just say, hey, go buy this box and instead of having all these subscriptions or you know, now if you want to use open core, you can't even use that with you know, the the Opus API, right? Um so, you're basically going to be spending probably I know people who are spending like thousands of dollars a day on uh tokens at the moment, right? So, instead of that, um you know, just say like, buy this hardware and uh you know, you never have to pay for a token again.
Uh it's basically free, right? It's just the electricity cost.
Uh, so yeah, I I'd say in 2 years.
Yeah.
I have a question.
>> Yeah.
Yeah, it's an estimate. I mean, but yeah.
Yeah.
Yeah.
So, the the question is like more about the macro of like, you know, where the models are headed and and uh, you know, you you have like a lot of progress in making these models bigger.
So, like the rumor with Myth Mythos is like 10 trillion parameters. I think that's the rumor, like 10 trillion parameters. So, like um, and then you also had rumors about Gemini being something like 4 trillion parameters. So, like and then maybe the next model run is like 20 trillion or something. So, like you know, you're seeing that the models are getting bigger. At the same time, Gemini 4 just came out, which is like tiny, and it seems to be better than, let's say, the the best model from 2 years ago.
Right? So, like you know, where are things headed? Um I think it's all of all of the above.
Like, I think there's there's going to be like a Well, there already is, and there's going to increasingly be a massive demand for compute. And there's always going to be like a lot of progress being made in the data center to just, you know, pack more compute, and um I think what you'll see at some point is maybe things will bifurcate. So, like there'll be like all these things that you can run locally.
And uh you can maybe do like 99% of things locally, because if you look at like different use cases, all these use cases kind of follow um an S curve, where like if you look at how intelligence is is is changing, then at a certain point there's massive diminishing returns on running a more intelligent model for a certain use case. For example, I don't know if any of Has anyone used Whisper Flow?
Yeah.
So, like as as far as I understand, like with Whisper Flow, there's like a little bit of intelligence there to do like good transcription. And um to me, like having like a 10 trillion parameter model go First of all, like having a 10 trillion parameter go reason for few minutes wouldn't work in that use case, cuz you need it to be low latency. But secondly, I don't think there's actually much of a return there on the utility that you're getting as a user.
Um there seems to be like a threshold at which, okay, if you have enough intelligence, the transcription is good enough.
And I think every use case will kind of follow this, right? And you can point to like loads of other things that are already kind of like this. Uh so, summarization, for example, um or, you know, something like uh creating a to-do list or, you know, summarizing emails, like simple things like that. It's like, do I really need this massive model? Like, no, I think there's like massive diminishing returns. So, that's going to happen and then you'll have a point where, okay, there's still use cases where you just do need a lot of compute. For example, if you're going to like cure diseases or whatever, then obviously, you know, you're going to want like a really intelligent model and spend a lot of compute on that. But, um most things that like especially consumers use, they won't need it. Um and uh that's where I think most things will be able to run locally.
So, that's where then you have these two cases, right? Like, either you need to spend it like a load of compute and you need like billions of dollars of compute to do, you know, this insanely complex thing or, you know, you can just run things locally.
Uh that's my thesis. Um I think it's hard to like the the I guess uh a one of the thing that uh Karpathy said recently is like the fog of war is closer and closer uh getting closer and closer. So, like, it's more there's more and more uncertainty. Like, for me, I've been quite surprised a few times over the last 2 years. For example, I think the biggest like inflection point for me was uh you know, the adoption of Claude Code and Opus 4.5 and you know, a lot of people uh came uh you know, uh back to work after like the Christmas and used these latest models and they're like, holy [ __ ] like things have really improved. And that was the case for me, at least. It was around Christmas time when I was, you know, that was the first time actually I used uh Claude Code. uh and then you know, I was still kind of skeptical at that point that it would be able to do a lot of the work that we were doing, but then it seemed like, oh, the frontier has moved quite a lot. There's a lot of stuff that just wasn't possible before with with the older models and now it's suddenly possible.
Uh and a lot of that was also the work on the harness and and stuff like that.
Um so No.
It is just a lot of uncertainty here as well. Like uh I mean, who who knows like the next generation of models what they're going to be like. Um but I think generally speaking though, this will be true that you know, there's not kind of just unlimited returns on intelligence, right? There's a certain point where you just don't need any more intelligence to do something. And uh if that continues to be the case, then it doesn't really matter where things go. Um you know, as long as the kind of progress continues, then you'll just be able to do more and more stuff locally.
Uh Yeah.
Um going back to the consumer uh $5,000 inference box what I think what you're saying is that we'll get to this point where local models slightly newer hardware will be good enough uh smart enough for most of the consumer use cases. So, we're actually kind of getting to this point where the standard use cases now that we might be using frontier cloud models for work done on a Mac Mini style device.
Not a Mac Studio.
Uh but we'll still look for more compute uh in the future because it gives us better inference because it gives us competitive advantage or we're solving frontier problems. And that's what pushes the curve at that point where consumer demands might like button for for your prints.
Um I'm interested in like another factory of play which is that Everything's very new at the moment so we haven't seen a lot of like model on a chip come through.
I think at a level that really meets people's needs pretty fast but I think as we like as we have overweight models that stabilize on a smartphone enough put these things in cases like I think that will eventually come to potentially be a sufficient to alternative.
You know in a world where maybe the cycle of obsolescence actually slows down a little bit and we can then get better better use out of that. Is that Is is that a an interesting hypothesis or is it started in somewhere?
So what is the hypothesis? So the hypothesis is that >> that perhaps like everything's changing so quickly now that we don't have time to make good model on a chip.
Oh. Everybody's moved on to say like the next button is already. Are you talking about like specialized chips for specific models? Yeah. So like say 23 or two. Yeah.
Oh yeah. Um okay so question is about uh things are changing really fast right now but maybe if things kind of slow down a little bit or stabilize and consolidate then the best uh thing is actually to have specialized chips.
Potentially yeah. That's the thesis.
Yeah.
Uh yeah it's a really good question.
Um Like Talos. Yeah like Talos for example.
I mean Talos is on the extreme end of like you know literally hardware built specifically for a certain model, then you have I think you have a spectrum, right? Like you have GP GPUs, like general purpose GPUs, like Nvidia uh RTX or something or H100, right? Which is like it was meant to do, you know, be this platform that you can basically do everything. And then on the other end you have Talos. And then in between you've maybe got like Cerebras, Groq. Um you know, I think for now at least it didn't really make sense to um build these specialized chips for LLMs because they are changing so quick.
So the frontier is constantly moving, but it is really interesting, I think, if you then start to think, okay, well, if you are going to hit diminishing returns on a lot of these use cases and you have a model that's kind of good enough, then maybe it at that point you you kind of have to look at like what is the cost of building that versus the savings that you'll get um using that hardware and deploying it at scale.
Um like the math didn't really make sense at the moment cuz maybe the frontier moves every 3 months. So like, okay, by the time you built the chip, it's useless, right? Um but uh I think that will change. Yeah.
So I think, you know, we're not we're not really like a we're not building our own hardware. Like we are doing a lot of work to like figure out what the best hardware is to use, but um at some point, you know, maybe it would make sense to start doing that or like work with someone who is building hardware because yeah, like especially if you kind of co-designing the whole stack, you can be very opinionated about the models. So it might even be the case that, okay, today there's if you look at the closed labs, right? They have to provide an API that's quite generic because there's all of these use cases people are using it for. People are relying on on it for like all these different things, right? And uh as a result, they have to have this big monolithic model that can kind of do everything.
But you could imagine if you're going to go be more opinionated and maybe things do consolidate and there's certain use cases we know, okay, this is what we want, then you could be more um kind of strategic about specializing the models. So, you could say for example, okay, we're going to have 20 different models and these 20 models cover, you know, pretty much the same things that this big one one big model could do, but like each one is specialized for a different task.
And uh maybe then what that looks like if you then look at the hardware is you would have a few chips that are specialized, you know, can run those models really efficiently.
Um yeah, I think that is yeah, that that that that should um that should be where things go, right?
If you assume that, okay, there is going to be this um uh things are going to flatten, right? And there's, you know, for a consumer you can do 90% of things locally, then I think that thesis will probably play out. Um yeah, I I I I I I uh I think it's a really good point, yeah.
Cool.
Yeah.
I mean, I like this. I think I think this is better than just talking. Yeah, I think I have a question on the hardware side of it.
>> Yeah. Because um if you compare the Nvidia GPUs versus the Apple's metal hardware, um Compare the Compare to the Apple's Yeah.
>> metal hardware, the rest of the architecture. So, if you load the big model into obviously they have big memory, but if you load the model, they can fit into it, but if when you actually run the inference, the hardware get like degraded so so quickly. Like you can't get that much even if you have like 128 GB Mac compared with the 5090 RTX, you can get way better inference on that Nvidia hardware compared to the Apple hardware.
Did you feel something like that?
Yeah, the question is uh you you can fit stuff into memory on a Mac, but it's slow. So, like when you run things on RTX, it's much faster. So, did you see this and uh what do you think about that? So, yes, uh I agree with you completely like um these these things are very different, right, in how they're designed. So, the Mac is kind of unified memory. It's a big pool of memory, and it's maybe not as fast in terms of memory bandwidth. It doesn't really have that much compute.
Then you have the RTX. Let's say RTX 5090. You're talking about If you compare that to a MacBook, it has way less memory, right? 32 GB of VRAM, but it's much faster, right? So, I think it's like GDDR7, which is Yeah.
close to 2 TB a second.
Uh maybe like 1.5 TB a second, whereas a Mac, you know, you're talking about maybe let's say this Mac Studio, right? 512 GB. So, it's like more than 10 times the memory, but it only has 800 GB a second memory bandwidth. So, about half.
And it has about 10 times less compute, right? So, you know, I think um that's all true, and uh my our thesis is basically you want both.
Um so, um I talked about these two separate phases of inference, right? The pre-fill and the decode.
But uh you know, like we tend to like look at these models as just a bunch of layers, right? That you know, we don't really go much deeper often, especially like you thinking about you know, running a model, a lot of people are just like, oh well, you know, I just need a certain amount of memory bandwidth to run at this speed or whatever, but like um there's a lot more going on under the hood, right? Like, you know, if you start to look at the architecture of these models, then there's a lot of things happening there, and my thesis is you want to actually I mean, this is already happening in the data center, but you want to kind of run different parts of the model on different devices. So, the The reason why I asked because I got this one 120 gig uh black I got 220 billion model running here.
If I run this inference for 15 minutes, this laptop is useless.
It's so hot, but even if I plug in, battery get like drained within 5 minutes, 10 minutes.
It's useless. It's like basically compared to if I run the same inference on like RTX 5090, I can get much more power. I can Yeah.
>> a big model, obviously. I can get like 30 billion models, but inference I get is amazing.
Um so, that's why I asked question your experience with this kind of Yeah, so somewhere here um Can Can we show the Spark? Yeah.
So, like uh actually Nvidia doesn't generally sell their own hardware, right? Like, they actually usually partner with OEMs that then sell the hardware to the end customer, but in the case of the Spark, the the one you've seen is probably the Nvidia one cuz they made an exception there. They have like Founders Edition, which is um it's not this one. So, this is ASUS one, but it's the same thing, basically. Um and uh this costs about $4,000.
And uh One thing we did recently is um combine this with a Mac, right? So, basically, this is just a very simple thing you can do, right? So, you can basically run This has a lot more compute. So, you can run the prefill phase of inference on here, and then this has a lot more memory bandwidth, so you can run decode there.
Now, with the RTX, it's like different because it actually has more memory bandwidth, right? And more compute, but it's a much smaller pool of memory. So, but that's actually um kind of fine.
Um So, without going into like too much technical detail, like you can basically get more granular and start splitting up the model in different ways, and you know, that's actually the most cost-effective thing to do. It's kind of crazy that you have to do this at the moment, like that there isn't actually some hardware that just has it all.
Uh but that that is the case today, right? So, we Today, like the optimal thing to do if you want to run models locally is actually to do both.
So, you would have a Mac Mac Studio or MacBook.
Uh the other way around. So, but but like, you know, there's more that you can do as well in terms of splitting up the model, and you know, we're going to like release some stuff soon which makes it really easy to do this.
So, imagine, you know, you can just plug in an RTX directly into your Mac and get like a 3x speedup, right? On running large models. That's the kind of thing that we're working on, and um again, it's kind of analogous to what's happening in the data center, right? You've got like with Nvidia, you've got these Groq chips, which are running part of the inference, and they're like very, very high memory bandwidth. And then you have like tons of them, so you have this massive pool of memory, and you pair that with Nvidia GPUs.
And uh same thing's happening with Cerebras is doing something like this as well now with Trainium AWS Trainium chips.
Um There's uh Um Yeah, there's a lot of like interesting work to be done at data center side, but like we actually We think the same thing will happen locally. It's just like the software isn't quite there, and the hardware's a bit awkward. Like the fact that you have to stack these Macs like this and connect them all with Thunderbolt cables and stuff, it's like a bit awkward.
At the moment, yeah. But like if you see, for example, um with your RTX, actually you could try running Gemma 4, the dense version, right? Should Should fit in You might have to quantize it a bit, but should fit into memory, but like the That's also interesting, right?
Because, okay, you know, MOEs maybe the current paradigm, but maybe it makes sense in some cases to run dense models locally as well, right? So, and in in the case of dense model, if you run that on a Mac, that's going to be slow because um now, okay, this the ratio of memory to memory bandwidth on the Mac is very high. So, like whereas on the GPU, it's you know a a lower. So, um basically, uh it doesn't actually benefit you too much to fit this whole model into memory now anymore. That's like very sparse.
Like if you want to run a dense model, you just want as fast you want it like a small bit of memory that can run it really fast. So, uh in that case, actually, you know, the best thing might be to just run it on the RTX, right? So, like this kind of depends where things go, but like I think at the end of the day, you'll want to you want both.
You'll want both. Maybe even like other things. Like maybe, for example, you know, you'll have a specialized chip or whatever, and maybe we'll have maybe we'll have SRAM locally.
It's uh at the moment, it's like it's it's well, it's super expensive, right?
And it's also like the density is very low of the memory, so you can only have a little bit of it, but maybe that's enough for a lot of use cases, right?
Maybe you can run a smaller model really really fast.
Um so, like I just think it's not so much like Mac versus Nvidia or whatever. It's like at the moment at least it's like, okay, actually, you want bits of bits and pieces of all these things.
Um and that's that's the way you get the most um price to best price to performance.
So, can you give us just like 1 minute on how do you achieve that today? Like let's say I just have this So, like I have infinite swarms of hardware around me or around the house, but I'm not using anything to host it. Yeah. And I'm running one gateway that's routing this request Yeah.
So, yeah, so it's really awkward to do at the moment if you were just like using the existing tooling out there.
Uh but that's kind of one of the problems that Exo solves. So, Exo is just an app that you can install on every device, and it runs in the background, and it will automatically discover any other devices that are connected.
Um and it works in a mesh network, so you can connect things however you want.
And um, basically the software like Exodus software figures out the best way to distribute your model depending on like what hardware you have.
So, that's our goal with with Exodus make that really easy, right? And then, you know, for us like having these heterogeneous setups and stuff is you know, we can solve a real pain point there because it is you know, you even talking to people at Nvidia that have tried doing this kind of stuff locally, it's like very awkward. Um, and uh, it's you run into all sorts of networking issues and stuff like that.
Like we want to make it like time keep basically. So, you install this app and that's it. Connect things however you want.
Uh, hopefully we can get it done down and I can show you how that actually works.
They're they're already. Okay. Yes.
Good. So, we have we have uh, 4 max studios. Um, and I'll show a demo of how that works exactly in a in a second. Um, but uh, yeah, the the problem is like I'm kind of like hesitant to say like, oh yeah, just go get all your hardware and do that because like I said, today it's especially if you're just picking random piece of hardware, it's like there's uh, not much you can do, uh, especially like if the hardware is not doesn't have GPU, right? So, if you're just taking like Raspberry Pis or something, it's like you know, you you again, I always see these demos on Twitter of like, oh look, I combined these Raspberry Pis and ran this big model.
And then, you know, you look into the details and probably they've like quantized it heavily, it's not actually that useful, and you know, it's probably like really slow on the prefill. So, like it's not like really usable.
Um, but uh, if you happen to have GPUs laying around, then I think, you know, you could create a cluster and have something quite capable.
So, where's your philosophy versus zero-trust networking? I guess that's that's the angle that I'd love to understand. It's just To me, it feels like the solution you're building is how do you get hardware useful in that context?
But, in terms of exposing server network, right? We're talking about like using an iPhone or something similar, but it's you're doing offloading inference to something like that's in your control.
>> Yeah, yeah. Like, again, like doing inference on a phone, I don't think is that that would is many years out, I would say. Um so, yeah, like as far as how that would be used, yeah, you're going to use like So, with Exhale, it runs um exposes a HTTP like just a API endpoint on each device that you run it on. And then, you know, you can use something like Tailscale. Like, we use Tailscale um to access that remotely.
Um they solved that problem really well.
Just like securely accessing your local device.
And then, yeah, you can be anywhere, right? So, I can be here and then have my cluster at home and chat to it with an app, right? Like, whatever.
>> bundling or are you bundling with all the native Tailscale primitives or is it on Uh we're not we're not we're not bundling with Tailscale. Um but we might have um our own kind of solution for this soon.
Like, um it's uh It depends like how much of a pain point it is cuz I I I don't think it's that hard to like set up your like especially if you're setting up your own cluster and stuff, like then having your own Tailscale isn't isn't that much, but like something that we we've experimented with, you know, a while ago, and uh it might be something that we do.
Um Yeah, but I I I do think this this is kind of how how you how you make something like this usable.
Right?
Um you know, you want to be able to access it on the go. You want like like what I imagine is you have your cluster at home, um or maybe it's just one box, right? And uh it has access to all your data, right? And you can securely access it from your phone anywhere. And it's running something like open claw, so you can just like other things. I mean, you can do this today, right? It's just expensive.
Um but, you know, if if you're talking about now that setup where you can just have that, you don't have to worry about, you know, these privacy concerns of like where your data is going, um and you can pay like $5,000 for that, you know, I think there's a pretty there's a lot of people that would buy that.
Cool. Uh I think we can maybe try demo.
Uh Is there anything else Yeah, I guess like one thing to keep in mind here is like I mentioned it earlier, but in the cloud, you can batch. So, like in the clouds, you've got like you know, you might have like a million users or millions of users using your model, right? Which is the case with something like Claude or OpenAI. And so, you kind of have the benefit of just like being able to take all these requests and efficiently schedule them, also known as batching.
Um this is kind of how it works, right? So, instead of doing inference each inference sequentially, you can do them kind of together, and you get these like really nice um economies of scale.
Uh you can't really do that locally because you might be a single user.
Um and uh yeah, this is kind of how it works. You can basically climb up this like as you increase the batch size you can climb up this uh this line which allows you to get better utilization out of the hardware.
Um and uh especially with the data center GPUs, this works really well.
Um but I would argue actually doesn't matter um so like uh I think this is one interesting thing before we go to demo I just want to talk about is just like I have like three um reasons why I think um this so like obviously if you can batch, your unit economics are going to be way better in the cloud, right? Which is then okay, like if your unit economics are 100 times better in the cloud than local, well then local is always going to be super expensive relative to what you can do in the cloud, right? So that's kind of the argument. But I would say um actually um you know, there's going to be some level of batching even if you're a single user locally. So first thing is like multi-agent, so uh I don't know why this was partly in Chinese, but uh this is like Grok 420. I guess it's not in beta anymore.
Uh so it's actually out, but basically it uses four different agents. I think even more now. I saw I was using it the other day and I saw like you know, it shows you like what the agents are doing and it seemed like there were more than four.
Um but basically you're actually running now as a single user. If you do one request instead of it just being one pass through the model, it's like these agents that are kind of you know, collaborating and they're all running together. So uh if this is the the paradigm, then you actually you know, you're you're not going to be running stuff at batch size one, maybe you're running stuff at batch size eight locally and you know, with the the kind of characteristics of the hardware, especially with something like a Mac, then you're able to get really good utilization at that kind of batch size.
Um second thing is uh test time scaling. So actually the first thing is a form of test time scaling, um, but I think there's a lot of interesting work being done on more general approaches that like are like search-based approaches. So, a lot of the current AI, I guess, paradigm is like about learning, right? Um, so, like, how do you scale these models, train them on loads of data, but there's actually another thing that not many people are doing at the moment, which is like, you know, scaling, um, with search. And, uh, what I mean by search is just like simple thing you can do is like best of N. So, instead of running one uh, pass through the model, you do like 10 passes through the model and you know, have some way of picking the best one. And you can train a model that kind of can basically verify these responses and score them. And then you, you know, you can do more sophisticated forms of search. So, like, there was a hugging face, I don't know if I have it here. Is this it? Um, yeah, there was like some research that came out of hugging face that, um, basically showed that you can run a, so, I believe this is a 1B model, um, and it's looking at the accuracy of the 1B model as you scale test time compute. So, uh, with various different methods, right? And basically, what you see is like there seems to be some scaling law here in terms of you can run a smaller model and do more test time compute, more search, and get the same performance as a bigger model.
So, you know, maybe the next paradigm will be something like this as well. And then again, you can start to batch, right?
So, instead of batch size one, it might be batch size eight or something.
Um, the first the third one, which I think is, uh, really interesting and I think it will probably hit an inflection point maybe this year, is continual learning.
So, this is the idea that instead of just, you know, training your model up front and then you know, at inference time you're just doing forward pass through the model, you're actually training the model at inference time as well. So, um Do I have something for this?
Yeah, so so basically the idea here is um what you might have is like everyone actually has their own version of their model weights based on how they use the model and you know, their own data. And uh you know, this is kind of um there's this whole area of like test time training um where you know, there's been a few papers recently that basically have shown that you can get you've got like this long context problem at the moment, right? Or memory problem.
Where the models forget and like, you know, you have to have these different sessions cuz you have limits to context and stuff like that. You have uh test time training you can solve a lot of these problems because now there's no such thing as context anymore cuz you're literally like updating the model weights as you're using the model.
Now, this would completely break cloud unit economics because now you can't batch. So, you wouldn't be able to do this anymore because each model is actually a different model. So, you know, depending on how this lands cuz there's scenarios we're here where it might just be a small part of the model that changes with test time training, but on the extreme end, if the whole model is changing, then you can't batch at all. So, you know, that would basically put you know, local will get like 10x better in terms of relative to cloud if this happens.
Uh anyway, I'm going to stop. I have more slides, but I'm going to stop there and try and get demo if we're ready. We're ready for a demo?
>> the Wi-Fi is working. Just a second.
Okay.
Any more questions in the meantime?
Yeah. Just like what what's your like general uptime from owning that setup?
This setup here. Like how much do you get like 100% Uh this? I saw some post on Hacker News that someone like wanted to rent out their spare capacity. Like how much do you personally Yeah. use it 100% of the time or full time? No. Uh So I I still use um At the moment I still use a lot of like Opus.
Uh and you know, um Yeah, I I I use a lot of like Grok and stuff like um I do use local for something. So um And it's growing as well, right? So like the set of use cases where again it's like good enough is kind of growing. But I would say utilization is pretty low at the moment. So I guess uh what to to repeat your question. So it was about like utilize uh utilize how much utilization do you get out of the cluster? And also what about renting out that spare compute? So this is I think a really interesting idea as well.
It's not something we really focus on at the moment because it doesn't make sense if we don't have scale.
But if we have scale, let's say we have a million Exa clusters, then and everyone's kind of using it for themselves, then the utilization might be quite low.
Uh in which case there's all this spare compute all this idle compute that's just sitting out there. So why can't we make use of that for something? Maybe it could even be you know, for like a uh a volunteer like science kind of um problem, for example, right? Where it's like, "Okay, well, any spare capacity I have, it can go to like solving this scientific problem um which requires a lot of inference."
Right? I think this is a really interesting idea and it's something we'll revisit once we actually have, you know, a scale.
Um Yeah, I think um there's there's a possibility here as well where actually utilization goes up a lot because it even for your local cluster without this because imagine like you can't actually get frontier level performance locally, right? Well, and you can just give it access to all your data.
Like what I would personally want is um I would just want this like because I'm not paying for tokens, I just want this thing to run all the time. And like maybe it can like proactively, you know, tell me about things. It can be scanning the internet um for things that are relevant to me.
It can constantly be like, you know, it could be thinking about, "Okay, future direction of EXO, maybe things to look out for." Um you know, you basically have this like 24/7 agent that can be looking out for you and as like a companion. I think that would mean utilization goes up a lot, right?
And uh you would just want this thing to be running all the time.
So, in that case, maybe there won't actually be that much uh idle compute out there.
Um but uh I think it's interesting. Yeah, once we reach scale, then we start to think about these questions of, "Oh, actually we have the equivalent of the biggest data center biggest data center in the world. It's just distributed across the globe, you know, um is there some way to make use of that?"
And uh yeah, I think inference is quite easy actually to do in this setting because you don't really need much uh communication happening between different machines. Um especially if, you know, if people are already running these frontier models on their hardware, then they already have the capability to run that, right? So, you don't need to do any kind of clustering over the internet or anything like that because they already have this capability. So, then it's just a matter of distributing work uh in like kind of data parallel way.
And that's really easy to scale.
So, Yeah. Um on that topic, do you think there is like some parallelity back when everybody was mining like crypto at home and was building their GPU rigs?
Because then Yeah.
more and more people started doing it, prices dropped, you could argue prices per token dropped, right? So, it's not worth it for lots of people. Yeah. My argument would be it's never going to be worth it to rent out your hardware to other people if you're just >> Yeah.
want If you just want financial gains.
If you're doing it for non-profit, then I I I think I'm mostly So, the the question was like uh do you think it would be like crypto mining where originally it was economical to run your own GPUs and mine Ethereum, for example.
Uh but then eventually, you know, you had these large-scale operations that made it um just not worth it to mine yourself.
Um Yeah, I I I think like um There's again a few things here. So, I mean, some things are different, right? Because >> Yeah. running a local model, you have all the benefits, your data is not going to the cloud.
Crypto miners who just want 100% all the time, it's not like the same incentive, right? But there are some things that seem quite similar in the first light of if you just look at it.
Um Yeah, I I I I think like that's why I I said like it only makes sense once you we reach scale because um I don't think any project that just is its goal is to you know, give you money for renting out your compute going to work. And I think there's already been quite a few failures, you know, especially like think a lot of the stuff out there right now is kind of at the wrong level of abstraction where you're renting out the hardware, whereas what you should be renting out is the use case, right? So like it should be you can go higher and higher up in the stack and you could for example charge for tokens or you could charge for like something even higher level than that like a task, right? And I think the higher level you go, the more you can get creative with how you make use of the hardware. And for example, maybe it turns out that okay, this compute that's out there, um people are willing to rent it out for like very cheap, right? Um because uh it's literally just spare capacity that's sitting there, right? So they wouldn't be getting anything for it otherwise, in which case, you know, if you can you know, if you can basically for example build something that's higher level that let's say is something like uh uh imagine an API where you can just submit a task and it will get done in the next 24 hours, right? So it's not latency sensitive, doesn't necessarily If it fails, can just be retried.
Something like that, you could maybe have an API that's really really cheap that runs on this kind of network.
Um but yeah, I I think like the first thing is first you need first of all you need like people to have a lot of capable hardware, and I think locally I would be a big catalyst for that. And then you need scale, right? There's no point doing this on 100 Macs. It's literally like negligible amount of compute. You know, it becomes interesting when you start to s- you know, look at like okay, what if we had a gigawatt of compute? It's like well, you know, how else are you going to get that compute? There's a few companies that have data centers that need all these You have to go through all this you have to raise so much money you have to have these permits and stuff like this is like kind of a another way to access like large amount of compute. So, it could be interesting. Um But yeah, it's it's not really like something that we're thinking too much about until we reach scale.
Yeah, I think we're almost like we're quite close to time so I want to show the demo. Uh Yeah, is it ready?
I have to SSH.
Should be.
Okay.
No, it's working.
Cool.
So Uh pop pop pop. Okay. So, yeah, we've got four Macs here.
Um and they're all running XO.
So basically how that works is just a Mac OS app that you run in the background on each machine and they Um that's all you do. So, like you just install this app, run it in the background, and what happens is they're connected by Thunderbolt. So Uh I don't know if it's easy to show.
You can see it better if I turn it around, I guess.
So basically um each Mac is connected to each other Mac.
So, they're connected with Thunderbolt 5 which is basically like it's kind of just like uh a wrapper around uh PCIE, so it's like pretty fast.
And um we actually did some work uh um recently um to integrate low-latency RDMA into EXO.
So, prior to that the latency between Macs, because this is just like consumer hardware, it's running this bloated Mac OS, uh it was really slow. So, you would get like 300 microseconds, like 0.3 milliseconds of latency if you wanted to send data between the Macs. Now, problem with that is um doesn't really allow you to split workloads in a efficient way where you can actually scale up. So, what you want to do if you want to actually get a speed up is you want to kind of distribute each layer across um your machines.
Um if you have 60 layers in a model, which is the case with Kimmy or Deep Seek, then that's um at least 60 times that you have to synchronize every time you run one like uh generate one token. So, and it actually, in the case of tensor parallelism, which is uh a way of splitting up um you know, your uh tensor operations across these machines, you have to do two synchronizations per layer. So, that's 120 synchronizations for a model like Kimmy. If that's going to let's say it takes, you know, 0.3 milliseconds, that's a lot of time. That's like 40 milliseconds that's spent just on uh the communication, right? So, that would be really limiting in terms of the speed up that you get. And in fact, prior, like back then, you wouldn't even get a speed up. So, by clustering stuff, like the benefit is you have all this memory that you can split the model across, so you can fit bigger models, but it'll be slower.
Um anyway, long uh story short, with RDMA, it's 100 times faster. So, it's like single-digit microseconds, which now, instead of it being like, you know, 30 mi- uh 30 milliseconds of communication, it's like less than 1 millisecond, right? Which is perfectly fine if you want to kind of scale up these big models. So, GLM 5.1 just came out yesterday.
Is it a trillion parameters?
1.
I think it's trillion parameters.
Um and uh yeah, basically um how the hell are you going to run a like trillion parameter model?
Yeah, it's like massive.
Um so well, the idea is uh you wouldn't be able to run that on a single device, it's too big. So, we're clustering across these devices, and in combination with this RDMA capability, can actually run it faster than it would run on a single device. So, uh it's already loaded into memory, you can see it here. So, GLM 5.1.
And uh yeah, the way XO works is you can spin up these instances of models, and an instance is just um one configuration of a model that's running on your cluster.
So, um it's already loaded into memory, and you can see like the utilization on each of the memory utilization on each of the machines is like 112 gigabytes or something. So, this is this, because of like takes like really long time to download these models, we just downloaded the 4-bit one, um which is still pretty big, it's like what is it? Almost 400 gigabytes.
Um but uh the full model here would be like 1.5 terabytes. It's just, you know, uh it literally came out yesterday, so we just downloaded the 4-bit one, but um yeah, I can just uh chat to the model. So, um does it know about AI engineer? Curious.
I have a quick question about this. Have we converted into MLX?
Sorry? Have we converted this model into MLX? Yeah, yeah, we did that yesterday.
So, basically uh when a model comes out, like there's always this rush to like make it compatible with the hardware.
So, we did uh Leo did that yesterday.
Um downloaded the model, converted it.
Fortunately, with this one it was quite easy because um basically um GLM 5 was already out, right? So, this is just like another checkpoint of the same model.
So, the architecture is exactly the same. We already had support for it. So, with this we just had to convert the weights. Didn't have to Normally, if a new model comes out, like Gemma 4, we had to do a lot of work to get that working. Especially like Gemma 4 is quite different in its architecture and that um the way like the KV cache works is very different to like any other model.
So, we That was a bit of a pain to get work and it Yeah, anyway.
Uh so, yeah, that Did it know about it?
First to a prominent community.
It's founded by Swyx, yeah.
I guess this is recent checkpoint, so it knows about uh knows all about this. I don't know how long AI engineer has been around, but um yeah. So, this is running across these four machines and if you uh So, if you look uh it's I don't know how big Can you see it well? Like you can see it basically the utilization on all of these machines is um is like uh it's it's 100% on all of them cuz all of the machines are you being used in parallel. So, this is tensor parallelism with RDMA. It's a little bit like you can see the response is a bit choppy, but that's because this is I think I'm connecting over Wi-Fi, right? So like it's actually just the connection between my Mac and the cluster. But here all I've done is like this is running tailscale. So we have um a uh machine that has a host name James and uh you know, I just I'm just basically connecting to that uh to the dashboard over tailscale.
Um yeah, that's uh I don't know if we can maybe run a different one as well. This just depends what we have.
Uh da da da Let's try this one.
Quen 3.5 should be fine, right? I tried it.
>> yeah. Uh Uh yeah, that should be fine.
So like I said, you can um you can basically have like different instances uh of the of um of different models.
Uh and you can run these all together.
And um this one is downloaded on a few of them.
So I can try this. So this will be a small much smaller model, so it should run a lot faster. Should be downloaded on these.
So if I launch that, um it will load the model into memory. So you can see on the two machines that it loaded on the memory went up a little bit. So on Mike and S13, the top and the bottom ones.
It's now like higher.
And uh once it's loaded into memory, you can just uh chat with it.
Uh and it should be, you know, much much faster than the other one. So again, it's a bit choppy cuz of the Wi-Fi, but you know, you can see the tokens per second is like 77.
Um and um yeah.
Uh So like if I get it to do something a bit longer, you can see basically the utilization is going up on these two machines cuz they were again working in parallel.
Um So yeah, it's it's pretty simple.
Um but like uh a lot of the complexity is sort of in um just like making it this one app that you can install and you know, not having to do any of this network setup and it just kind of figures out the best way to shoot the model.
I also want to show Yeah. Uh I'm setting up the network.
Okay.
So yeah, I also want to show like why we bought a Spark. Like so there is um uh also the ability to split models across heterogeneous hardware, right? So you can take like Spark and uh Mac and uh in this case obviously, like I said before, this has a lot more compute. So there's some interesting things we can do in terms of splitting up the model in you know, more kind of um granular ways than just like the way that we're doing it here, which is you know, tensor parallel.
Um Cool. Well, while we wait for that, I can go back to the talk. I mean, we have 15 minutes, so Also, if anyone wants to like try we can try this. Like if anyone wants to try if they have Exo or want to install it, it's like exolabs.net and we could try also then adding that to the cluster. So because it's like, you know, super easy to um to add uh devices.
I'm not sure if it'll work over the Wi-Fi cuz it depends on the the uh firewall, but we can always connect it with Thunderbolt. I think if if we have another cable.
If someone wants to try it.
You got it.
Yeah, yeah. So, uh Okay, so Is this running the app on the Yeah, it is running the app, right? Yeah.
The the latest one.
So, let's give it a try.
See if it works over Wi-Fi first. So, Yeah.
So, if I open the dashboard, which should It's not working in Chrome?
So, yeah, the idea is like you can just connect these in any way you want. So, it's like works through a mesh network, so you know, we should in theory be able to just connect this MacBook and have it join the cluster.
Um Yeah, it looks like it isn't working over Wi-Fi. It must be something with this shared Wi-Fi, but we can try plugging it in.
Should be able to plug it in, right?
Let's see.
Oh, you don't have any models as well, so that might be We'll do a small one.
We'll do a small one.
Uh Oh, you have it?
Gemini 4?
I don't know if that will work cuz on this version of the app, um we'll we'll try it.
So, one weird thing with this hardware is you should never use this port next to the Ethernet.
Um we have We don't really have a good explanation for why, but it just doesn't work.
Um apparently Apple is fixing it, so It's fine.
You can use any other port, though.
And all of them are So, like, you need Thunderbolt 5 to do audio mate. What What MacBook is this?
Do you know what chip it is?
M4. M4 Max. So, this has This has Thunderbolt 5.
Uh, do you have audio mate enabled?
Cuz you have to You have to actually So, Yeah, yeah, yeah. I just thought it'd be cool if you can do audio mate, but um So, if we wait a bit, they should discover each other automatically.
Hopefully.
Yeah, you have Exa Thunderbolt, so it should Yeah, okay, it's working. So, you can see like um Actually, the first thing that So, this is just like an architectural thing about Exa, but the first thing that will happen is when a new node connects, it basically catches up on the whole history of what happened in the cluster cuz we use uh um event sourcing, which is uh basically like commonly used in distributed databases well, like where each machine basically writes its own uh append-only log, and then they kind of get merged in some way. And the reason we do this is because of um you know, consistency guarantees across the cluster. So, so if you're working with something that's quite dynamic like this, then devices can come in and out, you know, uh basically whenever. Like, a device could power off, like this could go to to sleep or whatever. If you're working with that, then how do you like guarantee that if a request is going on, it's not just going to get lost?
Right? So, that's kind of uh from the very start is like how we architected this.
um So, it's just replaying that.
Um But, yeah. I guess I was talking about audio mate. So, this uh everything So, like M3 Ultra, which is this, M4 Pro, M4 Max, M5 Pro, M5 Max all have Thunderbolt 5.
Thunderbolt 5 is required for audio mate.
Um and yeah, look you can see it's literally replaying all the stuff that we just did. I don't know if it's hard to see, but um so uh yeah, but the issue right now is like you basically have to boot into recovery mode in the Mac to enable it cuz it's more of like a developer focused feature, and Apple is a consumer company, so uh they don't want this like on by default right now.
Um so, this one doesn't have it enabled, but what we can do is like you know, try basically creating an an instance uh just over the TCP IP. So, it'll be slower, but it should uh it should be possible.
Uh yeah, it's got to the point where we were running the second model now, so it's almost there.
Think it might There we go. Okay, so now you can see like this has popped into the cluster, and you can see it's only connected to this one.
Um which is this Thunderbolt connection.
So, Axle like maintains a live view of the physical topology, and with that it can figure out the best way to distribute the model basically.
So, oh, okay. We did get a warning here that there's incompatible macOS version, so this is on different macOS, but we can still try. Um so, sometimes you have it looks like this on macOS 26.2, whereas these are on 26.3, and sometimes that can be an issue, but we can still um we can still try it. So, you said you had a model, right? We can look at what you have.
Uh ba ba ba Wait, can I put this on the Can I put this >> can.
This work?
Does it have a HDMI?
This mic?
Uh I don't think it does, right?
Do we have an adapter? Do you have an adapter?
Okay. I'll show the dashboard from this machine. Can still do that.
So, yeah. You can see now the MacBook is there, but it says that there's incompatible macOS versions cuz it's on 26.2, whereas these are on 26.3. Uh but actually I can do it from here. So, I don't even need to do it from there. So, that's nice thing as well. I can access you know, any of these machines are running the same API endpoint, the same dashboard. I can just send requests to any of them.
So, you said you had a model. I don't see it, though.
I don't see you You don't have any models.
I mean, it's fine. We can we can download a model. We can just do a small one. So, Ah, I see.
Ah, okay. So, you have it from using MLX. Yeah. So, that puts it in a different directory.
Um it's fine, though. Like what we can do, so you can filter here. So, you can basically select any set of nodes and then it will filter configurations by those nodes. So, if I filter by ones that contain the uh wait.
Is that filter working?
Or whatever. That that looks correct.
So, if I launch it onto here, see how it plays with the with the different macOS version.
Hopefully, it works. Oh, didn't like that.
I I think it like Oh, no, it is just downloading. Okay. So, when you launch a model, obviously first it needs to be downloaded.
Um which is actually quite a big pain point with this stuff. So, if you're running stuff locally, obviously you need to have the entire model weights.
That means, you know, sometimes you need to download like a terabyte.
Um which yeah, you just need high-speed internet, basically.
Um and obviously in in the case of like when you're splitting the model across machines as well, like you basically need pretty much the whole model weights on each machine. You can maybe get smarter with like putting parts of the model on different machines.
Um but it's kind of difficult because now if a node dies, then the part of the model that it's responsible for will be different.
So, you basically need the whole model on every machine.
Uh so, if this finishes downloading, we can run an inference. So, this is Qwen 0.6B, which is basically as small as it gets.
It's only like what is it? 0. 0.3 GB.
Sh- shitty Wi-Fi, I guess.
>> Can you see this uh existing model uh with Xcode XL?
You can use it, but it's it would need to So, like right now, you have to like manually migrate the models over to EXO.
So, if you're using a different app like llama.cpp, it won't be uh possible to easily uh transfer it, but we're going to add actually that so that you can just If you have models already from a different app, then it will just EXO will recognize it. But, it's downloaded.
So, it says it's ready.
Uh So, this should be running on the Mac.
Yeah, so you can see Let's do something longer.
>> Okay, maybe the utilization here is broken, but uh yeah, basically this is running on the the MacBook here.
So, yeah, the idea is like I mean, you get the idea, right? You can basically just like run this application on any of your machines and connect them any way you want, and then Exa will figure out the rest.
Uh You have it? Prefill decode? Okay. So, next thing I want to show you is like, okay, what if it's not just Mac? So, what if you have different hardware? So, I'll just quickly show you like you have the Spark here, and you have a MacBook.
Do you want to connect that to the HDMI?
Yeah, sure. Uh which which It's this Do you have HDMI port?
>> Yes. Okay.
I'm just trying to get But it it is it running? Like, can we try it or Yeah, so the idea is run prefill on there, run decode on the on the MacBook. This thing has twice the memory bandwidth the MacBook has 546 GB per second. This has 273 GB per second of memory bandwidth.
Whereas this has four times more compute.
Um so, like the ratio that difference in the ratio is like 12x.
Um so, the idea is basically uh you run like for really large prompts, uh you get actually get a big speed up by, you know, not just running it on your MacBook, but also splitting it up across here. But you don't want to run the decode on here cuz it's slower for memory bandwidth.
So, basically uh yeah, you you split up these two phases.
And then there's like some complexity to like streaming your KV cache cuz now the KV cache only exists on here.
Um so you need to somehow get it to the MacBook so it can do its decode, right?
Cuz the KV cache reads it every time it does a pass through the model. So, the way we do that is uh they're connected by 10 gigabit Ethernet.
I don't know if you can see, but uh 10 gigabit Ethernet. The MacBook doesn't actually have an Ethernet port, so we use a 10 gigabit Ethernet adapter.
And uh so it's a bit of an annoying actually because ideally like you can just have some USB-C cable or something, connect them with that. We don't have that working yet, but well, we kind of have it working, but it's not really like production ready cuz you have to run a bunch of scripts to get that working.
Um but uh So, why is it going over Wi-Fi?
Yeah.
Okay.
Uh is it should we skip that skip this?
>> we Okay. You can try a little bit longer.
We have maybe 5 minutes and then uh yeah, the issue is like um for some reason it's sending the KV cache over Wi-Fi at the moment, which then the bandwidth becomes a bottleneck.
So, what you need That's why we need 10 gigabit Ethernet cuz you're sending this KV cache over. You don't want that to like uh block, you know, you being able to compute the decode phase, right? So, you need to basically overlap computation and communication and like stream the KV cache over to the MacBook.
And it needs to be fast enough so that you can fully overlap. If it's not fast enough, then you'll be sequential. So, you'd like do your pre-fill, then it will still be sending, and then it would do the decode. Which is um which is going to be slow.
Uh can I have the HDMI back?
>> Yep.
Yeah, I guess we're almost Well, we're out of time now. So, uh I will close off with one thing, which is okay, like I'm telling you all this stuff, but why is this not already kind of more understood or more known? Well, basically, like the best source at the moment for this stuff is like Reddit or Twitter, and everyone says different things. So, like you know, I told I mentioned before like, you know, there's maybe a lot of threads that you see or like people that run experiments. Now, you have all these citizen scientists as well, which I think is cool because, you know, you have really capable AI tools, so people can quickly experiment and try things. But then, the flip side of that is like there's a lot of noise. And uh if you don't truly understand what's going on, then you might actually think you have a result.
Um and like the LLM is telling you, "Oh, you've made a breakthrough."
Um but actually, it's, you know, it's it's not so interesting, and it's it's not really usable. Good example of this is like people heavily quantizing models. So, if you like quantize a model to one bit, then you're better off just using a smaller model and not quantizing it. Um so, it's not like these these models are not very useful, right, at one bit. So, uh you might see some threads, for example, about, "Oh, I ran Kimmy on a MacBook or something." But, you know, it's like the one-bit version, and maybe they also pruned it or like, you know, instead of it activating eight experts of the model, it's activating two experts of the model or something, you know? And like what we want to do is actually like bring some transparency around this. So, we have loads of hardware. Um Yeah, we we we have probably the most like hardware specifically for this purpose.
And, you know, our idea is we want to publish benchmarks in the open that show you, okay, what performance will I get on certain hardware.
And um this stuff is changing really fast as well, right? So, like I said, I think there's a 100x, you know, in there. So, this will also be a source to be able to track progress, right? So, you'll be able to see like, okay, you know, how is how are things getting better? Is the software improving? Is the hardware improving? You know, are the models improving? So, this will be very soon, uh you know, let's say within the next month we'll come out with a website that basically has thousands of benchmarks on there. We're already continuously we're we're getting the data now. So, we're continuously running these benchmarks, different models, different quantizations, different ways of pruning the models.
We also we want to pair this not just with like raw performance of like tokens per second and prefill time, but like also the quality of the model, right? So, for example, intelligence per joule, right?
Is one we're looking at looking at that.
Um so, basically you'll be able to select a budget, let's say $10,000, and see like Pareto frontier of all the local uh setups at that budget. And um it will show you like, you know, if you want really good quality, you might have to like compromise a bit on the performance. So, if you want like to run GLM 5.1, then maybe you're going to get 20 tokens per second. Um but maybe, you know, for you it's fine to use a smaller model, in which case maybe you use Gemma and you get like 100 tokens per second.
And all of these kind of exist on different points on this Pareto frontier.
Um so, yeah, I I guess I I'll close with that. Um Um, unless we have this or Okay.
Uh, yeah.
Um, yeah, I think I don't know if I think we're basically at time, so I'm not sure if we have any time for more questions, but yeah, thank you so much.
Thanks.
Uh, can I ask how we are for time if anyone does any have have any quick questions? Or should we close it?
Maybe one question if anyone has one.
Or not.
Yeah.
Um, so, Simon Woods recently about somebody who's uh, Andrej Karpathy's auto research loop in order to optimize the model for a particular specified hardware.
Mac with 48 gig like those 11-inch MacBook Pros at each point off the off the disk. Have you done anything to optimize for your hardware setups to to to make the models Yeah. match match what Yeah. what you've got.
Yeah. I I guess I'm on the more cynical side with that work. Uh, I have seen a lot of stuff out there that is kind of again like cherry-picked and I think especially if you're using auto research without really understanding what's going on, you might it might not even be that you're trying to like hype this up, but you actually believe that you've made some breakthrough.
Um, so like I'll give you an example. I've seen like stuff where it is just a one-bit model running and they say like, "Oh, look, we have you know, Kimmy running from disk um, at And it's not even that good performance, right? It's like five tokens per second or something, but you know, and then also you have all There's many layers to it, right? So, like auto research might make a change that really impacts the performance like the quality of the model, and you don't realize it cuz you think it tells you like, "Oh, look, it's running." So, I think auto research really powerful tool, but my my opinion on this is you still need to follow the scientific method. So, you need to have like a well-reasoned hypothesis hypothesis. You need to then run experiments to test it. And then you know, iterate basically. So, like if it's just like, "Oh, hey, auto research, go off and figure out how to make this fast." It won't work. I think that's a pretty much a data is slot machine at that point. It's a slot machine. It's like And it might be addicted addicting, and you might think like, "Oh, wow, I've I'm making real progress here." But I think actually the set of things it can come up with with slot machine approach is very small. And it will be very sparse as well. Maybe eventually Maybe it will come up with certain things. Um you know, I if enough people are running it and enough compute you throw at it, but it's like brute force. It's basically like brute force approach.
Um So, like I really like auto research and I using it as well. Like I think it's a uh really powerful tool, but it needs to be used in the context of scientific method where you actually understand what's going on.
So, yeah, but then it's specifically on the disk thing, I just don't think it's interesting. I think um Obviously, the idea of like having memory hierarchies is is is interesting.
That That's just like a fundamental thing about um compute, I guess, but um specifically disk, it's just too slow.
It it it there are just like fundamental constraints there. You're better off focusing on the memory problem. Like that's going to be much more cost-effective.
Uh unless you do some like crazy setup where for for some reason, you know, stacking a bunch of SSDs is like way cheaper than um than using normal memory, right?
Which might maybe that would Yeah, like with memory prices and stuff, maybe that would become interesting, but like it's going to be it's not going to be a MacBook, right? Because MacBook has one disk and it's not that fast. Um but like I have seen some people looking at the problem of like, "Oh, what if we put like 64 SSDs in a box and you know yeah, with RAID or something and uh and then start to use these techniques where you can be smart about loading in experts, you know, you can there's a lot of research on like predicting which expert is coming next, for example, so you can like optimistically load it and then you know, maybe you're right 90% of the time, wrong 10% of the time, then you have to load it again and it's like yeah, you can do a lot of clever tricks there, but like for me like okay, like do those tricks, but do it with normal memory. Don't like why are you why are you looking at, you know, doing this with disk?
Okay, thank you.
And Alex in the background it worked.
Oh.
Okay, we It did work. Wait, do you want to show it then? Yeah. Like so so basically uh we can show like before and after, right? Can we show before and after? So so one is like the MacBook running on its own.
Um and you can see like this is a pretty large prompt, so it's a paper 38 kilobyte or like well, two of them, so Uh I'll just do the after first, but Okay, doing the after first. So, the after is using both together, prefill decode. So, it will do the prefill on the Spark and the decode on the MacBook.
And like one thing to keep in mind here, this doesn't make sense if you're doing a small prompt. If you're just saying hello, it does no speed up. But for like large prompts, where the prefill time also goes like quadratically with the size of the prompt, right? For most like architectures of model. So, if you're doing really big prompts, it's going to actually take a significant amount of time of the inference.
So, in this case, yeah. So, there's I don't know, this is like 100 kilobyte prompt or something. Uh yeah, it's a lot of Is it working? No.
Yeah. Yeah. Okay. So, basically, what it's doing is it's Now, you can see the utilization here.
It's like 100% only on the Spark, cuz it's running the prefill there, and it's streaming the KV cache over to the MacBook in a way that's like overlapping.
So, I think I got to Oh, well. Okay, Wi-Fi, I guess. Okay.
Should we try one more?
Okay. Never mind. You get the idea anyway.
Uh So, yeah. Like in this in this specific example, I think it's about 2x faster, is it?
Uh yes, it's around 2x. Yeah, so you get like on the end-to-end time of, you know, doing this whole thing, the prefill and the decode stage is about 2x faster um than you know, just running it on the MacBook versus running it on both.
So, yeah. And then again, you can imagine like the next thing here is like, okay, instead of a Spark, you know, what about just doing this with just a GPU, right?
So, you could do this with a RTX 5090, which is cheaper, and it also has a lot more memory bandwidth and to file compute. So, then you can start doing more interesting things.
Um where you split parts of the model.
Is it working? Uh it's the same model.
Okay, that's that's single node? Yep.
Well, I don't know if we need to How long did that take?
Uh like uh 7 seconds for one paper.
Okay, 7 seconds for one paper with the Does the Mac work?
And then We're We're quite over uh Let's just try one more time. If it doesn't work, it's fine.
That's it? Uh no.
Oh.
Yeah, so now the same paper.
Now it should be running across both. Does it have to do one more?
Okay.
Yeah. So in that case, 4.8.
So again, this gets better as you increase size of the prompt.
Um Oh, cool.
Okay.
I know we're over, so I'll stop there.
Thank you.
Related Videos
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
So What's Odin Lang Even Good For
TechOverTea
131 views•2026-06-01











