This setup demonstrates that matching each LLM inference phase to the hardware that suits it can recover the strengths of both machines: prefill on the compute-heavy Blackwell GPU, decode on high-bandwidth Apple silicon. Ziskind's results show memory bandwidth dominating decode speed for small models, with that advantage narrowing at 27B and 32B.
Deep Dive
I Plugged a DGX Spark and Mac Together... and Didn’t Expect This
The DGX Spark is incredible at processing your prompt, but considerably slower at generating tokens. The Mac Mini is the opposite: slow to process your prompt, but fast at streaming the response. What if you could combine the best of both worlds? That's what I tried. But just because you can doesn't mean you should. All right, here's what I got so far. Here's my setup. On one side, I have the MSI EdgeXpert.
Basically, it's a GB10, just like the DGX Spark, just from a different vendor.
It's got Nvidia's Blackwell GPU with 128 gigs of unified memory. On the other side, a Mac Mini with M4 Pro, 64 gigs of unified memory. And of course, we'll try the Spark and a Mac Studio later on. When you run a large language model, there are two phases. Prefill, that's processing the entire prompt, or PP as I like to call it sometimes.
I don't know why I'm pointing. It's weird. It's like, ah, nobody's there.
It's not rude, right? It's just the camera. So, the prompt processing is compute heavy. That means a lot of number crunching is going on on the GPU, and GPUs are great at that. The next stage is decode, which is token generation one by one. This is memory-bandwidth heavy, and Apple silicon is great at that. Now, typically we do this on one machine when we're running it on our desk, but when you split it up, it's called disaggregated prefill and decode, and it's not just some academic concept.
Companies like DeepSeek and ByteDance and a whole bunch of other ones already do this in production. Splitting these two phases lets you optimize each one independently, and it's one of the reasons inference costs have been going down. I've been watching the Exo project tease disaggregated prefill and decode for consumer hardware on Twitter for months now. They even put together a nice little blog post describing the thing with nice animations, but they never actually released it. So, obviously, I have all these machines sitting around on my desk. I tried making it work myself, and it drove me nuts for months. I'm not a systems programmer. I'm a web developer. I don't know anything about Rust networking code or libp2p multicast protocols. That's not me. I just happen to have the hardware here and the capacity to test what's already out there, but I'm not smart enough to create this stuff from scratch. Then the community opened a couple of pull requests adding Blackwell support for prefill/decode disaggregation. Now, it's experimental, untested on real hardware, maybe, I don't know, but the code was there. So, of course, I pointed Claude Code at the pull request, gave it full SSH access to both of my machines, and said, "Make it so, Number One." But of course, all these tools are really expensive.
So, quick sponsor break and then we'll be right back. So, these days I'm always flipping between models. GPT for research, Claude for coding, Nano Banana for image generation, Veo, Kling, and Runway for video. Six tabs, six bills, and counting. Enter ChatLLM Teams. One dashboard houses every top LLM, and RouteLLM picks the right one. GPT Mini for ultra-fast answers. Claude Sonnet for coding. Gemini Pro for massive context.
They recently added Gemini 3 and GPT 5.1 the moment they dropped. Create professional presentations with graphs, charts, and deep research detailed content. Need human-sounding copy?
Humanize rewrites text to defeat AI detectors. Need visuals? Pick frontier or open-source models: Nano Banana, Midjourney, and Flux for images, Magnific upscaling, plus Veo, Wan, and Sora for video, all built in. You also get Abacus AI Deep Agent to pretty much do anything. Build full-stack apps, websites, and reports with just text prompts and deploy them on the spot. They have Abacus AI Desktop, which is the brand new coding editor and assistant that lets you vibe code and build production-ready apps. And the kicker, it's just $10 a month, less than one premium model. Head over to chatllm.abacus.ai or click the link below to level up with ChatLLM Teams.
All right, and off it went to the races.
It SSH'd into both machines, started setting everything up, installed uv, cloned the exo repo, built the Rust networking bindings, and compiled vLLM from source on ARM Linux. That took a while. CUDA kernels for Blackwell aren't small. On the Mac side, it built MLX from source with all the Metal shader compilations, installed Node.js, and off we went.
Easy, right? Well, it only took a few days.
Initially, after about 30 minutes of compilation, both Exo instances came up.
The dashboards loaded fine. First signs of life were there. Then came the hard part.
See, Exo uses mDNS to discover peers on the network, and these two machines just could not see each other. I spent hours with Claude trying everything.
A direct Ethernet cable, USB adapters, modifying the Rust networking layer, a Thunderbolt cable, and a good quality one, and nothing worked. Well, eventually it did, but I had to get some more hardware involved. Got to push it to the limit. Then I ran tcpdump and found the real problem. Apparently, libp2p's mDNS is broken on macOS. And the fix was simple. You just have the GB10 machine, the Spark, dial the Mac Mini instead of waiting to be discovered. So, I added an environment variable, set it on the MSI EdgeXpert, and the connection came up instantly.
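A minimal sketch of the "dial instead of discover" idea described above. The variable name EXAMPLE_DIAL_PEERS and both helper functions are hypothetical, made up for illustration; the transcript does not say what the real environment variable is called or how exo reads it.

```python
import os

# Hypothetical variable name, for illustration only. The transcript does not
# name the real environment variable.
PEER_ENV_VAR = "EXAMPLE_DIAL_PEERS"


def parse_dial_peers(raw):
    """Parse 'host:port,host:port' into a list of (host, port) tuples."""
    peers = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        peers.append((host, int(port)))
    return peers


def choose_peering_mode(environ=None):
    """Return ('dial', peers) if peers are configured, else ('discover', [])."""
    environ = os.environ if environ is None else environ
    peers = parse_dial_peers(environ.get(PEER_ENV_VAR, ""))
    return ("dial", peers) if peers else ("discover", [])


# With the variable set, the node would connect outward to a known address
# instead of waiting for mDNS announcements that never arrive.
print(choose_peering_mode({PEER_ENV_VAR: "192.168.50.2:4001"}))
print(choose_peering_mode({}))
```

The point of the design is that only one side needs the setting: the machine that can reach the other dials out, and the connection is then usable in both directions.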
Finally, with the cluster connected, I loaded the models. I started with Qwen 3.5 27B, a nice, decently sized model for these machines. On the GB10, I ran it in BF16 with the Blackwell-optimized attention backend. And we had to have the exact same model on the Mac Mini, but in 4-bit using MLX, cuz that's what's optimized on Apple silicon. Different quantizations on purpose. The GB10 has 128 gigs of memory, so it can run full precision, and full precision is faster for prefill compute. The Mac Mini has the 4-bit version, which is smaller, and the smaller model means faster decode.
Each machine runs its optimized code for the role it's best at. And of course, it wasn't that simple. There were more bugs. Maybe that's why this PR is still open. But it was a good start. The prefill routing was broken. The Mac Mini's runner process couldn't reach the GB10. And on and on like this we went.
Each fix uncovered the next problem. But eventually after a reboot and some creative workarounds, everything connected.
So I sent a long prompt to the Mac Mini and watched it route the prefill to the GB10. The Blackwell GPU chewed through the tokens and sent the KV cache back.
It worked. The GB10 prefilled at 546 to 937 tokens per second depending on the prompt length. The Mac Mini locally managed 66 tokens per second, so the GPU was up to 14 times faster. But then I looked at the actual end-to-end time and something was off. When I broke down where the time was going, the answer was obvious.
Network transfer. You probably could have guessed that, right? At 25,000 tokens, the GB10 computed the KV cache in under a second. But transferring it over my 2.5 Gb USB Ethernet adapter took 25 seconds. In other words, 96% of the total time was just the network. The GPU was just idling, waiting. I also ran a three-way comparison: GB10 alone versus the Mac Mini alone versus disaggregated. And yeah, here I was using Qwen 3.5, which is a thinking model.
So, it generates hidden reasoning tokens before the visible response. Those reasoning tokens run at decode speed, the same on all platforms, and they dominated the time to first token. All three configs looked almost the same. To really show the difference, I needed two things. A faster network and a non-thinking model.
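The time breakdown above can be checked with a few lines of arithmetic. This is a rough sketch: the transcript only says the KV cache was computed "in under a second", so the 1.0 second compute time is an assumed upper bound, and the payload figure assumes the link ran at its full 2.5 Gb/s line rate for the whole 25 seconds, which real links rarely do.

```python
def network_share(compute_s, transfer_s):
    """Fraction of end-to-end prefill time spent moving the KV cache."""
    return transfer_s / (compute_s + transfer_s)


def max_payload_gb(link_gbit_per_s, seconds):
    """Upper bound on gigabytes moved, assuming full line rate throughout."""
    return link_gbit_per_s / 8 * seconds


# compute_s=1.0 is an assumption standing in for "under a second".
share = network_share(compute_s=1.0, transfer_s=25.0)
print(f"network share of total time: {share:.0%}")  # 96%
print(f"payload upper bound: {max_payload_gb(2.5, 25.0):.1f} GB")  # 7.8 GB
```

Whatever the exact payload was, the ratio is what matters: once the GPU finishes in about a second, almost any time spent on the wire dominates the total.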
So, uh, a couple months ago, just for this very reason, in fact, though I haven't shown this yet on the channel, I bought a Thunderbolt enclosure. It's an external enclosure, Thunderbolt 5 with a PCIe slot, and it looks like this. I got a NIC in there. Now, first I started off with this network card cuz I thought, hey, I'm going to put my most powerful network card in there. This is an Intel E810 NIC, and it has dual QSFP ports at 100 Gb/s. So, yeah, this is a nice fast network card. But when I put this in and plugged it into the Mac, macOS said, "Nope, sorry, driver not installed." Dead end. Then I dug into my drawer of network cards and I found, ah, I found a bunch of these and, ah, don't stick your finger in the fan. Rule of thumb. Rule of all fingers.
Stupid jokes. All right. I bought these cards when I was experimenting with my Framework cluster. You might have seen that video. But this one right here happens to be a Mellanox ConnectX-4. Also a QSFP port, so it's convenient. And it happens to be 50 gigs. So, not 100, but still. I plugged it in and macOS recognized it immediately. See, Apple has shipped a built-in driver for Mellanox cards since 2019. I thought about trying this other one, but it's only 25. So, if that one worked, hey, that's what we're going with. Now, on the other side of the QSFP connection, I have the MikroTik CRS812 switch, which I showed off in my DGX Spark cluster video.
Getting the switch port to negotiate took some work, but once it linked up, KV cache transfer improved by about 30%.
I imagine it could probably go even faster if we got a 100 gig card to work.
Now, the non-thinking model. I switched to Llama 3.1 8B.
Yeah, I know, I know, it's an oldie but a goodie, and it's dense and it works on everything. It's a good experiment model. Okay. And I use llama-benchy, this tool right here, to measure prompt processing and token generation properly. So it's going over the whole HTTP stack and responding to an API call. So we got three configurations. First, GB10 alone in BF16. By the way, along the bottom, I have different prompt processing lengths. So you can see that PP 2048 and PP 4096 have the best throughput, up to almost 1,800 tokens per second there. Then I ran the Mac Mini alone in 4-bit. Quite a bit lower there, but token generation down here is much faster, as you can see. And finally, disaggregated: GB10 prefill plus Mac decode. Hm. Okay. Okay. We're getting somewhere. Where are we getting? I don't know, but we're getting somewhere.
Especially when you don't look at these individually, cuz you can say, "Oh, what's the point of this?" If you look at PP 4096, for example, you just say, "Oh, just run it on the Spark, right?"
Well, then you take a look at the token generation and see that it's actually faster in disaggregated. We're getting somewhere. We're getting somewhere. Now, time to first token. If we take a look at 4096 tokens, we're almost at 2.4 seconds. So, the 50 Gb link adds almost zero overhead when comparing GB10 alone versus disaggregated. So, the disaggregated setup gets GB10-class time to first token with Mac-class decode.
And you can see that neither machine actually achieves this alone. So, is it worth it? And here's my honest answer.
My disaggregated time to first token of 2.4 seconds matches the GB10 alone at 2.3. I didn't beat the GB10 on prefill. I matched it. And disaggregated decode of 34 tokens per second is actually a bit slower than the Mac Mini by itself at 52. And that's from the overhead of injecting the remote KV cache. And if I'm being really honest, a single RTX Pro 6000, the workstation Blackwell card, would probably demolish the entire two-machine setup on both prefill and decode. This thing has six times the memory bandwidth of the GB10 and three and a half times the compute. So, was it all just a cool experiment? Well, maybe. But there is one variable I haven't changed yet. See, the Mac Mini M4 Pro has 273 GB per second of memory bandwidth. That's what determines the decode speed. But there's a machine on my desk, well, close to me, that has 819 gigabytes per second, three times more.
Oops. The M3 Ultra Mac Studio. If I swapped the Mac Mini for the Mac Studio, decode could jump from 34 tokens per second to over 100. And combined with the DGX Spark's 1,700 tokens per second of prefill, that's a setup that might actually be worth building.
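The "34 to over 100" guess above is a straight bandwidth scaling, and it can be put next to a rough theoretical ceiling. This is a sketch under simplifying assumptions that are not from the transcript: decode is purely memory-bandwidth bound, every weight is read exactly once per generated token, and an 8B model at 4 bits occupies 4 GB (ignoring quantization scales and KV cache reads).

```python
def decode_ceiling_tps(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on tokens/s if each token requires one full pass over the weights."""
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb


def scale_by_bandwidth(measured_tps, old_bw_gb_s, new_bw_gb_s):
    """Naive projection: decode speed scales linearly with memory bandwidth."""
    return measured_tps * new_bw_gb_s / old_bw_gb_s


M4_PRO_BW = 273    # GB/s, from the transcript
M3_ULTRA_BW = 819  # GB/s, from the transcript

print(f"M4 Pro ceiling:   {decode_ceiling_tps(M4_PRO_BW, 8, 4):.0f} tok/s")    # 68
print(f"M3 Ultra ceiling: {decode_ceiling_tps(M3_ULTRA_BW, 8, 4):.0f} tok/s")  # 205
print(f"projected disaggregated decode: {scale_by_bandwidth(34, M4_PRO_BW, M3_ULTRA_BW):.0f} tok/s")  # 102
```

Measured numbers sit below the ceiling because real decode also reads the KV cache, launches kernels, and rarely saturates the memory bus at batch size one.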
>> Excuse me, would you pass the tokens?
>> Only if you've like got the cash.
>> So, on my desk now, a DGX Spark and an Apple Mac Studio M3 Ultra. This one has 512 gigs of unified memory. Same exo code, same switch, same ConnectX-4 and ConnectX-7 cards. Only the silicon changed. I got it running, but of course, as usual, the setup was a pain.
The Mac Studio's runner subprocess couldn't reach the Spark, so I recreated the ConnectX-4 as a proper network service and rebooted. Same ritual I've already done before. This time, it only took a couple hours. So, I know we have more capacity here, but I thought we'd start with the same Llama 3.1 8B so we can compare apples to apples. Yes. Yes, that phrase actually works. Now, I've been waiting for so long for it to work.
Remember, the Mac Mini decoded at 52 tokens per second. The Mac Studio 106, so it's just about double. Not the full 3x that the bandwidth spec would have us believe, but pretty close. The extra bandwidth isn't fully saturated at batch size one, I guess. Prefill at 4K, or 4096 tokens. Spark alone got 1585 tokens per second. Mac Studio alone got 1420. Not bad for Mac Studio actually.
Disaggregated 1584. The disagg column matches the Spark almost exactly. And the 50 Gb link that we have going on here adds about 18 milliseconds of overhead. And disaggregated decode, 84 tokens per second here. That's down from 106 on the Mac Studio because of the KV cache injection overhead, but still six times faster than the Spark alone at 14.
So remember, we got 34 tokens per second here in the first part where we did it with the Mac Mini. So this is 2 and 1/2 times better. So the bandwidth hypothesis actually held at 8B, but the Mac Studio has a lot more RAM than the Mac Mini. We shouldn't let that sit idle. Let's kick it up a notch. Llama 3.1 70B, everybody's favorite 70B model.
It's an oldie but a goodie. It's a dense model. The Mac Studio runs this MLX 4-bit quant, which is about 40 GB on disk, so lots of room to spare. Now, the Spark is the constraint at this point. The full precision 70B model is 140 gigs and the Spark only has 128, so it's not going to fit. So, I needed to find a quantized variant that the Spark would run. FP8 is what I tried first, and that one fails. The reason basically had to do with the compilation of the CUTLASS matmul kernel, and in the vLLM version that we compiled we didn't use that. So FP8 is out. I tried the AWQ, or activation-aware weight quantization, INT4 version, and that failed also. When it was running, it was auto-converted to AWQ Marlin, and the Marlin repack kernel is missing. I also tried this one, W4A16.
Same story. This is the GPTQ Marlin.
Same missing kernel. So the vLLM wheel for Spark only ships BF16 and FP16 kernels. None of the quantization kernels got built for SM121, which is the Spark. So I'd have to completely rebuild vLLM from source with those kernels, and that would be a pain. But not to worry. Even though 70B is off the table without a vLLM rebuild, we still have a bunch of models we can use that are larger than 8B and small enough for the Spark and the Mac Studio. We have 32 billion and 27 billion class models. And at BF16, they fit on the Spark comfortably. Qwen 2.5 32B, BF16 on the Spark and 4-bit on the Mac Studio. At this point, I'm running prefill at 4K.
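The memory arithmetic behind those model choices is simple enough to sketch. This counts weights only, at 2 bytes per parameter for BF16; KV cache, activations, and runtime overhead are ignored here, so a real deployment needs headroom beyond these figures. The helper name is made up for illustration.

```python
SPARK_MEMORY_GB = 128  # unified memory on the Spark, from the transcript


def weights_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB: parameters times bytes per parameter."""
    return params_billion * bits_per_weight / 8


candidates = [
    ("Llama 3.1 70B, BF16", 70, 16),
    ("Qwen 2.5 32B, BF16", 32, 16),
    ("Gemma 2 27B, BF16", 27, 16),
]

for name, params_billion, bits in candidates:
    size = weights_gb(params_billion, bits)
    verdict = "fits" if size < SPARK_MEMORY_GB else "does not fit"
    print(f"{name}: about {size:.0f} GB of weights, {verdict} in {SPARK_MEMORY_GB} GB")
```

This reproduces the 140 GB figure quoted for the 70B model and shows why the 27B and 32B models were the largest BF16 options left once the quantized kernels turned out to be missing.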
Spark alone gets 875 tokens per second on this one. Mac Studio alone gets 356, and disagg 792. So, we're kind of seeing the same pattern as Llama 8B here, where disaggregated tracks the Spark, and that's a good sign. That's better than two times the Mac Studio's prefill speed.
Then on the decode side of the story, Mac Studio of course kicks butt here at 29. Spark 23. So only a 1.3 times gap at this size. And that's down from the 8 times gap we saw with the 8 billion parameter model. Now hold on to that.
The next model tells the same story. I went with Gemma 2 27B, again BF16 on the Spark and the 4-bit MLX version on the Mac Studio. Very similar looking chart, right? But slightly different numbers.
The Spark gets 779 on prefill, Mac Studio 379, and disagg 722. Same shape as before. I call this the upside down middle finger. I probably shouldn't call it that, but that's what it looks like.
On the decode side, we've got Mac Studio at 30, Spark at 24, and that's only a one and a quarter times gap here. So, two different architectures, Gemma and Qwen.
We're seeing the same behavior. And I had the same question. What happened to the Mac's 8 times decode lead from 8B? And by the way, uh, I see that there's 24 here and 24 here. Before anyone asks, yes, I checked disaggregated here. It looks like it may have skipped the Mac altogether, but it didn't. Every model I tested lost about 20% of decode speed to KV cache injection overhead.
So, 30 minus 20% is 24. It just happens to match the Spark number exactly by coincidence. So in this particular case, if we ran it on the Spark, we probably would have gotten the same numbers. See, at 8 billion, decode is almost purely bandwidth bound. The Mac's three times bandwidth just wins. At 27 and 32 billion though, two things shift around.
For Gemma, sliding window attention caps how much KV cache decode has to read per token, so bandwidth stops being the bottleneck. For Qwen, vLLM kernel fusion and torch compilation on the Spark side dramatically cut the bandwidth demand per decode step. So we got slightly different mechanisms, same outcome.
The Spark's decode gets relatively better and the Mac's bandwidth advantage stops mattering as much. All right. Now what do these charts look like to you? Huh?
We've got three models, three sizes, three architectures. What do they all have in common? Well, in every case, the disaggregated time to first token tracks the Spark. Llama 8B 2.6 seconds, Gemma 27B 5.7, and Qwen 32B 5.2 seconds. So the Spark-class prefill is always recovered.
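Those time-to-first-token figures line up with a simple estimate: prompt length divided by prefill throughput. The sketch below uses the disaggregated prefill rates quoted earlier at a 4096-token prompt. It assumes prefill dominates time to first token and that any link overhead is already reflected in the measured throughput, which is why the optional overhead term defaults to zero.

```python
PROMPT_TOKENS = 4096

# Disaggregated prefill throughput in tokens per second, as quoted in the transcript.
disagg_prefill_tps = {
    "Llama 3.1 8B": 1584,
    "Gemma 2 27B": 722,
    "Qwen 2.5 32B": 792,
}


def estimated_ttft_s(prompt_tokens, prefill_tps, link_overhead_s=0.0):
    """Estimate time to first token as prefill time plus optional link overhead."""
    return prompt_tokens / prefill_tps + link_overhead_s


for model, tps in disagg_prefill_tps.items():
    print(f"{model}: about {estimated_ttft_s(PROMPT_TOKENS, tps):.1f} s")
# Llama 3.1 8B: about 2.6 s
# Gemma 2 27B: about 5.7 s
# Qwen 2.5 32B: about 5.2 s
```

The estimates match the quoted 2.6, 5.7, and 5.2 seconds, which is consistent with the claim that the link adds almost nothing on top of the Spark's prefill time.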
And the prefill advantage over the Mac Studio grows with model size. It's pretty clear here. You can see that at 8B the Spark is barely ahead, 1585 versus 1420. At 27 and 32B the Spark is 2 to 2.5 times faster. So disaggregation becomes more valuable at larger sizes, not the smaller sizes. Now decode is where it gets interesting. At 8 billion, the Mac Studio is eight times faster than the Spark on decode. That's huge. That's the bandwidth story playing out cleanly. At 27 and 32 billion, the gap shrinks to about one and a quarter to one and a third times. So the decode side of disaggregation gets cheaper at larger sizes. Not because the Mac gets worse, but because the Spark gets relatively better. So, does it work? Yeah, it works. Two machines doing what they're good at, talking over a 50 gig link and spitting out tokens faster than either one could alone. That's the whole pitch and it finally delivers. So, look, as a proof of concept for heterogeneous inference, this is really cool. And if you already own both of these machines, great. Go squeeze a little more juice out of that orange and Apple.
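The gaps quoted above can be recomputed directly from the standalone numbers given for each machine. This is just a tabulation of figures from the transcript (tokens per second, prefill at a 4096-token prompt); the dictionary keys and the helper name are made up for illustration.

```python
# Standalone results quoted in the transcript, in tokens per second.
# pp = prefill (prompt processing), tg = decode (token generation).
results = {
    "Llama 3.1 8B": {"spark_pp": 1585, "mac_pp": 1420, "spark_tg": 14, "mac_tg": 106},
    "Gemma 2 27B": {"spark_pp": 779, "mac_pp": 379, "spark_tg": 24, "mac_tg": 30},
    "Qwen 2.5 32B": {"spark_pp": 875, "mac_pp": 356, "spark_tg": 23, "mac_tg": 29},
}


def gaps(r):
    """Return (Spark prefill lead, Mac Studio decode lead) as ratios."""
    return r["spark_pp"] / r["mac_pp"], r["mac_tg"] / r["spark_tg"]


for model, r in results.items():
    prefill_gap, decode_gap = gaps(r)
    print(f"{model}: Spark prefill lead {prefill_gap:.2f}x, Mac decode lead {decode_gap:.2f}x")
# Llama 3.1 8B: Spark prefill lead 1.12x, Mac decode lead 7.57x
# Gemma 2 27B: Spark prefill lead 2.06x, Mac decode lead 1.25x
# Qwen 2.5 32B: Spark prefill lead 2.46x, Mac decode lead 1.26x
```

Read side by side, the two columns move in opposite directions as the model grows: the Spark's prefill lead widens while the Mac Studio's decode lead narrows.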
But realistically, the DGX Spark and the Mac Studio are not cheap. If you're spending that kind of money on new desktop gear, I'd honestly just rather get the less portable but much more powerful RTX Pro 6000 and build out a rig around that. In fact, I did a whole video comparing that to the Mac Studio right over here. Thanks for watching and I'll see you next time.