Witteveen effectively demonstrates that AMD’s ROCm has matured enough to challenge the CUDA monopoly, offering a viable path for local AI sovereignty. This shift marks a crucial step toward a more competitive and accessible hardware ecosystem for independent developers.
Deep Dive
Running Local AI on AMD
Okay, so I think it's becoming very clear that local AI is a huge part of the AI future. Open weight models have closed the gap to the frontier models, from being really far behind to being somewhere between three and six months off the cutting edge. That gap is shrinking, not growing. Meanwhile, if you look at the frontier labs, tokens look like they're getting cheaper on paper, but practically people's bills are going up because agents and reasoning eat tokens at a completely different scale than chat ever did. And then the other thing that's becoming totally clear is that people really want to have privacy and control over the AI that they're using. And so while there are these amazing frontier models that are coming, the big question seems to be who's going to be able to afford to use them at scale. And more and more, local AI is just starting to make more sense for a lot of the workloads out there.
So the question I've been looking at more and more lately is what actual hardware stacks can get you there. And in this video I want to take a look at the AMD side of that. So I've teamed up with Zidex and AMD to get access to one of their workstations with a Threadripper 9980X and an AMD Radeon AI Pro R9700 GPU with 32 gig of VRAM. I've got to say at the start, the results I'm seeing are really impressive and have actually surprised me on what can actually be done now with this kind of system. All right, so let me unpack again why local.
The open weight models that we're seeing now, the Qwen 3.6, the Gemma, the Kimi, the DeepSeek models, etc., these aren't toys. You can do real work with them.
I've been covering many of them on my channel and if you've been following the open space at all, you can feel that this trend is not slowing down. While it's clear that the top three frontier labs do have an advantage, open models have caught up with and surpassed a lot of the second-tier labs that are still trying to position themselves at the frontier. But the biggest factor that I'm seeing is tokens. And this falls into sort of two parts. The first part is that as these models get so good, people want to be able to run them on their own machine for things like coding agents, for personal agents like OpenClaw, like Hermes Agent, etc. And those particular tasks, because they're using reasoning and because they're using agentic calls, are burning through tokens. If you're paying for a frontier lab model for that, you're going to find that you'd better have a decent budget. The days of the all-you-can-eat plans are being dramatically scaled back so that they're only supporting their own particular coding agents, etc. And as we're seeing the next generation of models being talked about, it's pretty clear that those are going to cost a lot more than what the current top-tier models cost. So, this combined with wanting to basically control your data means that you have to look at local AI solutions.
All right. So, what I'm going to do in this video is show you how the AMD system that I've been using is working and show you what's actually possible as we're sort of coming up to the middle of 2026, if you want to look at something like this to be your AI workhorse computer.
All right. So, the hardware itself here is definitely impressive. Every component, both the GPU and the CPU, is AMD here.
So, the CPU that I'm using is a Ryzen Threadripper 9980X.
This is definitely a powerful core of the overall system. The GPU is the Radeon AI Pro R9700 with 32 gigs of VRAM. It's really interesting to see how this actually performs on the modern AI stack. So, we're going to look at some things like Ollama and LM Studio for running LLMs.
We'll also look at some image generation and stuff like that. But I also want to dive into code and show you actually running things like the Transformers library, Unsloth for doing some fine-tuning, and even just running some straight-up PyTorch on this machine. So let's jump in and have a look at it.
Okay, so whenever I get a new GPU or a new system like this, the first thing I want to check out is the LLM stack. And generally, before I go into code, that's going to start with just simply Ollama and LM Studio. So that's one of the things I find people are running every day, and I find that you can kind of tell from the performance on that how things are actually going to go. So let's start with LM Studio. The setup is actually pretty smooth here. LM Studio now ships with a ROCm runtime. So you can point it at the runtime, restart, and it just sees your card. Now the cool thing with this is, using the 32 GB card here, I really don't have to make huge compromises on the quantizations. In fact, for all of the models that I'm installing, I'm literally just installing the quant that they basically recommend. Usually that's going to be 4-bit. If it's a smaller model, I can actually usually go for something that's 8-bit. Or if it's something like the Gemma 4 4B model, etc., I can go full resolution here. And you'll see that running this, this is the Qwen 3.6 mixture of experts here, I'm able to get very good token response rates out of this. It's able to return things back really quickly, and you can see that I'm averaging around about 160 tokens per second on this model, which is not only way faster than you can actually read, it's also a good speed for using with agents. So already this means that I've got access to a really cool model where I can basically turn the reasoning on or off. I can then use it for vision if I wanted to with this particular model. I can then even do things like bring in a document, have it load up, have it be able to set this up and go on the fly. And so I've got now the whole sort of chat with text going on. I'm getting out simple answers. And again, of course, the cool thing here is that this is running fully locally.
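As a rough back-of-the-envelope check on why a 32 GB card leaves room for the recommended quants, the weights alone take about (parameter count × bits per weight ÷ 8) bytes. The sketch below is an illustration only: the model sizes, the `headroom_gb` allowance for KV cache and runtime overhead, and the function names are assumptions, not figures from the video, and real quant formats usually carry a little more than their nominal bits per weight.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float = 32.0, headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom fit in the card's VRAM."""
    return weight_memory_gb(params_billions, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to illustrate the arithmetic.
    for name, params, bits in [("30B at 4-bit", 30, 4),
                               ("8B at 8-bit", 8, 8),
                               ("4B at 16-bit", 4, 16),
                               ("30B at 16-bit", 30, 16)]:
        size = weight_memory_gb(params, bits)
        print(f"{name}: ~{size:.1f} GB weights, fits in 32 GB: {fits_in_vram(params, bits)}")
```

Context length matters as well: a longer context window grows the KV cache, so the headroom needed in practice rises with the setting chosen in the app.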
And you can see at any point, if I want to, I can actually change the context window size in here. So, okay, you can see here I'm using around 64K, but I can come out and make that longer. And I'm still getting good speed from this model. And you can do all the same things in Ollama as well. In fact, lately with Ollama, they're also going down the road of OpenClaw and stuff like that. And you can actually run this whole thing locally with one of the models that has been optimized to do those kinds of agent tasks in here. So, for running LLMs out of the box, you're going to be able to run a lot of high-quality models the moment that you're finished downloading them in here.
All right, so now it's probably a good time to explain a little bit of what is going on underneath LM Studio and Ollama. And I'm actually talking about the layer that they sit on. So, this is the Radeon Open Compute Platform, also just commonly known as ROCm. And honestly, this is the thing that, 10 years ago when I was building a deep learning computer, made a lot of people kind of nervous. Back then, people would be seriously impressed by the AMD hardware, but the software compatibility was where the issue was. Now, I'm very happy to report this is just not an issue today.
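The tokens-per-second figure quoted for LM Studio can be checked on the Ollama side too. A non-streaming response from Ollama's local REST API includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so throughput is one division. This is a minimal sketch, assuming Ollama is listening on its default port 11434; the `sample` values are made up to illustrate the arithmetic, and the model name in the usage comment is a placeholder.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generated tokens divided by generation time (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generation request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Illustrative numbers only: 320 tokens generated in 2 seconds.
sample = {"eval_count": 320, "eval_duration": 2_000_000_000}

if __name__ == "__main__":
    print(f"{tokens_per_second(sample):.0f} tokens/s")
    # With a server running and a model pulled (placeholder name):
    # result = generate("<model-name>", "Explain ROCm in one sentence.")
    # print(tokens_per_second(result))
```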
Pretty much all the deep learning frameworks support ROCm and allow you to basically run and train models, not just inference, but being able to train models as well, on these cards like the Radeon 9700.
In fact, as I was initially setting this up, LM Studio ran out of the box, Ollama ran out of the box, and even PyTorch has official ROCm wheels, meaning that you can just go to the install selector, pick ROCm, copy the pip command, and it installs.
And the cool thing is that once you've got that installed, your existing PyTorch code, etc., mostly just runs fine out of the box. Things like the Transformers library and a lot of the other common sort of frameworks that people use with PyTorch are now fully compatible with ROCm. And AMD has clearly put a lot of work into both ROCm and HIP, their translation layer, to basically make all of this work. And like I mentioned, this is not just for inference. You can actually do full sort of fine-tuning and even training from scratch on this. In fact, Unsloth themselves have put out a guide to using Unsloth for fine-tuning LLMs on AMD GPUs. And generally, you'll see as long as you're using that ROCm-optimized version of PyTorch, you're not going to really run into a lot of issues for doing standard kind of workloads. This is going to work pretty much just straight out of the box. And the cool thing is if you do need to go deeper into ROCm, they've got a whole bunch of documentation in here, and you can see that just looking at the date, this is being updated quite often, and it's clearly a priority for AMD to make these GPUs work for the AI workloads people are doing. All right, another thing that people want to do often is to basically make generative images and generative videos. So, probably the best way to do this nowadays is with ComfyUI.
And if you actually come in to install ComfyUI, you can actually choose a ROCm version there, which will allow you to run this on the GPU that I've got going on here. You can see that very quickly, I can start to generate a bunch of different images to try out different things here. I can even take something like this. I can see, okay, I've got this nice picture of a cat. I want to see how much just changing the seed is going to affect this. And then I can just generate again, and you can see that it's generating pretty quickly for me to be able to actually see the difference. So, we can see that just changing the seed really hasn't changed the cat too much, but it's changed the angle that the cat is actually on there.
So, this is not just for doing images here. You can pick a bunch of different things in here. You can basically go for a text-to-image kinds of things. You can also do image-to-image stuff. You can go for models that generate video in here.
This is something that you can certainly try out. I can see a lot of these models are quite well known already, things like the LTX2 models, the Wan 2.2 models. I think the LTX 2.3 is in here somewhere as well. But the cool thing with this is once I've downloaded the weights, I can actually just set it to generate a bunch of different examples to look and see, okay, what it is that I actually want to generate out.
They've also got a limited number of audio models, and then image-to-3D models, etc. And of course, I can find other models to download. These are just some of the popular ones in here. So again, the AMD machine and the GPU are actually performing well here, allowing us to basically run ComfyUI, run decent-sized models. So, being able to generate these kinds of things is something that I'm finding the system actually works really well on.
Okay, so for me, where ROCm really shines is when you actually install Linux on your machine. So, of course, you can have WSL on Windows to basically have Linux, but if you make a dual-boot system where you've actually got Linux, you can actually then get full support from things like PyTorch, etc. So, you can see here that basically I've got a Linux machine, and I've got ROCm 7.2 running here. And you'll see that I just can't do that with Windows.
When I select Windows, I can't basically select this. And you'll see that once I've got that installed, I can actually import torch and I can actually see that, okay, I've got a full replacement for CUDA in here. So, you can see here, okay, that this is available. I get the device name, which is the Radeon graphics and stuff like that. And then once I've got that set up, I've now given PyTorch access to the GPU. So, I can actually come along and you can see here I'm training a model, just a simple CIFAR-10 model, where it's fully running on the GPU.
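The check described here can be sketched as follows. On ROCm builds of PyTorch the GPU is still reached through the `torch.cuda` namespace, and `torch.version.hip` is set instead of being `None`. This is not the exact cell from the video: the `describe_backend` helper and its dictionary keys are made up for illustration, and the import is guarded so the snippet also runs on a machine without PyTorch installed.

```python
try:
    import torch
except ImportError:  # lets the sketch run even where PyTorch is not installed
    torch = None


def describe_backend() -> dict:
    """Report whether PyTorch is installed, whether it sees a GPU, and which one."""
    info = {
        "torch_installed": torch is not None,
        "gpu_available": False,
        "hip_version": None,
        "device_name": None,
    }
    if torch is None:
        return info
    # A version string on ROCm builds, None on CUDA or CPU-only builds.
    info["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm builds reuse the torch.cuda API, so this is the same call as on NVIDIA.
    info["gpu_available"] = torch.cuda.is_available()
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_backend().items():
        print(f"{key}: {value}")
```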
And you can see it's basically trained a simple ResNet model on CIFAR-10 there. And I've actually set it up so that we've got a Gradio interface here with a small ResNet. And you can see that if I click on one of these, I can see okay, what's the prediction there. I can shuffle some examples. And it looks like it's doing reasonably well at least on those examples. And you can see here that now I'm basically just printing out one of the files where I'm loading one of the Gemma 4 models, running it with transformers locally.
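The CIFAR-10 training run mentioned earlier needs no ROCm-specific code, because the usual `"cuda"` device string works on a ROCm build of PyTorch. The following is a minimal sketch, not the script from the video: it swaps in a tiny convolutional network for the ResNet and random tensors shaped like CIFAR-10 batches (3×32×32 images, 10 classes) for the real dataset, so it runs without a download. The `train_steps` name and its defaults are assumptions for illustration.

```python
try:
    import torch
    from torch import nn
except ImportError:  # lets the sketch run even where PyTorch is not installed
    torch = None


def train_steps(steps: int = 3, batch_size: int = 8):
    """Run a few optimizer steps on random CIFAR-10-shaped data; return the losses."""
    if torch is None:
        return None
    # Same device string on NVIDIA and on ROCm builds of PyTorch.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Sequential(            # stand-in for the small ResNet
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, 10),
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(steps):
        # Random stand-ins for a batch of images and labels.
        images = torch.randn(batch_size, 3, 32, 32, device=device)
        labels = torch.randint(0, 10, (batch_size,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


if __name__ == "__main__":
    print(train_steps())
```

Swapping the random tensors for a real data loader and the small network for a ResNet changes nothing about the device handling.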
And if I want to run the actual Gemma model with a Gradio interface here, you can see it's basically got the device where it's actually running. It's loaded my model. And then I can just chat to it in here. And you can see that sure enough, it's going through generating very quickly with the transformers library in here. It's set to a limited number of tokens out. But you can see that okay, this is generating quite well here. We can actually change the tokens out. So, this is just doing it with the purely transformers library. If we wanted to, we could also do it with vLLM and serve things out of there. So, the cool thing about having the Linux environment here is that we've got multiple ways that we could do this now.
We don't have to just do it in something like LM Studio. This is using the full resolution version of the Gemma 4 model here. And we can use the Transformers library to run this for our own agents, etc. And of course, the cool thing then is this is running fully locally here.
So, just to finish up, this setup and this GPU have really taken everything that I've thrown at them and everything's been able to work really well. It really does show how ROCm plus the hardware have come a long way in enabling you to run deep learning workloads and AI workloads, whether that's LLMs, whether that's generative images or videos, etc. Let me know in the comments if you've looked at these or even perhaps some of the higher-end AMD GPUs. As we're moving to a world where I really think you want to have some kind of local AI solution, even if you are still using APIs and frontier models, this is a way where you can run a whole bunch of different types of AI apps and be able to get good results. So, let me know in the comments if you've had a look at this. As always, if you like the video, please click like and subscribe and I will talk to you in the next video. Bye for now.