
New Google Gemma 4 on old V100 Server Cards (it's slow)
Added: 2026-05-05

109 views · 8 · 16:27 · LinuxMan86 · Original Release: 2026-04-21

This demonstration proves that running massive models on aging hardware is more an exercise in patience than a viable solution for real-world tasks. It highlights the brutal reality that for local LLMs, hardware limitations remain the ultimate bottleneck to actual productivity.

[00:00:00]So, the new Google Gemma stuff is out now, and it still works on my V100 cards.

[00:00:08]Compatibility is not that big of an issue yet. Currently, I'm running Gemma 4 31B Q8_0.

[00:00:18]Those are things. Basically, it's a less heavily quantized version and it's running at 8-bit, even though my cards really should only be supporting 16-bit.

[00:00:28]So, there's a lot of slowness happening here. I do have it running on my server, but I don't use LM Link because it's really troubling. It shouldn't need an account of any sort to be able to contact something that's on my local network. I just don't like where that sort of thing is heading. So, what I'm doing today is I'm going to show you the procedure to update through a console using SSH, which is skippable. Then I'm going to show how the full-fat version that I'm running works. Then I'll show you how to install different versions with extra little arguments, and then just how I feel about Gemma 4. Now, I'm not going to tell you how to run SSH.

[00:01:11]That's kind of outside of the scope of this video. That's stuff that you can look up elsewhere. The tutorials that we're looking at today are from the lmstudio.ai docs. First, you're going to want to do lms daemon down, which, because we don't have a fancy little X to click on in a window, is what shuts it down. Then we'll run lms daemon update.

[00:01:29]Then we'll run lms daemon up. But that's not the whole thing. If you type in lms -h, you'll get a whole bunch of extra commands we can run. Specifically, we're going to look inside of runtime to update the runtime. Otherwise, you won't be able to download successfully. And if you do manage to download successfully, which I was able to brute force my way through, it still won't run. And if we do lms runtime, we get even more commands. And update is the one that we're going to want to run. Looking through the history of my use of runtime shows that I also specifically ran an update to update my CUDA stuff. So, if yours still doesn't work, this may be the sort of thing you'll want to look at. Loading up the model, just seeing the amount of memory that this big boy takes, that's just a lot. Looking at the version that I have, that's 33 to 34 GB.

[00:02:18]Once fully loaded though, it takes 36 GB. However, just to let you know, you do need more than 36 GB to be able to run these sorts of things because this is only just the model. The chat, on the other hand, does take up more. lms chat.
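The update sequence described above can be written down as a small script. This is only a sketch: the four commands are taken from the video as spoken, the function name run_update is made up for illustration, and whether lms runtime update needs extra arguments (for example, to pick the CUDA runtime) is not shown in the video, so check lms -h on your own install before running it for real. By default it only prints what it would run.

```python
import shlex
import subprocess

# Command order as described in the video (assumed spellings; verify with `lms -h`).
UPDATE_STEPS = [
    ["lms", "daemon", "down"],     # stop the headless service
    ["lms", "daemon", "update"],   # update the service itself
    ["lms", "daemon", "up"],       # start it again
    ["lms", "runtime", "update"],  # update the inference runtime
]


def run_update(dry_run=True):
    """Run (or just print) each step in order and return the command lines."""
    done = []
    for cmd in UPDATE_STEPS:
        line = shlex.join(cmd)
        if dry_run:
            print("would run:", line)
        else:
            # check=True stops the sequence if any step fails.
            subprocess.run(cmd, check=True)
        done.append(line)
    return done
```

Calling run_update() prints the plan; run_update(dry_run=False) would actually execute it.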
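As a rough sanity check on those sizes, here is a back-of-the-envelope estimate. It assumes the Q8_0 layout used by GGUF files (blocks of 32 weights stored as 8-bit integers plus one 16-bit scale per block, so 34 bytes per 32 weights) and a nominal 31 billion parameters. The real file will differ somewhat, because the exact parameter count is not stated in the video and some tensors may be stored in other formats.

```python
def q8_0_weight_bytes(n_params, block_size=32, scale_bytes=2):
    """Estimate weight storage for a Q8_0-style layout.

    Each block holds `block_size` one-byte quantized weights plus a
    `scale_bytes`-byte scale, i.e. 34 bytes per 32 weights by default.
    """
    blocks = n_params / block_size
    return blocks * (block_size + scale_bytes)


# Nominal 31B parameters (an assumption taken from the model name).
estimate_gb = q8_0_weight_bytes(31e9) / 1e9
```

With these assumptions the estimate lands just under 33 GB, which is in line with the 33 to 34 GB quoted for the file. The extra memory once loaded, and the further memory used by the chat, is context and runtime overhead on top of the weights.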

[00:02:32]And as you can see here, the model is already loaded for us. Now, if you know me from my main channel, you'll know that I do a lot of coding stuff, but I don't use LLMs to generate code cuz if I don't know what it is that I'm looking at, I'm not going to be able to pick up all of the bugs that they make. So, what I've got here is just a request for it to teach me how to make a simple bash script, which is something I'm trying to learn. The time to first token does take quite a while. And then the inference itself is not really that fast. I have seen much faster ones, though I have also seen slower ones. It's doing okay.

[00:03:05]It's a hell of a lot faster than what I can read at though. Also, at the moment, this is during the reasoning phase.

[00:03:14]And now we're getting to the actual inference itself. And as expected, it's about the same. You're just generally getting an idea of how this sort of thing would run on a V100.

[00:03:26]And I have my result. Now, unlike the GUI version, I don't have an automatic readout of how many tokens per second we were getting. I'm feeling it's not that high. I can just go in and I don't have to type anything. I can just go up and down with the keyboard and then put in stats. That's 4.75 tokens a second with a time to first token of 2.1 seconds.
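To get a feel for what those two numbers mean in wall-clock time, here is a small sketch. The 500-token reply length is an arbitrary example and not something measured in the video, and the model assumed here is the simplest one: wait for the first token, then generate at a constant rate.

```python
def total_seconds(tokens, tokens_per_second, ttft_seconds):
    """Wall-clock time for one reply: time to first token plus generation time."""
    return ttft_seconds + tokens / tokens_per_second


# Stats from the video: 4.75 tokens/s, 2.1 s to first token.
# 500 tokens is a hypothetical reply length.
example = total_seconds(500, 4.75, 2.1)
```

Under those assumptions a 500-token reply takes a bit under two minutes, and almost all of that is generation time rather than time to first token.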

[00:03:46]Yeah. So, now we're talking about getting [snorts] very slow. However, for my purposes, this is actually not too bad. Let's kick it up a notch. I've asked the LLM to come up with some tricky questions. And now, let's run three of them at the same time. And just looking at it on the surface, they don't seem to be moving measurably slower. We're going to have to compare it to the results from before.

[00:04:10]As for the power usage, this is not super accurate. I feel that the polling rate is just not doing what I would hope and the power fluctuations seem to be really massive, but it doesn't seem terrible. A lot of the time these numbers are quite low. Now, if you're wondering about FP8 and 8-bit stuff, that's to do with compression and with how the inference itself is run. If I don't have FP8, but only have FP16, which is floating point 16, well, it takes the 8-bit and then re-encodes it, and it's a lot slower this way. If all the training and whatnot was done in FP16 for FP16, this would be a lot faster.

[00:04:51]Oh, damn. Okay, that took a really long time. And we reached the end of our token limit. And unfortunately, if I try to look up stats, it has no previous generation stats available. Huh, well, let's run it again, except with not really tricky questions. Time to first token is seemingly not really any different, and the inference is not any slower or faster, considering the difference in the difficulty of the questions. And now we're getting to the actual response stage and they seem to be doing okay. Whoa. Now that one of them is finished, the other two are going a lot faster. Holy, that's a lot faster.
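The "takes the 8-bit and then re-encodes it" step can be illustrated with a toy example. This sketch assumes a Q8_0-style scheme, where weights are stored as 8-bit integers with one scale per block and have to be multiplied back out to floating point before a card without native 8-bit math can use them. The function names are made up for illustration, and real runtimes do this in GPU kernels, not in Python.

```python
def quantize_block(values):
    """Store a block of floats as (scale, list of ints in [-127, 127])."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return 0.0, [0] * len(values)
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """The extra work at inference time: turn the 8-bit ints back into floats."""
    return [q * scale for q in quants]
```

Every weight goes through dequantize_block (or its GPU equivalent) before it is used, which is extra work compared with weights that are already stored in the format the hardware computes in.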

[00:05:30]Okay. Well, when one ends, let's see what the other one does in terms of speed. Okay, now we're on to the final one. And look at it go. Okay, well, the difference between running three and running two is much greater than the difference between running two and running one. This is the one that ended first, so it'll have the slower stats.

[00:05:50]4.71 tokens a second, and I was able to keep up with what it was writing and was kind of waiting for it. Now, that's significant because I'm dyslexic and reading is very slow. This one finished second, so it's going to have a little bit more of a variance with just that short amount of time. It went all the way up to 5.46. God, it's been so long.

[00:06:10]I don't remember what the original one did. And then the final one, 5.93. I'm just going to run one again on its own so I can remember what it was like. The time to first token is really long right now. I wonder what's happened. Well, it's going now. And it is fairly expedient. Yeah, 7.68. But the time to first token was incredibly long. Not sure what happened there.

Now I'm going to have a look at this one that went into Lua coding. On a cursory check, it does use require. We have proper declaration of variables with all of these lines. Proper use of syntax when it comes to tables. And then proper use here, kind of, maybe, except this doesn't need to be in square brackets and it doesn't need to be in quotes. It can just be enemy. This looks to be a self-iterating table iterator. We'll have a look at this in a second. For now, it loads into here. This is the read argument. I don't know about putting error in. And there doesn't seem to be any sort of return sort of situation. So once it goes past here, then it gets to actually parsing it, which could be a problem if it doesn't actually exist. It does remember to close the file afterwards. Then it does use JSON decode, so that's good. It is also connecting the variable to the right part. Then we're sorting through it here, checking a table type to remove things. So enemy is true to remove, and obstacle is true to remove. We are comparing it to the type of table that it would assume to be in here. It looks to be an array and then tables inside.
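One way to read the 4.71 figure: if all three requests really were generating side by side for the whole life of the first one to finish, the server as a whole was producing roughly three times that. A rough sketch, assuming the per-request stat is a plain average over generation time:

```python
def aggregate_rate(per_stream_rate, streams):
    """Approximate combined tokens/s when `streams` requests run side by side."""
    return per_stream_rate * streams


single = 4.75                        # the earlier single-request run
combined = aggregate_rate(4.71, 3)   # three requests at once
speedup = combined / single
```

On those assumptions the combined rate is about 14 tokens a second, close to three times the earlier single-request figure, which would fit the impression that running three at once barely slowed each one down.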

[00:07:41]So the way in which it would do this, it would say that this would be the underscore, this would be item. Then it compares it to the table which was written up above and item.type. So if each individual line is item, then type would be this inside of it. That's why they've got .type there. Now I see. And if it's not in this table of things to remove, then table insert filtered items, which is a newly created table, with item, which should be this whole entire line.

[00:08:09]That's done fairly well with going back and forth between languages. Then it does the XML section with the table-to-XML step, which is a function that they made up here. And this looks to be some sort of legend for what to do. And root. And now I'm remembering that I wrote the wrong thing. I didn't mean XML. I meant CSV. Garbage in, garbage out. Currently at [clears throat] the moment I'm working in about seven different languages or something like that. Like, don't blame me for making mistakes. Then I suppose you could also say the same about LLMs. But at least I don't have to worry about its feelings. At least I can see that it's got concatenation correct.
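The filtering step just described can be sketched in Python for readers who don't know Lua. This is not the code from the video: the sample data shape and the names REMOVE_TYPES and filter_items are invented for illustration, based only on the description of a lookup table with enemy and obstacle set to true.

```python
# Lookup table of types to drop, mirroring the "enemy is true, obstacle is true" table.
REMOVE_TYPES = {"enemy": True, "obstacle": True}


def filter_items(items):
    """Keep every item whose "type" is not listed in REMOVE_TYPES."""
    filtered = []
    for _, item in enumerate(items):  # the underscore is the unused index
        if not REMOVE_TYPES.get(item.get("type")):
            filtered.append(item)     # the whole item (the "whole entire line") is kept
    return filtered
```

Using a lookup table instead of a chain of comparisons means adding another type to remove is a one-line change.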

[00:08:46]It does seem to be sorting through things properly. If this is a table, then it'll go through itself again. So, it's doing a really nice job of being recursive. This K and V call which comes from items which is a predefined string and not any way related to this. And then filter data is a table. So, that table entry K is the predefined string and this is a table. But then it will keep going through here until it finds things that are not table entries. And it looks like it is adding to itself.
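For comparison, here is a minimal Python sketch of the same recursive idea: nested tables call the function again, anything else is converted to a string, and the whole result is built by concatenation into one string with no line breaks. It is an illustration, not the model's Lua output. The "item" tag for list entries and the escaping of special characters are choices made for this sketch.

```python
from xml.sax.saxutils import escape


def table_to_xml(value, tag):
    """Recursively turn nested dicts/lists into one XML string (no newlines)."""
    if isinstance(value, dict):
        # A nested table: recurse once per key/value pair.
        inner = "".join(table_to_xml(v, str(k)) for k, v in value.items())
    elif isinstance(value, list):
        # An array: each entry gets a generic tag (an assumption of this sketch).
        inner = "".join(table_to_xml(v, "item") for v in value)
    else:
        # A leaf value: convert to a string, like tostring on V.
        inner = escape(str(value))
    return "<" + tag + ">" + inner + "</" + tag + ">"
```

For example, table_to_xml({"a": 1, "b": {"c": "x"}}, "root") walks into the nested table under "b" before closing the root element.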

[00:09:15]There's nothing about adding new lines. So, this is going to be an entirely one-string deal. And it also knows that V should probably be done with tostring, and K, which they wrote themselves, doesn't need to be made into a string, because sometimes they're numbers and sometimes they are strings already. And just like they started with this, they have properly ended the XML like this. Then it returns the XML, which is a whole string basically. It then comes back in as this being a whole line, not super human-readable, but it works. Then it finds the output location. It writes it and then closes it. However, if there was any sort of mistake, this is very presumptuous.

But I've got to say, this is the first time I think I have ever seen anything written by an AI that I don't have to instantly start poking holes in. And it only went through one iteration. I didn't have to send the agent off to do a thing and iterate over and over and over again. This is actually pretty clean. However, if I didn't know what I was looking at already, I probably would have missed the fact that I had made a mistake, and then this would have given erroneous results. So, it still pays to know what it is that you're asking for before you ask for it. Basically, all you should really want is a shortcut. Unless you're a slop producer like Microsoft and you don't really care. It's more about expedience and getting out features as opposed to actually having things work properly. Really, the only thing that was questionable at all was the fact that they had this set up. This doesn't need to be anything, even a string. So close.

Now, the final thing is to run different versions and different quantization levels to see how it runs on my inference server. When you're looking at the models, you've got Gemma 4 31B, and you've got Gemma 4 E4B, E2B, and 26B. I believe that is just in relation to the type of compression, not the floating point stuff, but they are kind of linked. However, there's more to it inside of here. Now, if we click on this drop-down menu, you see that there are three different versions. If we're in lms chat, though, and go to download, you see that there are only the main four versions and not the four versions plus each one with their own iteration. So, I'm going to cancel out of this with Ctrl+C.

[00:11:31]And here you can see that I've already used this to be able to get the exact model that I wanted. This time we're going to go with E2B. And once again in the lmstudio.ai docs, we're going to have a look at specific quantizations for the E2B, which is the one I'm going to try. There's Q4_K_M, Q6_K, and Q8_0. I'm going to specify Q6, capital K: Q6_K. Then it'll start to download.

[00:12:00]I do find that this finalizing step does take quite a bit of time sometimes. Then normally what you would do is an lms unload, but I want to see what happens if I try to load another model and try our E2B version. And we're loading inside of lms chat this time. It's trying to load at the same time.

[00:12:17]Interesting. I've not tried this before.

[00:12:18]Oh, it also loaded a hell of a lot quicker, too. I'm going to ask it that Lua question that I asked earlier, except this time with CSV instead. Its speed is actually fairly decent. The time to the first token wasn't bad either. Now we're into the inference step. Our power usage is very low on this particular model as well. As we can see, it's not really spiking above 20.

[00:12:39]As for the GPU utilization, it's not doing bad either. I always assumed that more quantization away from what the GPU is natively capable of would make it slower, but I suppose really what it is is just the size of the model is what makes it slower. And our stats are, oh, a few more tokens per second.

[00:12:55]Previously, we were getting about seven.

[00:12:57]Now, we're up to about 9.3. Our time to first token was a little bit slower than the average at least, except that one weird one. This doesn't seem so bad. And here it's telling us a few different versions of things that we could get.

[00:13:10]And instead of telling us how to install this like the last one here, it's just showing a straight up terminal command to install this thing. Then we have our variable set up correctly. Then we've got a loading function. It opens the file up. It can then chuck an error.

[00:13:25]Then, wait, I've never seen that on file read. I've also never seen this written like this before. However, it is making sure that there is a table getting back anyway. So, at least there's some sort of error protection here in this version. Then down a little bit further, it looks like it's iterating through every table section here, which is this block, this block, and this block, and checking the part inside of that block called status, which would be these. If it's equal to active, like this one's active and this one's inactive, it then inserts it into the filter list. Then it returns the filter list. Then, in writing the CSV logic, the first thing it does is check to make sure that this has any sort of length, and if it doesn't, then it goes to return, which would mean that it doesn't do the rest of this, like create an output file, bring in the file name, make it write. That's good. If not output file, then error. Uh, that's not correct. io.open should in fact actually create the file when using w.

[00:14:30]However, I suppose if it's trying to do this on an unwritable drive or directory, maybe that's where the error could be. Then we go in and put the headers in. Then they do create new lines here. So that's good. Does some table entry stuff and creates another new line at the end of every line.
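The same distinction shows up in Python, which makes a handy way to see it in action: opening a file for writing creates the file if it's missing, but it still fails if the directory isn't there or can't be written to. This is a Python analogue, not the Lua from the video, and the helper names are made up for the sketch.

```python
import os
import tempfile


def try_open_for_write(path):
    """Return None on success, or the error message if the file can't be opened."""
    try:
        with open(path, "w") as f:  # "w" creates the file when it doesn't exist
            f.write("ok\n")
        return None
    except OSError as err:          # e.g. missing directory or no write permission
        return str(err)


def demo():
    """Try one writable path and one path inside a directory that doesn't exist."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "out.csv")
        bad = os.path.join(tmp, "missing_dir", "out.csv")
        return try_open_for_write(good), try_open_for_write(bad)
```

So the error check on the output file is not wasted: it covers the cases where creating the file is impossible, even though a merely missing file is not one of them.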

[00:14:47]However, it's doing write an awful lot of times. This is probably not the efficient way in which I'd do this. I'd probably create a new table here somewhere, like local table is equal to this. Then I would introduce it into a table every time with this. And then I would move this stuff outside of the loop. Like, yeah, I wouldn't be using the write function quite so much. But then it also closes it. So that's good.

[00:15:15]And we've got the main execution, apparently. All right. Concatenated properly, with length, with proper spacing as well. So it's human-readable. Then filter data and then write CSV. And all of these functions are linked. This is a pretty good model. Huh. This is actually getting scary good. Now, would I use it to blindly write things for me?
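The restructuring suggested here, collecting the lines in a table first and writing once outside the loop, looks like this in a Python sketch. The CountingWriter class only exists to count calls for the comparison, and the row data is hypothetical. There is no CSV quoting, so this only suits values without commas or quotes.

```python
class CountingWriter:
    """Stand-in for a file object that counts how often write() is called."""

    def __init__(self):
        self.parts = []
        self.calls = 0

    def write(self, text):
        self.calls += 1
        self.parts.append(text)
        return len(text)

    def getvalue(self):
        return "".join(self.parts)


def write_rows_many_calls(rows, headers, out):
    """One write per field, comma, and newline (the style criticized above)."""
    out.write(",".join(headers) + "\n")
    for row in rows:
        for i, h in enumerate(headers):
            if i > 0:
                out.write(",")
            out.write(str(row[h]))
        out.write("\n")


def write_rows_one_call(rows, headers, out):
    """Collect every line first, then write once outside the loop."""
    lines = [",".join(headers)]
    for row in rows:
        lines.append(",".join(str(row[h]) for h in headers))
    out.write("\n".join(lines) + "\n")
```

Both functions produce the same text. How much the difference matters in practice depends on how the file is buffered, so treat this as a tidiness improvement first and a speed improvement second.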

[00:15:37]Unlikely. It has had errors in the past.

[00:15:40]I have done one little test with it and it did have an error in there. I don't remember what the error was though, but this is doing pretty well. I do think in the future there will be a time where we don't really need to write code or anything. We can just completely vibe code ourselves. I don't think we're there yet. And the problem is that some people think that we are and they put out so much slop into the world that causes problems for everyone else. If you want to do vibe coding for your own little projects that you're not really sharing out into the world, that I think is fine. But if you start sharing it, that becomes a problem for other people.

[00:16:22]Anyway, I hope you've enjoyed this video. Next time we might go cheaper.
