Install our extension to search inside any video instantly.

The Secret Controls for your LLM: Temperature, Top-K, Top-P, etc
Added: 2026-05-28

547 views7114:51GaryExplainsOriginal Release: 2026-05-27

Gary effectively demystifies the "black box" of AI by translating complex statistical sampling into practical engineering levers. It is a concise masterclass in turning stochastic unpredictability into a controlled, functional tool.

[00:00:00]Now, hang around in the community that talks about local LLM's long enough and you'll start to hear words like temperature or top K and other funny little words like that. And these are called the inference parameters. Now, when you're running a local model, you can actually tweak those parameters. And that's what this video is about. So, if you want to find out more, please let me explain.

[00:00:25]Okay, then. Let's jump into this. The inference parameters, temperature, top P, top K, and so on. We'll get into what all that's about. Now, what's the big picture?

[00:00:34]When a large language model generates text, it's actually predicting what the next token should be based on probability.

[00:00:44]So, for example, if you give it the sky is then there's going to be a high probability that the next word should be blue.

[00:00:52]Cloudy is also a possibility, a reasonable one. Beautiful is also a possibility, but less so. And the sky is banana, the sky is, you know, a forklift truck, you know, none of those things are are good. So, the the probability of those is much much smaller. Now, the inference parameters influence how much randomness is allowed, whether unlikely tokens are permitted at all, whether repeating tokens uh is penalized, and whether creativity is encouraged. So, because it's based on a set of possible tokens that come next, you can manage how those tokens, that list of tokens, is treated, and therefore you can modify the result. So, different types of tasks require different sets of parameters. So, creative writing uh needs to be diverse.

[00:01:47]It needs to explore different types of things because you're you're you're doing creative writing. Whereas coding much more precise, much more deterministic.

[00:01:58]When you're doing thinking, then you need broad exploration of different ideas, but it needs to make sense. And if you're doing instruction following, then you need something that is direct and concise. And the the temperature and these other parameters can modify the output and therefore fit into some of these different groups. So, if you're running something like LM Studio, also Ollama, also the other big kind of local LLM inference packages, you can change those parameters for each model.

[00:02:31]So, here on the right-hand side, we see the settings that you can fiddle with inside of LM Studio. And you've got the temperature, you've got the top K, you've got the top P. You've got the min P, which at the moment don't mean anything, but we're going to go into those and you can fiddle with all of these. There's sliders, numbers you can change, which will change the output of the large language model. So, what is temperature? Let's start there. So, temperature controls randomness.

[00:02:58]So, if you want something that is more deterministic, more logical, less creative, then you have a lower temperature.

[00:03:08]If you have a higher temperature, you get things that are more creative, more surprising, but there's higher risk because it could be not the right thing to write next. The token could be the wrong token.

[00:03:22]Now, the model has internal probabilities, and temperature reshapes that. Low temperatures are more confident, high temperatures are more exploratory. So, temperature is an easy control in that you can kind of nudge it up or down a bit, and you're basically fiddling with those probabilities of what should come next. Now, all this talk of AI brings me to the sponsor of today's video, that's GenSpark. Now, here at Gary Explains, I like to explain things. It's even in the title, of course. Now, maybe something not everyone appreciates is how much time it takes to make these videos. I've got to start with research, understanding the subject so I'm able to present it. Then of course there's the materials that I need, and then there's the visuals. Then of course I actually have to record the video and then edit the video.

[00:04:10]And so in fact anything that I can do to reduce that amount of time is most welcome. And that's where GenSpark comes in. GenSpark is an all-in-one AI workspace that can help you get things done. It independently researches, creates, and executes complex tasks across multiple domains, including presentations, data analysis, brand asset design, team collaboration, meeting notes, and call automation. Now let's say I wanted to research, I don't know, let's say lambda functions in Python or something like that, and then I wanted to make some slides and some visuals to go along with that. Now of course I could just do all the research myself manually, searching on the web and finding examples, or I could even use a a chatbot, an AI chatbot, and ask it some questions. Or with GenSpark, I can ask it to do the research and create those slides, and create the visuals all in one place. I can add more visuals if needed using the AI image tool, and from there I can review the slides and tweak them. If I need more than just images, there's also an AI designer tool, which can handle everything from social media to home design. Now the example I've given you is a really easy thing for GenSpark to do. In fact, it can do other things like run its own claw agent, or develop full stack apps using its AI developer, and so much more. So GenSpark doesn't replace thinking, it just replaces the mundane work, allowing me to concentrate on the quality of the video. Now GenSpark is proving to be popular. In fact, it went from zero to 250 million dollars in revenue in just its first year.

[00:05:49]GenSpark is offering unlimited use of its AI chat and AI image user for all paid users in 2026.

[00:05:57]Plus, it's now offering a great get-started bonus. New users can try premium features like the AI web app builder and deep research for free. So, go give it a try. The link is in the description. Okay, now back to inference parameters. But, there are more ways to do it. So, there is the top P which, instead of considering all possible words, top P says, "Only look at the smallest group of words that together add up to at least a percent of the total probability." And that's what the the P is.

[00:06:32]So, you take uh you know, the the nucleus, the core ones that you want really, and you ignore the tail end ones.

[00:06:41]And then from that kind of paired-down list, you then pick one randomly. So, if you have a top P of 0.9, which is quite common, then that means that you take the top 90%. So, if we had blue was 40%, cloudy was 25%, beautiful was 15%. Well, you add all those up, you get 80%. So, there might be the next word after that, whatever it was, and that will give you your 90%, and then you pick one from there. If your top P is much lower, let's say it's a 0.5, then you're reducing possible list. So, you're only saying, "Actually, you want the 50%," which in this case would just be two, blue and cloudy. And everything else would get ignored. So, if you lower it, then you're getting less possible tokens for the next output. Now, top K limits generation to the top K most tokens.

[00:07:35]So, for example, if you've got a top K of 20, it will only consider 20 of the list of possible tokens. It it'll give you only those top 20. Everything else will be cut off and that's regardless of what their probability is. So, it has a similar effect. You reduce the number of possible tokens you want to come out next, but using a different method. Now, you're just kind of using a hard nose one. So, it's a fixed count rather than a probability curve. If you have a low top K, then that means safer, but it will be more repetitive, but it's more deterministic. A higher one will be more creative, but less stable. So, every time you reduce the what's the next possible token, then you're going to, you know, change how the output is. So, min P filters out tokens below a minimum relative probability. So, let's suppose the best token you've got to come up next is 50%, okay, but you've got a min P of 0.1, then any token below 0.

[00:08:40]05, that's 50% 0.5 * 0.1 will get removed. So, it dynamically removes very unlikely garbage tokens.

[00:08:50]It's useful because top K can still include weird tokens. For example, you know, if you say 20, well, the bottom five of those might be really weird and top P can sometimes allow junk tails.

[00:09:03]So, min P adds an adaptive quality filter by saying anything below this, I don't want to have it anyway. It's just gone no matter what else is going on, we don't want it. And there's also the other way you can punish tokens in a certain way. So, presence penalty punishes tokens, topics that have already appeared. This encourages new tokens and new ideas. So, without the penalty, you might get the cat sat on the cat because the cat. So, you can see cat here is just coming up again and again, but when you penalize cat for the first time that it's been appeared, then you get the cats on the chair while the dog watched and so on. So, you you don't get just the same thing coming up. And again, because once it's already been there, you can say, "No, we don't want that anymore." And then repetition penalty is similar, but it directly penalizes the previously generated tokens. So, unlike presence penalty, it works token by token, not on a kind of concept or a topic level. It lowers the probability of any word that has already appeared.

[00:10:07]The more times a word has been used, the stronger the punishment in some versions. So, they they're all there, and obviously you can play around with those, but of course, each model comes with its kind of recommended numbers that the people who trained the model know about. So, if you look at Google Gemma 4 E4B, so effective 4 billion parameters, temperature of 1, top P of 95, and a top K of 64. So, it's only the top 64 tokens, and you've got you only want to deal with the ones that are in the top 95% accumulative when you add up all the tokens. And that's the way it works, and and it works very well according to that.

[00:10:48]When you look at something like Qwen 3.6, the 35 billion one, they offer you different settings depending on what you want to do. So, if you're in thinking mode, general tasks, then temperature again of 1, top P of 0.95, top K of 20, and so on. We'll go into these in a bit more in a second. But here it says, if you want to do coding, then you need to lower the temperature, you want less randomness. But it still has that same top P of 0.95, same top K. I've got a table in a minute that will show all this. If you're looking for instruct or a non-thinking mode, then the temperature they're saying set it to 0.7. You only want the top 80% of the tokens and so on. Let's put all this in a table. So, we can see here that the temperature is something that they change that they recommend depending on what task it is you're doing.

[00:11:37]The top 95% gets changed to 80 when you want to do non-thinking mode. Always the top 20 tokens. Min P is okay. The P penalty, the presence penalty, is set down to zero for coding because in this case they're saying, well, coding is often a lot of brackets and equals and semicolons that come up in time and time again. So, that's how they're fiddling with that. But, the rate of penalty remains the same. So, that's what they're saying you can actually tweak them. Now, that means if you're running a local LLM, you can tweak these yourself and and, you know, have a play around and see what you different results you can get. Now, I did some videos recently about using LLMs for coding. Two videos, in fact. Are local LLMs any good for Codex or Claude code?

[00:12:22]And can you replace Claude code and Codex with open code and a local LLM? So, I did two videos exploring the same ideas, local models for doing coding.

[00:12:35]Now, I used the default parameters when I used Qwen 3.6. Not the only model I used. I used different models. But, when I used Qwen 3.6, I used those default parameters of temperature of one, top P of 0.95, presence penalty of 1.5. Some people noted in the comments and some of them quite rudely, which was quite a shock, really, that I should be using the coding settings for better results. Now, that's a fair point. I don't know why people have to be rude about it, but that's a fair point. So, I re-ran some of my tests using Alibaba's recommendations for a coding situation.

[00:13:14]Now, what's interesting is I was hoping for better results. Cuz as you if you've watched those two videos, you'll know that actually a lot of those tests failed because you couldn't you can't do this coding with a local. And lots of people in the comments say, "Oh, yeah, but if you use the right code, you can do amazing things."

[00:13:31]I ran the tests again, and the results were worse.

[00:13:35]The results were worse. Open coding Quen 3.6 failed to create a working project, which they actually managed when I had the the original uh settings for temperature and so on. So, what's the lesson to be learned here? The lesson to be learned here is, first of all, don't be rude in the comments, of course.

[00:13:52]Secondly, is that you need to experiment, and not necessarily, this isn't magic. You can't just tweak it and say, "This will make it more creative.

[00:13:59]This will make it more better at coding." Because Gemma, for example, says use those settings across all different uh coding uh different tasks.

[00:14:09]Whereas, it was Quen that used used the different ones, but it didn't actually necessarily work out as I expected. So, the lesson is, you know, there is no magic. You don't just say, "Tweak this number, and everything will get better."

[00:14:20]Because that's not necessarily true.

[00:14:23]Okay, that's it. Inference parameters, temperature, and so on. Love to hear your thoughts in the comments below.

[00:14:28]Okay, that's it. I'll see you in the next one.

#Gary Explains #Tech #Explanation #Tutorial #local LLM

Related Videos

Artificial Intelligence

OpenHuman VS Hermes AI: Who Wins?

JulianGoldieSEO

285 views•2026-05-29

Artificial Intelligence

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

Artificial Intelligence

5 Mind Blowing Omni Uses Cases

PaulJLipsky

1K views•2026-06-02

Artificial Intelligence

This computer is made from real human brain cells. And you can buy it.

Talktmsmedia

3K views•2026-05-28

Artificial Intelligence

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

Artificial Intelligence

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Artificial Intelligence

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

Artificial Intelligence

I Tested NEW Opus 4.8 on Four Projects (Updated LLM Leaderboard)

AICodingDaily

298 views•2026-05-29

Trending

Revisiting The Cat Cafe For The Final Time

BenGtalks

3195K views•2026-05-29

Lil bro is a menace 🤣

NotAirJordan

2037K views•2026-05-31

The Casino Had Us Guessing All Day

VegasMatt

157K views•2026-06-03

Political Science

My response to the Police

RecklessBen

1496K views•2026-06-01