安装我们的扩展，即时搜索任意视频内容

Run Hermes Agent With Local Models in 10 Minutes
本站添加: 2026-06-03

103 观看616:15nemanja-mirkovic原视频发布: 2026-06-03

Local AI models offer privacy, cost savings, and offline capability but are constrained by hardware; for Hermes Agent, Qwen 3.6 27B/35B are recommended models, with quantization (4-bit) reducing VRAM requirements from 55GB to 17GB while maintaining good performance; Ollama is ideal for testing while vLLM suits production environments; local models can achieve 85-95% of frontier model quality consistently without throttling, making them suitable for privacy-sensitive, compliance-driven, or cost-sensitive workloads, with a hybrid approach using frontier models as orchestrators and local models as workers providing optimal results.

[00:00:00]Locally, AI models are a rabbit hole.

[00:00:02]You can spend weeks comparing model sizes, GPU requirements, quantizations, and not get anything useful for actually running your business. That's why I'm keeping it practical for this video.

[00:00:12]I'll show you how to set up Hermez Agent with local models, what actually works, what you should not waste time on, and whether local models actually make sense for your use case and your business. The reasons are very simple. Your data and context remains on your machine locally and private. It's relatively speaking free. You're not paying per token, you don't have any limits, but you're limited and constrained by your hardware, and you have to actually have the hardware or buy the hardware. And you can also run it offline without internet, but that's a bit of a niche use case. But obviously, local models have downsides, and you cannot expect Open Source GPT 5.5 level of quality or code output. And we're actually going to compare GPT 5.5 to local models and see how the output is and what the quality is like, so you can judge for yourself.

[00:00:59]First, I'm going to give you a crash course on all the terminology, so you're not confused when you get into it, but I'll keep it high-level so that we don't waste too much time on this. First, it's video RAM or VRAM, and the best way to explain this and the best analogy I found is let's imagine each AI model, local model, as being one box. Certain [snorts] models are smaller boxes, certain models are larger boxes. And to fit those boxes, we need to fit them in a car. And you can go about it in two ways. You can have a GPU in your normal PC or a rack, and that's a very fast car. Let's call it Lamborghini. You can drive very fast, but you cannot fit a lot of boxes inside of Lamborghini. On the other hand, you can have a van, which is an analogy for an Apple device, which has up to 512 GB of unified memory. So, you can fit a lot of these boxes, but you cannot go as fast as the Lamborghini. That's just the best analogy I found and I hope it was clear.

[00:02:00]The next thing we are going to discuss is which model we're going to run with Hermes agent and out of a gazillion models on hugging face you can choose from, there are basically two or three models that it is the consensus right now at the moment of filming this video, it's Qwen 3.6 35B A3B or 3.6 27B. Now, it's important to check on hugging face these are unquantized models. To explain that further, imagine BF16 as being the original Blu-ray movie, the Blu-ray quality and quantized is the rip. So, instead of being 4 GB or 15 GB, whatever, it's 1 GB and is the same with local models. So, this model is unquantized safe tensors version which is about 55 GB when you download it to your SSD and let's say for example if you go to Ollama which uses quantized models, this is Q4 4-bit which is 17 GB. So, it has been compressed and you can use this on a much smaller VRAM graphics card or unified memory, but it'll not theoretically be as precise as the original model, but it's much faster.

[00:03:20]For example, I'm getting 65 tokens per second on my graphics card and the original safe tensors version is about 28.

[00:03:29]So, about double the speed.

[00:03:31]As a general rule of thumb, you want to use unquantized versions if it fits on your machine. If it doesn't, go with 8-bit or 4-bit until it fits on your graphics card or Apple machine and you can use it. And also one note for Apple silicon users, you should be using MLX versions of these models because it's just more optimized for Apple devices.

[00:03:54]And the last thing to go over before we hop in into Hermes and explain the setup and do some tests is Ollama versus LM Studio versus Llama.cpp versus vLLM. So, you'll notice there are multiple of these what do you call them? Software for serving AI models. And I'm going to focus mainly on Ollama and vLLM.

[00:04:21]And the easy way to explain this is you want to test with Ollama, play with Ollama, and then for serious agent serving is vLLM. It takes more to set up. Ollama is couple minutes to set up and you can test models different models switch models faster. And vLLM is you set a model and you run that and you can have multiple concurrent users using the same model. So, it's just a more robust software when you need it. But, to start, absolutely go with Ollama and that's what we are going to do now in Hermes Agent. To set up Ollama, we would go back here and into the home page of Ollama and you just have this >> [snorts] >> and put in your main machine or the inference machine depending on where you're hosting. If it's on the same machine, you do it on the same machine.

[00:05:11]If not, you do it on the machine where the GPU is or if it's a Mac mini or Mac Studio, you do it over there. So, you install Ollama over there. I have it installed, so I don't have to do that right now. And the next step is to download the model. You just run this on the same machine where you have Ollama installed. So, it'll pull this 17 GB quant 3.6 model and then you can keep using it easily. The next step is to go into Hermes Agent and we can create a new profile, so we don't use the main profile and if you mess something up, we can debug it with the main profile. So, we're going to do Hermes profile create or llama and then we're going to do setup.

[00:05:56]Quick setup.

[00:05:57]Custom endpoint.

[00:05:59]For all llama, the URL is wherever this is hosted. So, this can be an IP of the machine. Since this is my tail scale, I just named it RTX and it recognizes the local network immediately. The port remains the same, V1.

[00:06:17]A B H optional completion so number two and use the model that is recognized here. If you have multiple models, you can choose, but for now we'll just use this one because it's the best anyway.

[00:06:32]Context length the same.

[00:06:34]Name remains the same.

[00:06:36]Keep local.

[00:06:38]Skip. So, this is done now. We can open the chat with this profile.

[00:06:44]And we can say hi.

[00:06:46]And it'll just pull the model into the RAM and respond.

[00:06:52]So, we can see that this is working.

[00:06:54]We've verified that it's working and we can continue testing. And before we do the testing, I forgot to mention that for Hermes agent it's very important that the model you choose supports tool calls. So, you can use browser, you can use terminal, you can change files, write files, edit files and so on because that's the whole point of Hermes agent. If you're using it for a simple chat, you don't need those, but for this, for Hermes, you definitely need that. So, both of these are recommended and Gemma 4 support tool calls. And you can also research Deep Seek, whatever and figure out which ones you like, but the point is you need tool calls. And now let's go back. I cleared everything in here and we're going to do a comparison test now. Local Qwen 3.6 27B versus GPT 5.5 so frontier model and you can see the result. We'll test it on various tasks. Let's bring back a llama.

[00:07:54]And just do Hermes here. It'll open up.

[00:07:57]So this one is GPT 5.5 and this one is Quen 3.6 27B. Let me move it here so you can see everything hopefully. And the task we are going to give it is research.

[00:08:12]So you can see how it does tool calls compared to GPT 5.5 and then you can judge the final output.

[00:08:24]Okay, so both of them accepted. We're going to pause the video and then we'll go back and see what the output looks like and compare the result. And just to give you idea how the GPU is working, you can see here I undervolted it to 450. So the temperatures are much better now and you can see that we're using about 50% of the RAM and it's going at 100% utilization of GPU.

[00:08:54]So yeah, let's just wait and see how this finishes. So you can see that the local model already finished and you can see the output here. It did some tool calls here, web searches and I wanted to verify if it used web search and you can see that GPT 5.5 it's 10 minutes into it and it's still going. It wants me to confirm something for some reason as for simple task.

[00:09:21]But did you use web search? Yes, I did.

[00:09:23]So you can read it here. I wanted to confirm are these searches or web searches?

[00:09:31]And it said yes and then I asked it did you use sub agents or did you do everything yourself? It said it did everything itself. So, it finished in about 1 and 1/2 minutes. So, it's extremely fast on this GPU.

[00:09:48]And you can see that 5.5 is finally finishing. And let's compare the results.

[00:09:53]So, from top to bottom.

[00:09:58]And you can judge for yourself.

[00:10:01]This is decent as well for a research task.

[00:10:06]I like it very much.

[00:10:08]I generally like GPT 5.5 how it outputs the answer.

[00:10:16]But this is not too shabby, in my opinion.

[00:10:20]And you can see the research is very comprehensive and on first glance high quality. And you can even do this test yourself and then give it to you, let's say, Claude to analyze and say, "Hey, which research is the best? Like, which output is the best?" And then, you know, you'll get some sort of judgement on which one is better. And now, let's give it a coding task, which is to use the same context they have right now to create an HTML file.

[00:10:56]So, it'll involve coding and saving a file, so you can see how that compares.

[00:11:07]And as last time, we'll just pause this so we don't waste time and come back when it outputs the result so we can see it.

[00:11:13]Yeah, so you can see that Qwen finished in 3 and 1/2 minutes and GPT took a bit longer. I don't see the exact time for some reason. My terminal seems bugged a little bit. But let's take a look. So, this is Qwen's work.

[00:11:27]It seems like classic AI colors, I guess. The table is nice.

[00:11:33]Uh this part is not super well formatted.

[00:11:37]So, here the quick matrix, it makes sense, right?

[00:11:43]And this seems like a classic 5.5 presentation.

[00:11:48]In my opinion, it's much better than Quant's. But does this mean this is useless? No. If you give it a bit better instructions, it'll be on par with this, in my opinion, or or at least 90%. And obviously, we cannot compare a flagship model to a local model. The gap is shrinking fast. Local models are becoming more and more capable, and they'll soon catch up with frontier models, especially because frontier models get throttled all the time, and the performance just becomes worse and worse over time. Okay, let's go back. I cleared everything and started a new session, and we'll do the third test, which is to create a SVG file, so we can visually compare, because the research test is great, and it shows the tool calling, but it doesn't show us any visual comparison like the HTML or the SVG. So, let's try that.

[00:12:45]So, we have the third one, create SVG, and for Quant, we're going to do it this, and then for for GPT, we're going to do this.

[00:13:01]Let's see.

[00:13:03]So, this was an excellent test, actually, because local model took almost 5 minutes to complete, and GPT 5.5 took 22 minutes to complete the same task. And this is my point. GPT 5.5 sometimes works amazingly well, and sometimes it's absolutely useless.

[00:13:23]And maybe you get 85, 90% of quality with a local model, but you get that quality consistently and every day and every hour. There is no throttling of compute and sometimes you hit a bottleneck in compute and your performance suffers.

[00:13:41]But let's see what they did. By the way, this one wasn't even an SVG file, so I had to rename the file, which is kind of interesting.

[00:13:48]So, this is what Gwen made.

[00:13:52]I would say this is decent apart from all the white space, which is fine to me.

[00:13:57]But the quality seems absolutely decent. In comparison, this is what GPT 5.5 did.

[00:14:06]I guess both are cats and you have a fence and you be the judge. Which one do you think is better? Let me know in the comments. Which cat do you think is better here?

[00:14:16]But the point is 5.5 as a frontier model didn't create an SVG file, so I had to rename it and the quality is comparable. If nothing else, it's comparable. So, let's wrap it up for this video. Should you test and should you use local AI models with Hermes Agent? Absolutely, yes. And if you have a Mac or a GPU with 24 GB of memory or RAM, you can run some pretty capable models, quantized but still great. And if you care about privacy, if you have sensitive financial or personal documents that you need to analyze, you absolutely want to use local models for that if you don't want them to go and up in some Open AI or Anthropic server and be used in various ways. Your business or compliance requirements demand local processing, that's one use case. And you already have the hardware, then you definitely want to test and see how it works because the quality gap between local models and frontier models is closing and soon we'll get a situation where you have 90 to 95% quality in a local model versus a frontier model and the prices of frontier models keep going up and will not go down anytime soon. And the best way I found to use local models is actually to have a frontier model BD orchestrator. It creates the PRD, it creates the acceptance matrix, it slices the tasks into subtasks and issues, and then each one gets implemented by the local model worker. And then the orchestrator can review that work, and then you get the best quality with the least usage of your frontier model tokens. That's it for this video. Let me know in the comments, have you used local models? Which ones are your favorite? How do you use them with Hermes agent? And of course, subscribe and like this video if you found it useful. It means a lot to me. Thank you so much.

#hermes agent #hermes agent local #local llm models #qwen 3.6

相关推荐

BREAKING: Microsoft’s New Image Generating Model Beat Out GPT 1.5 and Nano Banana 2

aimmediahouse

122 views•2026-06-03

Long-Running Agents — Build an Agent That Never Forgets with Google ADK

suryakunju

142 views•2026-05-30

I Made the Same Anime Fight Scene in Every AI Video Generator

NobleGooseAnime

295 views•2026-05-30

Nvidia Bets Big On AI PCs | New Chip To Power Windows Laptops | Technology | AI Updates | N18S

cnnnews18

3K views•2026-06-01

3D Platformer Update - NO CAPES

SolarLune

294 views•2026-05-30

AI Doesn't Create Bias — It Inherits It

UXEvolved

176 views•2026-06-01

Distributed Inference Challenges Explained #shorts

alexa_griffith

466 views•2026-05-31

[한글자막] OpenAI @ Replay 2026 | OpenAI는 Codex로 개발 방식을 어떻게 바꾸고 있을까요?

TechBridge-KR

1K views•2026-06-03

热门趋势

Why Batman Lets The Joker Live 🤨

zackdfilms

9222K views•2026-05-30

This spider is a VAMPIRE (Kinda...)

moreparz

2764K views•2026-06-02

计算机科学

Making Ai Choose Where I Eat

Tyrecordslol

3080K views•2026-06-03

They're Complete Trash

penguinz0

558K views•2026-06-04