This video demonstrates how to run Claude Code completely offline and free using Llama.cpp with Gemma 4 E4B model, requiring WSL installation on Windows, Python environment setup, CUDA-enabled Llama.cpp build, 70K context window configuration, and integration with the free-claude-code proxy server to enable local AI coding without API keys or subscriptions.
Deep Dive
Run Claude Code Locally for FREE — Llama.cpp + Gemma 4 + 70K Context
What model are we using? This is going to start working, and if you go to the llama.cpp server, you can see that it is processing here. The output is done, so it cooked for 39 seconds, and we're currently using this model. We have this instance of llama.cpp. We have the Gemma 4 E4B IT Q8 model, which is a very good model in GGUF format, running on llama.cpp on my local system. Once again, we have the CPU and RAM usage, and we have the GPU usage. You can see it's almost 100% used. And this is the place where the server is running on port 8080. Okay, let's go ahead and get started with everything.
First and foremost, you need to install the Ubuntu app. If I click on Ubuntu, I have the Ubuntu app running, and this starts up the CLI. Then you need to install WSL, which is the Windows Subsystem for Linux. So, open PowerShell as administrator and run wsl --install.
This will install WSL on your system, so now you have Ubuntu and WSL. Now we can go to Start and type WSL, and this is going to start up WSL. As you can see, this started up. I'm inside an environment here, and I'm going to deactivate that environment for now.
I'm going to show you how to install that as well. We need an environment, so I can say conda deactivate to get out of the environment here. Now we are not inside any environment, so we are in a fresh state. We need to install the environments as well, and we can use Anaconda or Miniconda. For installing Miniconda, you can go to this Miniconda install link, go to Installing Miniconda, and then pick the Windows shell installer, for example.
You can go to this PowerShell button, copy this command, run it, and just follow the instructions that you see here. Once you have that, you can run conda info --envs to see the list of environments. Right now I don't have any environments, just the base environment.
I'm going to clear everything and make a new environment, okay? So, let's see the list of folders that I have here. I have this hf-env. If you want to skip the entire Conda step, you can do that. You don't need to install Conda.
You just need to have Python installed, and you can go ahead and create this Python environment as well, which is what I did. I ran python -m venv hf-env. Doing this creates an environment known as hf-env in the home folder. After that, you can run source hf-env/bin/activate, which activates hf-env. And if that is done, you are 20% there. So, we have this hf-env and we're inside the hf-env environment.
Now, at this point, I can go ahead and run nvidia-smi. I can see that this is the GPU that I have, which is an 8 GB NVIDIA GeForce RTX 4060. This is my driver version, and this is the CUDA version that I have. You can see that no processes are running on the GPU as of now.
At this point, I can go ahead and update. I can say sudo apt update and sudo apt upgrade -y. I just run this, put in the password, and you can see this is updating and upgrading. Now, if sudo is not working on your system, you can go to Settings, search for sudo, and you will see the setting to enable or disable sudo. Just enable this setting to make sudo available on your system. So, you can see that we have updated and upgraded everything, and now there is nothing left to upgrade.
Next, we can do some installations: sudo apt install -y build-essential cmake git curl wget. We need build-essential, CMake, Git, curl, and Wget, so we install everything. And you can see that I have essentially installed everything.
So, this is a one-step command to install these five packages that you see here. Next, we want to install the Hugging Face Hub as well. So, pip install huggingface_hub, and this is done as well. The command is pip install -U with the Hugging Face Hub CLI package.
Next, we want to git clone the llama.cpp repo. We can see that we have this amazing repo, llama.cpp, with 109k stars. Essentially, we can start up an OpenAI-compatible server using this command.
It's really easy. It's a plain C/C++ implementation. As a matter of fact, Ollama and LM Studio are based on llama.cpp as well, so they are wrappers around llama.cpp. When you work with llama.cpp directly, you get quantization, quantized models, faster inference, and a lot more control compared to Ollama and LM Studio. So, we go ahead and clone this repo: I say git clone and the repo URL. Once you do that, you will have llama.cpp downloaded here, and you can see that this is the folder. Then you can go inside llama.cpp and see the list of files that we have here.
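Before moving on to the build, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not something shown in the video: it assumes it is run inside the WSL Ubuntu shell, and it uses gcc as a stand-in for the build-essential package and nvidia-smi as a stand-in for a working GPU driver. The function names are hypothetical.

```python
import shutil

# Tools named in the walkthrough. Assumption: gcc represents the
# build-essential package, and nvidia-smi represents the GPU driver.
REQUIRED_TOOLS = ["gcc", "cmake", "git", "curl", "wget", "nvidia-smi"]


def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found):
    """Return the names whose lookup came back empty."""
    return [name for name, path in found.items() if path is None]


if __name__ == "__main__":
    found = find_tools(REQUIRED_TOOLS)
    for name, path in found.items():
        print(f"{name:12} {path or 'MISSING'}")
    missing = missing_tools(found)
    if missing:
        print("Install before building:", ", ".join(missing))
```

If anything is reported as missing, rerunning the apt install step above is the likely fix.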
So, this is the list of files. Now, at this point, if I go to C:\Users and then my username, I create a new file known as .wslconfig.
In here I've written this simple thing. I've written that I want WSL 2, and I want a memory of 12 GB because I have 16 GB of RAM. I want WSL to use a maximum of 12 GB, I want it to use six processors, and the swap memory is 12 GB. So, go ahead and make this file. Once you make this file, WSL is going to use this configuration, which is really essential. I faced some issues while working, so I had to restrict this to 12 GB.
Okay, the next step is to run these two commands: cmake -B build -DGGML_CUDA=ON and cmake --build build -j 2. You can use two, you can use four. This is the number of parallel jobs. For me two worked, so you can try two. If you run this, you will have the build folder on your system with all the files within it. If I go to build and run ls -la just to show you, you'll see all the files that are created when you run those two commands. Now, I can go back to the llama.cpp folder. Next, what I want is to get the models. You can see that we have this models folder.
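The .wslconfig settings described above can be sketched as a small generator. This is an illustration rather than the exact file from the video: it assumes the standard [wsl2] section with the memory, processors, and swap keys, and the function name is hypothetical.

```python
def render_wslconfig(memory_gb, processors, swap_gb):
    """Build the text of a .wslconfig file with a single [wsl2] section."""
    lines = [
        "[wsl2]",
        f"memory={memory_gb}GB",      # upper bound on RAM that WSL may use
        f"processors={processors}",   # number of logical processors for WSL
        f"swap={swap_gb}GB",          # size of the WSL swap file
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values from the video: 12 GB of a 16 GB machine, six processors, 12 GB swap.
    print(render_wslconfig(12, 6, 12), end="")
```

The printed text would be saved as .wslconfig in the Windows user folder. WSL generally has to be restarted (for example with wsl --shutdown) before new limits apply.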
So, if I go inside that models folder, you can see that we have all these models. And if I go from This PC to Linux, then Ubuntu, home, my username, and then llama.cpp and models, you can see that I have downloaded two models here.
One is the Qwen 3.5 35 billion A3B, a 3-bit quantized GGUF model, and the other is this Gemma 4 E4B. To get them, go to huggingface.co, then go to Unsloth, and go to, for example, this one, which is the exact model that we have here: 3.6 Qwen 35 billion A3B GGUF.
If you go to Files and versions, you can see that we have this file. What I did was download this file: click here and download the file to the models folder. Okay? This worked really fine. This is a very big model, 35 billion parameters, and I have an 8 GB GPU only. It worked, but the problem with using it with a system like Claude Code is that I could not increase the context length beyond 16K. For working with Claude Code, Open Claude Code, or free-claude-code, we need a context length of more than 60K, because the initial request that it sends to the LLM is more than 60K, or about 64K tokens.
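The context-length argument is simple arithmetic, sketched below under two assumptions: "about 64K tokens" is taken as 64,000, and the 70K window this video uses is taken as 70,000. The function name is hypothetical.

```python
def headroom(context_window, initial_prompt_tokens):
    """Tokens left for the conversation after the initial request is loaded."""
    return context_window - initial_prompt_tokens


# Assumption: "about 64K tokens" from the video, taken as 64,000.
INITIAL_PROMPT = 64_000

if __name__ == "__main__":
    for window in (16_000, 70_000):
        left = headroom(window, INITIAL_PROMPT)
        status = "fits" if left > 0 else "does not fit"
        print(f"context {window}: {status}, headroom {left}")
```

With these numbers a 16K window falls 48,000 tokens short, while a 70K window leaves roughly 6,000 tokens for the actual conversation.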
So, we need a context length of more than about 70K, which was possible with this model, Gemma 4 E4B. If I go to Unsloth again and go to the Gemma 4 model, you can see E4B IT. This is the model. If I go to Files and versions, you can see that this Q8 fits my GPU and RAM, so I've downloaded this one. You need to keep some space in your RAM as well, because we are going to bump the context length to about a 70K context window. So, we are using this model, and I've downloaded it here.
At this point, we are all set to start up the server. I say clear and go back to the llama.cpp folder. Now I want to visualize the GPU usage, so I open up WSL again in a separate instance and say conda deactivate. Here, I'm going to put this command: watch -n 0.5 nvidia-smi. This will refresh every 0.5 seconds and show everything that is going on with my GPU. You can see that right now I have almost no usage and no processes are running, so I can keep this open.
Next, I want to see the CPU and RAM usage as well. I open a new WSL instance and say conda deactivate here as well. Here, I'm going to run sudo apt install htop -y. This is going to install htop, and then if I say htop, it's going to give me a beautiful view of what's happening. You can see that we have the CPU usage here, then the RAM, and the swap memory as well. Okay, now let's put in this beautiful command, ./build/bin/llama-server.
And this is the model that we are using.
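As a rough sketch of the launch command described in this step, the arguments can be assembled like this. The flag names are assumptions mapped from the settings named in the video (port 8080, host 0.0.0.0, flash attention on, a context of about 70,000, and 999 GPU layers). The model filename is a placeholder, not the real file name, and older llama.cpp builds take --flash-attn as a bare switch without a value.

```python
import shlex


def llama_server_args(model_path, host="0.0.0.0", port=8080,
                      ctx_size=70000, gpu_layers=999, flash_attn="on"):
    """Assemble a llama-server command line as a list of arguments."""
    return [
        "./build/bin/llama-server",
        "-m", model_path,            # path to the GGUF model file
        "--host", host,              # listen on all interfaces
        "--port", str(port),
        "--flash-attn", flash_attn,  # assumption: recent builds accept on/off/auto
        "-c", str(ctx_size),         # context window in tokens
        "-ngl", str(gpu_layers),     # offload as many layers to the GPU as fit
    ]


if __name__ == "__main__":
    # Placeholder filename: substitute the file downloaded into models/.
    args = llama_server_args("models/your-gemma-4-e4b-it-q8.gguf")
    print(shlex.join(args))
    # To actually start the server from Python:
    # import subprocess; subprocess.run(args, check=True)
```

Printing the command first makes it easy to copy into the terminal and adjust the context size if the model does not fit in memory.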
So, it's Gemma 4 E4B IT, quantized to 8 bits, which is a really good quantization. Then we have the port here, 8080. The host is 0.0.0.0. Flash attention we're turning on, and the context length we have to bump up to about, let's say, 70,000. And this value is 999, which means it will try to load all the layers onto the GPU itself, or the maximum that is possible. If the GPU does not have enough memory, then you can fall back on the 11 GB of memory here.
Okay, so let's go ahead and get started. With llama.cpp, you know, it takes a little bit of time, but then again you get a lot of settings that you see here. You basically get a lot of control and many other configurations. There are so many new innovations coming from Google and other companies that show you how to run this faster, and Unsloth has come up with fine-tuning work. We are in fact using Unsloth models, and there is a lot of innovation.
You can see here that 7895 out of 8188 MiB of GPU memory has been used, and in the RAM we see 1.47 GB used out of 11.6 GB. This is really good. You can see that it's actively listening on 0.0.0.0:8080. If you go to localhost:8080, you can see that we have this beautiful front end with the model loaded. You can see that it is Gemma 4 E4B and the context size is 70K tokens. Now you can go ahead and do the inference. What is 1 plus 8? Just put that in here, and you can see that it's processing. You can see the token speed, which is about 22 tokens per second, and that is really great.
With this, you can now go ahead and connect it to free-claude-code. Let me know if you want a very detailed video of this llama.cpp installation. What I did just now was very fast paced, but if you want a dedicated video on the controls and the different settings, just go ahead and comment, and I will make a dedicated video on llama.cpp.
For now, let's go back to the free-claude-code repo and get started with Claude Code. Let me go to a folder where I want it to be. I select this folder, type CMD here, and clone the repo. So, copy this, say git clone, and put in the repo URL. It says cloning into free-claude-code, so we have cloned this repo. We go inside the free-claude-code repo and open it in the Visual Studio Code editor. This opens up, and now you can see the files that we have here. We need the environment variables, so copy .env.example and rename it to .env. I have multiple videos where I've shown you how to integrate with NVIDIA NIM, OpenRouter, LM Studio, and Ollama. In this video we're going to do it with llama.cpp. So, first and foremost, you need to change this name here to llama.cpp.
So, llama.cpp, and we need to put the name of the LLM here, which is Gemma 4 E4B IT Q8 GGUF. We put in this model, and we're using the llama.cpp base URL on port 8080. The other things can be left as they are. You can go to the README file and see the other things that are necessary. I can see that this repo is really great and really catching on. First and foremost, we need to install uv, then update uv, and then install Python. So, we go to our terminal and open up a new terminal here.
Basically, go to Windows PowerShell.
It's easier in Windows PowerShell because we have all the execution rights there. So, we paste in the command and run it. This will install from Astral. It's going to download uv, do the installation, and then check for updates, and Python is already installed. The next step is uv tool install, which installs the proxy. So, go ahead and paste in the command here, and this is going to install the different libraries. Basically, it's going to do a uv sync and install everything. Now we can just run the free-claude-code command, and this will get you started. I can see that it has started, and I can allow it. So, I'm on localhost port 8082, and I can go to this admin page. We validate the admin, and then you can go to the 8082 admin page here and see this beautiful interface.
So, we can now visually see what is happening. We have a missing key here for NVIDIA NIM, OpenRouter is configured, and in this video we're really interested in this one, llama.cpp. You can test this out here, and if I click on Test, you can see that we have one model available and this service is running.
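The same availability check can be done without the admin page by asking the llama.cpp server which models it is serving. This is a minimal sketch, assuming llama-server is listening on localhost:8080 and answers the OpenAI-style /v1/models route. The function names are hypothetical.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8080", timeout=5):
    """Query the server and return the ids of the models it reports."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


if __name__ == "__main__":
    try:
        print(fetch_model_ids())
    except (OSError, ValueError) as err:
        # Connection failures surface as OSError, malformed JSON as ValueError.
        print("llama-server not reachable:", err)
```

If the list comes back with one entry, it should correspond to the Gemma model started earlier.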
I can see that the model available via the llama.cpp server on 8080 is this model, which we have started. So, we have the 8080 service running here, and in summary this service can use all these things. It can use Ollama, LM Studio, llama.cpp, OpenRouter keys, and NVIDIA NIM. Go ahead and check out all the videos that I have in the description. I have about five or six videos on configuring free-claude-code with all these services.
Next, go back to the repo, and you can see that we can just run this. Go to a new PowerShell and run it as admin. We are setting the Anthropic auth key to freecc, the Anthropic base URL to this, and we enable the gateway model discovery, and then we run Claude. But before we run Claude, we need to install it as well, so copy this install command and run it in PowerShell first. Once you have run that, you can go ahead and start up Claude. It seems like a long process, but it is definitely worth it, because this lets me use Claude Code on my local system.
So, I just started Claude here, and the first thing that I need to do is go to the models with /model. You can see that we can select from all these models. The first is the default Sonnet, and in the fifth one we can see our model, which is Gemma 4 E4B. Just select this and set the effort to low, since this is a local model and we are just getting started. Now if I send the prompt, what model are we using? This is going to start working, and if you go to the llama.cpp server, you can see that it is processing here.
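The environment setup described here can be sketched as follows. This is an illustration under stated assumptions: ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN are the variables Claude Code reads for a custom endpoint, the token value freecc is a reading of what is said in the video, and the proxy is assumed to be on localhost:8082. The variable that enables gateway model discovery is not named in the video, so it is left out. The function name is hypothetical.

```python
import os


def claude_env(base_url="http://localhost:8082", auth_token="freecc", base_env=None):
    """Return a copy of the environment pointing Claude Code at the local proxy."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url      # the free-claude-code proxy
    env["ANTHROPIC_AUTH_TOKEN"] = auth_token  # assumption: placeholder token
    return env


if __name__ == "__main__":
    env = claude_env()
    print(env["ANTHROPIC_BASE_URL"], env["ANTHROPIC_AUTH_TOKEN"])
    # To launch Claude Code with these settings:
    # import subprocess; subprocess.run(["claude"], env=env)
```

Passing a copied environment keeps the settings local to the launched process instead of changing the shell session.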
So, that's really great. It's processing everything, and we will have the output in just a moment. I can see that the output is done. It cooked for 39 seconds, and we are currently using this model. So, this is how you can go ahead and use it. What's very important is that, since this is working, our 70K context window is working really fine. If you go to llama.cpp directly and do the inference, it will be around 22 tokens per second, but since this works with a huge context, it will be a little slower. Nonetheless, we are now able to use Claude Code and all its commands as well. I have this init command, and let me know if you want me to create a separate video on all these commands. There are about 30-plus commands here, and each of them is really important. You can go ahead and add your own skills as well.
So, in summary, what we have done in this video is this. We started the llama.cpp server on port 8080. Then we started another service on 8082, which uses llama.cpp. We can use others as well, as shown here, but we are using llama.cpp, and we are able to see that it's using the llama.cpp server running on 8080 and the model that is running there. Then we installed Claude Code using this command, went to PowerShell, and started Claude, and now this is entirely local. You can go ahead and disconnect your internet as well, and this will still work. We just need to change the model to the one that is running on llama.cpp.
So, let me know if you face any issues. My GPU is an RTX 4060 with 8 GB, and I have 16 GB of RAM, out of which I've dedicated 12 GB to WSL using the configuration that I've already shown you. If you want to check out how to do the same thing with Ollama, with LM Studio, or with the free NVIDIA NIM endpoint, you can check out all those videos. I will put them in the description.
Now, if you're looking to host your services somewhere, you can use this KVM 2 plan from Hostinger. You can see that I have in fact kept so many of my services running here: entire websites, the YouTube monitor, Partner Stack, the Gmail monitoring, and automatic posting on X. I have all these services running, and apart from that, I have Open Claw and Hermes running on the same instance as well. So, if you're interested in getting an instance for yourself, go ahead and check out the link and get a KVM 2 from Hostinger. I'm really excited. There are a lot of things to do and a lot of automations to build, so go ahead and get started. Like, share, and subscribe, and I will see you in the next one.