The Kame model by Sakana AI implements a tandem architecture that combines a fast speech-to-speech frontend with a powerful backend LLM, enabling real-time conversational responses while progressively refining answers through 'Oracle signals' injected back into the system, effectively balancing the low latency of direct speech models with the deep knowledge of cascaded systems.
Deep Dive
Sakana KAME: Speech AI with Tandem Architecture: Run Locally
Sakana AI is a Japanese lab that always comes up with fascinating and creative ideas, and we have covered each and every one of them. But, honestly, sometimes the implementation complexity gets in the way, and the results fall short of the ambition.
Today, though, they have something that is both clever in concept and solid in execution, but again, the complexity of implementation is still there.
This new model is called Kame, which means turtle in Japanese, and it is a tandem architecture for real-time speech-to-speech conversational AI.
Now, in this video, I'm not only going to show you how to get it installed and try it out, but more importantly, I'm going to explain exactly how Kame works.
I'm going to play this video from their blog post.
Now, what is the actual problem this Kame is trying to solve?
You have two existing approaches. First are direct speech-to-speech models like Moshi.
These respond instantly, near zero latency, but they lack deep knowledge because the model has to spend its capacity on both verbal and non-verbal audio features, leaving less room for actual knowledge.
Second are cascaded systems that convert speech to text, pass the text to a language model, and then run text-to-speech on the answer.
These are knowledgeable, but slow because they must wait for you to finish speaking before they even begin processing.
Kame sits right in between. The idea is simple. Run a fast speech-to-speech model on the front end for immediate response, and at the same time, run a powerful backend LLM in parallel.
As you keep talking, the backend LLM is already generating progressively refined responses and injecting them back into the frontend as what they call Oracle signals. So, instead of "think, then speak", Kame does "speak while thinking".
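To make "speak while thinking" a bit more concrete, here is a toy Python sketch of the loop described above. It is not Kame's code: `backend_llm` and `speak_while_thinking` are hypothetical names, the backend is a stand-in that just echoes what it has heard, and real systems work on audio frames rather than whole words. The point it illustrates is only the ordering: at every step the frontend acts with whatever hint it already has, and the backend's refined hint becomes available from the next step onward.

```python
from typing import List, Optional, Tuple


def backend_llm(partial_transcript: str) -> str:
    """Stand-in for the backend LLM (hypothetical): drafts an answer
    from the text heard so far."""
    return f"draft answer for: {partial_transcript!r}"


def speak_while_thinking(
    words: List[str],
) -> Tuple[List[Tuple[int, str, Optional[str]]], Optional[str]]:
    """Simulate the tandem loop one word at a time.

    Returns a timeline of (step, word_heard, oracle_available_at_that_step)
    plus the final, most refined oracle hint.
    """
    heard: List[str] = []
    oracle: Optional[str] = None
    timeline = []
    for step, word in enumerate(words):
        heard.append(word)
        # The frontend never waits: it uses the hint that already exists.
        timeline.append((step, word, oracle))
        # The backend refines its draft on the longer partial transcript;
        # the new hint is only visible to the frontend on the next step.
        oracle = backend_llm(" ".join(heard))
    return timeline, oracle


timeline, final_hint = speak_while_thinking(["what", "is", "kame"])
for step, word, hint in timeline:
    print(step, word, hint)
print("final:", final_hint)
```

At step 0 the frontend has no hint at all, which matches the idea that the first response comes from the fast speech model alone and gets smarter as hints arrive.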
Before I show you the installation, allow me one last diagram because I still believe their diagram is a bit hard to grasp. Look at this diagram.
You can see exactly how this works. Your voice goes into the frontend, which has an encoder, a transformer, and a decoder.
The transformer handles four streams at the same time: your audio in, the audio out, an inner monologue, and the Oracle stream. At the same time, your speech is being transcribed by a streaming speech-to-text component and fed to the backend LLM, which can be any model like Gemini, Claude, or GPT.
As the backend generates its responses, those Oracle signals flow back into the transformer in real time, making the output progressively smarter without adding any latency. I hope it is now clear how Kame works.
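If it helps to picture the four streams, here is a minimal Python sketch of what one time step could look like as a data structure. The field types and the `Frame` and `build_frames` names are my own assumptions for illustration, not Kame's actual token layout. The only thing taken from the diagram is that each step carries audio in, audio out, an inner monologue, and an Oracle slot that may or may not be filled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Frame:
    """One time step across the four streams (hypothetical layout)."""
    audio_in: int                   # token for the user's incoming audio
    audio_out: int                  # token for the audio the model speaks
    inner_monologue: Optional[str]  # text the model "thinks" at this step
    oracle: Optional[str]           # hint from the backend LLM, if one arrived


def build_frames(
    audio_in: Sequence[int],
    audio_out: Sequence[int],
    monologue: Sequence[Optional[str]],
    oracle_by_step: Dict[int, str],
) -> List[Frame]:
    """Zip the streams together; steps with no backend hint get oracle=None."""
    if not (len(audio_in) == len(audio_out) == len(monologue)):
        raise ValueError("all streams must cover the same number of steps")
    return [
        Frame(a_in, a_out, text, oracle_by_step.get(step))
        for step, (a_in, a_out, text) in enumerate(
            zip(audio_in, audio_out, monologue)
        )
    ]


frames = build_frames(
    audio_in=[11, 12, 13],
    audio_out=[0, 0, 21],
    monologue=[None, None, "Kame"],
    oracle_by_step={2: "Kame means turtle"},  # hint arrives at step 2 only
)
print(frames[2])
```

The Oracle slot being empty most of the time is the design point: the frontend keeps running on every step and simply conditions on a hint whenever the backend has produced one.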
Now, let's get it installed. I'm going to use this Ubuntu system. I have one GPU card, Nvidia RTX A6000 with 48 GB of VRAM.
I will be using the uv package manager.
As you can see, I already have it installed.
If you're looking to rent a GPU at a very good price, you can find the link to Mass Compute in the video's description, with a coupon code for 50% off a range of GPUs.
Now, even before I begin installation, this is where I think Sakana AI is still struggling to make things easy.
There are two prerequisites that you need. Yes, the model runs locally, but you still need an OpenAI API key.
Kame uses this for the backend LLM calls. Secondly, you need a Google Cloud credential JSON file with the Speech-to-Text API enabled on your Google Cloud project. So, yes, you would also need access to Google Cloud. You can leave it out, but then the whole pipeline won't work.
So, if you want to grab an OpenAI API key, go to platform.openai.com.
If you want to grab your Google credential, go to console.cloud.google.com.
Over there, you would need to create a service account, grant it the speech client role, and download the credential JSON file, and that's pretty much it. So, I already have an OpenAI API key.
I'm already logged into my Google Cloud account. If you don't know how to do that, just search my channel. I have full dedicated videos on how to create the Google Cloud credential file and also the OpenAI API key.
So, if you're following along, I'm hoping that you have both of these as well.
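Before starting the server, a quick sanity check for the two prerequisites can save some head-scratching. The sketch below assumes the conventional environment variable names, `OPENAI_API_KEY` (read by the OpenAI Python SDK) and `GOOGLE_APPLICATION_CREDENTIALS` (read by Google Cloud client libraries). Whether Kame reads exactly these variables is my assumption, so check the project's README. `missing_prerequisites` is a hypothetical helper name.

```python
import json
import os
from pathlib import Path
from typing import List, Mapping, Optional


def missing_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; empty means both are in place.

    Variable names are the conventional ones, assumed rather than confirmed
    for Kame itself.
    """
    env = os.environ if env is None else env
    problems: List[str] = []

    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    cred_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not cred_path:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not Path(cred_path).is_file():
        problems.append(f"credential file not found: {cred_path}")
    else:
        try:
            info = json.loads(Path(cred_path).read_text())
        except (OSError, ValueError):
            problems.append(f"credential file is not readable JSON: {cred_path}")
        else:
            # Service account key files carry "type": "service_account".
            if not isinstance(info, dict) or info.get("type") != "service_account":
                problems.append("credential file is not a service account key")
    return problems


for problem in missing_prerequisites():
    print("missing:", problem)
```

This only checks that the pieces exist locally. It does not verify that the key is valid or that the Speech-to-Text API is enabled on the project.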
Hopefully, in the future, Sakana AI would just simply get rid of both of these and provide us a local or their own model, which we could install locally and then run this end-to-end speech-to-speech system.
Or maybe give us some free local Ollama-based model for the LLM. Anyway, let me first make a new directory and initialize a uv project in it. So, the uv project has been initialized.
And now we can install Kame directly from GitHub by using our uv package manager. It is going to download and install all the prerequisites.
And everything is installed. Now, let me run my Kame server. They have also shared this command where you just run it with uv on your localhost at port 8998.
And it is downloading the model, so let's wait.
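Since the first start has to download the model, the port will not answer right away. If you would rather script the wait than watch the terminal, a small polling helper like this one works. It is a generic sketch using only the Python standard library, with nothing Kame-specific in it apart from the port number 8998 mentioned above. `wait_for_port` is a hypothetical helper name.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout_s: float = 600.0, poll_s: float = 2.0) -> bool:
    """Poll until something accepts TCP connections on host:port.

    Returns True as soon as a connection succeeds, or False once
    timeout_s has elapsed without success.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the server socket is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


# Example (run while the server is starting up):
#   wait_for_port("localhost", 8998)
```

A listening port only tells you the web server is up. It says nothing about whether the backend LLM and speech-to-text credentials are working.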
And the model is almost downloaded. It is huge, I agree.
And the server is running. The model is fully loaded along with everything else. Let me quickly show you the VRAM consumption.
It touched over 18 GB of VRAM. Not bad for this size of model.
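If you want to read the VRAM number from a script instead of eyeballing it, `nvidia-smi` can print memory usage as plain CSV. Here is a small Python sketch around that. The `--query-gpu` and `--format` flags are standard `nvidia-smi` options, while `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, and the sample numbers in the example are made up for illustration.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' rows (in MiB) into (used, total) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for used and total memory of every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


# Made-up sample output for a single 48 GB card, not a real measurement:
sample = "18432, 49140\n"
for used_mib, total_mib in parse_gpu_memory(sample):
    print(f"{used_mib / 1024:.1f} GiB used of {total_mib / 1024:.1f} GiB")
```

On a machine with an NVIDIA driver installed, calling `query_gpu_memory()` while the server is running gives the live figures.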
Okay, so now let's quickly test it out.
And there you go. Our server is running and yes, looks like Moshi to me. I'm just going to click on connect.
>> You know, that old picture from the Indian Ocean? It's pretty striking, isn't it? With all those fishing boats, it seems like a very busy place. I'm thinking about their daily routine. You know, and what they do with all their catch. So, how about if we talk?
>> would be, Emily?
Can I talk?
>> Mhm. That's a good question. They're not just out there to fishing for whatever live or whatever, or to discuss whatever whatever you'd like to discuss like or to discuss discuss could discuss whatever you like. You're leading for the fishing like or
>> You see, it is not really listening to me. I will just try again.
Hello there.
>> The following figure is the topographic distribution map of an area in the eastern region with a map showing the direction
>> Can I pause you for a second?
I need to pause you.
So, what is your name?
>> My name is Moshi. How can I help you next?
>> Okay, just listen. Keep quiet and listen to my question. So, I have this problem.
The thing is that I have this neighbor who is a 26-year-old girl who has been divorced 19 times.
And whenever I go outside in my backyard and I mow my lawn, she starts jumping on the trampoline in bikini. And you know what? I have been mowing my lawn all day now.
And the grass is gone. What should I do?
>> I'm listening carefully. Please go on to tell me what you need help from your neighbor. I'm assuming this person doesn't live in your house, so you can't just go over and smother them, right?
So, the problem is you're out in the backyard mowing and you can hear your neighbor jumping, but you can't see them.
>> No, I can see I can. No, I can see her.
>> Listen to you.
>> I can see her.
>> Verify what you're describing.
>> Yeah, so I can see I can see her and that is a problem.
That is such a distraction.
And my lawn is suffering because I'm just mowing it the whole day. And
>> If it's affecting you, you might need try to focus on your own activities.
>> But can you listen?
The thing is that you're not listening.
You're not listening.
You need to listen.
Okay, now I think you're hallucinating.
>> I'm here for you. Please go ahead and tell me about your listening want to listen from trying to give you more advice on that. I'm here to listen to you. So, I'm not you're here that I'm here to get
>> So, I'll disconnect. So, you see, we have also covered Moshi on the channel.
And Moshi was awesome. I'm just tunneling it to my laptop while I learn.
So, Moshi was awesome, but this one, I think, is not really working.
The speech to speech is not there.
And if it is thinking, the thinking is not there. It was hallucinating.
We are using a very powerful OpenAI model here, but still, this is not really working. We are also using the Google speech API, so it should be quite good, but it's not.
So, this is where I think Sakana promises a lot, but then drops the ball.
I believe they not only need to make their model much more resilient, but also come up with their own components instead of relying on Moshi, OpenAI, and Google's API.
Anyway, maybe in the next version it will be much better. I'm bringing this to you as is, raw, no editing, nothing, just the facts and the real-world hands-on demo. Let me know what you think. Please follow me on X for AI updates, and please become a member if you are looking to support the channel.
Thank you for everything.