
AI Voice Cloning Just Got Way Too Real... Scenema Audio TTS
Added: 2026-06-01

758 views | 79 | 22:08 | AIQuestAcademy | Originally released: 2026-05-31

This video demonstrates Scenema Audio, an AI text-to-speech model that extracts voice capabilities from the LTX-2.3 video model to create highly expressive, scene-aware audio generation. The model enables zero-shot voice cloning where a neutral voice sample can be transformed into highly emotional speech (anger, sorrow, laughter), accent transfer between languages while preserving vocal identity, and voice design from text descriptions without reference audio. The system uses XML-formatted prompts with voice descriptions, action tags, and dialogue tags to control emotional delivery. While the model excels at capturing vocal identity and extreme emotions, it shows limitations in multi-speaker differentiation and whisper quality. The technology represents a significant advancement over traditional emotionless TTS systems by enabling cinematic, performative audio generation for filmmaking, dubbing, and creative applications.

[00:00:00]Most AI text-to-speech models sound like emotionless robots reading a script, but the game just completely changed. You guys have been blowing up my comments asking for a way to run the most expressive text-to-speech model released recently. So, today I'm giving you the exact Kaggle notebook to run Scenema Audio completely free. It's a scene-aware audio engine that actually understands context, and here's where it gets really interesting. The developers actually extracted the raw diffusion audio power right out of the LTX-2.3 video model because its emotional range is unmatched. Let's look at their official blog first to see what this pipeline can actually do. They've added zero-shot voice cloning, specifically expressive zero-shot cloning. This means you can give it a completely neutral or slow-paced voice sample and prompt it to generate high-intensity or highly emotional speech, and it handles it easily. I'll play this input sample of President Obama they have here. It's a normal, calm, 16-second clip.

[00:01:04]>> My biggest disappointment as president I've heard folks say that having the families of victims lobby for this legislation was somehow misplaced.

[00:01:13]A prop, somebody called them.

[00:01:16]>> I won't play the whole thing, but you get the idea. Now, listen to the output they generated by prompting it to be highly expressive and angry.

[00:01:24]>> I got places to be, you stupid son of a [ __ ] What the [ __ ] are you waiting for? You a written invitation?

[00:01:32]>> Wow, it cloned his exact voice perfectly, but injected pure rage. If you look at how they structured the prompt, they use this XML format. If we click on the XML version, you can see the setup. First, you write a voice description like "a gravelly male voice, fast-talking, rough," with the scene set to "absolute silence" and the gender set to male. Then, you add an action tag describing the scene intensity, like "lays on the horn, spit flying, veins in his neck popping as he completely loses it." Finally, the dialogue goes inside a speak tag. Here is another crazy feature, accent transfer. They have an input reference of an Australian woman speaking with her native accent.
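The three-part prompt described above (voice description, action tag, speak tag) could be assembled programmatically along these lines. This is only a sketch: the element names `prompt`, `voice`, `action`, and `speak`, and the `scene` and `gender` attributes, are guesses based on how the prompt is read out in the video, not the model's documented schema, so check them against the XML shown on the official blog before using them.

```python
import xml.etree.ElementTree as ET


def build_prompt(voice, action, line, scene=None, gender=None):
    """Assemble a voice description, an action, and a spoken line as XML.

    All tag and attribute names here are hypothetical placeholders.
    """
    root = ET.Element("prompt")              # hypothetical wrapper element
    voice_el = ET.SubElement(root, "voice")  # hypothetical tag name
    voice_el.text = voice
    if scene is not None:
        voice_el.set("scene", scene)         # hypothetical attribute
    if gender is not None:
        voice_el.set("gender", gender)       # hypothetical attribute
    ET.SubElement(root, "action").text = action
    ET.SubElement(root, "speak").text = line
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_prompt(
        voice="A gravelly male voice, fast-talking, rough",
        action="Lays on the horn, veins in his neck popping",
        line="If that door opens, we run.",
        scene="absolute silence",
        gender="male",
    ))
```

Building the prompt through an XML library rather than string concatenation means characters such as `&` or `<` in the dialogue get escaped automatically.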

[00:02:14]>> Yeah, it's just kind of like weird cuz when I checked my crystals this morning, it said that it wasn't going to work out, but I just think they're wrong.

[00:02:20]>> [laughter] >> So, like what star sign are you?

[00:02:22]>> Now listen to the output where they prompted the exact same voice to speak the text in an American accent.

[00:02:28]>> [laughter] >> Oh, okay. Okay. So, a man walks into a library and asks for books about paranoia. The librarian whispers, "They are right behind you."

[00:02:39]>> [laughter] [gasps] >> Gets me every time. [laughter] >> That is incredibly impressive. It keeps the identity, but completely shifts the accent. Let's check out some more examples. Here is President Obama speaking Spanish.

[00:03:05]And here he is speaking French.

[00:03:18]It keeps his distinct American English vocal characteristics, but delivers fluent foreign languages. They also have clips in German, Italian, and Swahili.

[00:03:27]Now, look at this voice design feature where you don't even need a reference audio. You just give it a description like "a young woman, breathless discovery," and look at the XML setup for how she should speak. Let's play it.

[00:03:39]>> Oh my god. Oh my god, it is real.

[00:03:42]Oh my god, it is real.

[00:03:44]I thought they were lying. I thought it was just some internet thing, but it is actually here, and it is here, and it is glowing, and I I do not know what to do with my hands right now.

[00:03:55]>> That feels so alive and captures that exact breathless feeling. But, let's check out this example for a terrified whisper. >> Listen to me. Do not turn around. The man in the gray coat has been following us since the bridge.

[00:04:08]>> [panting and sighs] [gasps] >> I need you to walk to the cafe on the corner, order something, and leave through the back.

[00:04:15]I will find you.

[00:04:17]Do you understand?

[00:04:18]Nod if you understand.

[00:04:19]>> Honestly, I'm not particularly impressed by this one. The output just doesn't justify the prompt at all. But, check out this emotional acting sample called rage to vulnerability.

[00:04:30]>> You come into my house, you eat my food, and then you got the nerve to tell me how to run my business. You know what your problem is? You got no respect.

[00:04:36]None. Zero. I built this thing from nothing. Nothing. While you were sitting on your ass doing God knows what. So, don't come in here with that attitude.

[00:04:44]You understand me? >> I thought it was a multi-speaker setup at first, but it is actually a single speaker executing a massive emotional transition beautifully. They also have scene-aware audio where you can add environmental background sounds. Listen to this clip where they added rain and thunder to the prompt.

[00:05:02]>> GET THE LINES! GET THE LINES NOW! SHE IS [screaming] PULLING LOOSE! IF WE LOSE THIS BOAT, WE LOSE EVERYTHING! MOVE!

[00:05:09]>> [screaming] >> I SAID, "MOVE!"

[00:05:12]>> You can hear distinct loud thunder cracks right behind the speaker. They also show samples for kids' voices and long-form narration up to 5 minutes.

[00:05:21]Though, I think if you go over 15 seconds, the emotional delivery might degrade a bit. Let's move over to my notebook now so we can run our own live tests and see how it performs in real-life scenarios. If you go to my GitHub repo for Colab and Kaggle notebooks, you will see Scenema Audio right at the top at the moment. Click the Kaggle button and it opens up this exact interface. Before you do anything, go over to the settings on the top and make sure your accelerator is set to GPU T4 x2, because if this is not selected, the model will not run. Keep in mind that if you have a new Kaggle account, you must verify your identity first or these GPU options will be grayed out.
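Since the notebook will not run without the T4 x2 accelerator, a quick check at the top of a notebook can catch a wrong setting before the long install and download steps. The sketch below is not part of the author's notebook. It assumes `nvidia-smi` is on the PATH (it normally is on GPU sessions), and the helper names are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_names(output):
    """Turn the one-name-per-line output of nvidia-smi into a list."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_gpus():
    """Return the visible GPU names, or an empty list if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_names(result.stdout)


def has_dual_t4(names):
    """True when at least two of the visible GPUs report a T4 in their name."""
    return sum("T4" in name for name in names) >= 2


if __name__ == "__main__":
    gpus = list_gpus()
    if has_dual_t4(gpus):
        print("Accelerator looks right:", gpus)
    else:
        print("Expected two T4 GPUs, found:", gpus)
```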

[00:06:01]Now, I'll hit the run all button. Our notebook is connecting and as soon as this yellow light turns green, our environment is live. There it goes. It is green and it shows our T4 setup info.

[00:06:12]Right now, the dependencies are installing which takes about 2 to 3 minutes. All right, our dependencies are done and now the models are downloading.

[00:06:20]Now that they are all downloaded, the launch cell is running which offloads the text encoder and the rest of the models into the system RAM. Once that finishes, it will give us our public Gradio link. There it is. The public link is live. If you look at the system metrics, it has consumed about 26 GB of our system RAM and the moment we generate audio, our GPU will kick in.

[00:06:41]Let's click the public link and open up the Gradio interface. Here is the UI.

[00:06:46]Right here is our main prompt box, which comes loaded with a default two-speaker prompt. Below that is the optional voice instructions box for custom voice design. If we open this drop-down, you can see options for two separate reference audio inputs for speaker one and speaker two, and you can choose between same as speaker one, two speakers, or none. The seed is set to -1 for random generation, the max duration can be pushed up to five, and the speed pace factor defaults to 1.5.

[00:07:15]Below that are the Seed-VC voice cloning controls, where 25 inference steps is the default, though 50 steps can give better quality. Let's run our first live test using this default prompt right here.
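The controls just described can be summarized as a small settings object. This is an illustrative sketch, not the notebook's actual code: the class and field names are invented, the unit of the max duration is not stated in the video, and the range used for random seeds is an arbitrary choice. Only the values themselves (seed -1 for random, a maximum of 5, pace 1.5, 25 cloning steps) come from the walkthrough.

```python
import random
from dataclasses import dataclass

MAX_DURATION_LIMIT = 5  # the UI control tops out at 5; unit not stated in the video


@dataclass
class GenerationSettings:
    """Hypothetical mirror of the Gradio controls shown in the video."""

    seed: int = -1                             # -1 means "pick a random seed"
    max_duration: float = MAX_DURATION_LIMIT
    pace: float = 1.5                          # speed/pace factor default
    clone_steps: int = 25                      # 50 is slower but may sound better

    def __post_init__(self):
        if not 0 < self.max_duration <= MAX_DURATION_LIMIT:
            raise ValueError("max_duration must be in (0, 5]")
        if self.clone_steps < 1:
            raise ValueError("clone_steps must be at least 1")

    def resolved_seed(self, rng=None):
        """Return the fixed seed, or draw one when the seed is -1."""
        if self.seed != -1:
            return self.seed
        rng = rng or random.Random()
        return rng.randrange(2**32)  # assumed seed range, chosen for illustration


if __name__ == "__main__":
    defaults = GenerationSettings()
    print(defaults, defaults.resolved_seed())
    print(GenerationSettings(seed=42, clone_steps=50).resolved_seed())
```

Logging the resolved seed alongside each output makes it possible to reproduce a good take later instead of relying on a fresh random draw.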

[00:07:28]We're going with two separate speakers and absolutely zero voice cloning reference audio. I'll click generate right now and if we look over at the back end terminal, you can see the status updating in real time. Because this is a diffusion model and not an auto-regressive one, it does run a bit slower. Watch the progress bar right here. It's running the denoising diffusion steps for the first speaker first. Okay, there goes the progress bar for the second speaker. Since we gave it a two-speaker prompt, it runs a dedicated progress bar for each one. All right, it's done. Let's play the output and hear what we got.

[00:08:01]>> Look at this map. It's a perfect circle with the North Pole right in the middle.

[00:08:08]>> That's just a projection, not how the world actually looks from space.

[00:08:12]>> Wow, that is impressive for a pure text prompt. The dialogue between the male and female voices sounds pretty natural.

[00:08:19]Now, let's really push the limits with some expressive voice cloning using a famous voice. I'm uploading this reference sample of Donald Trump, selecting it for speaker one, and turning voice cloning on. I've copied that exact high-intensity angry prompt we listened to earlier from the Obama sample on the blog. Let's hit generate and see how his voice handles this exact intensity. All right, the generation is complete. Let's play it with this angry output.

[00:08:46]>> I got places to be, you stupid son of a [ __ ] What the [ __ ] are you waiting for? A written invitation?

[00:08:52]>> Wow, that is wild. It captures his exact vocal identity perfectly, but completely warps it into an incredibly intense, natural, angry feeling. Now, keep in mind, this is just the first generation we're getting, and the documentation does say voice cloning can slightly alter the emotional delivery. You can get an even better, improved generation if you retry two or three times, but I'm just playing the very first result for you because otherwise the video would get way too long. Let's check out some more emotions. Look at this next prompt, which is configured for a trembling voice. Here is the layout of the description, followed by the action prompt, and finally the text he has to speak. I'll click on generate audio now so we can listen to the output. All right, our generation is done. I really wanted to see how Trump's voice would sound in a complete state of panic.

[00:09:42]Let's listen to this output.

[00:09:45]>> Is is it gone? I swear I heard it right behind me. Don't look over here. Please.

[00:09:51]Oh God, please don't look over here.

[00:09:53]>> Okay, it added a little bit of weird extra noise right at the beginning, but if we regenerate it, we could easily get a cleaner take. Overall, the quality is still decent and pretty good. Tell me in the comments how you found the quality on that one. Next up, I'm providing a prompt for a feeling of deep heartbreak and sorrow in Trump's voice. I will hit generate right now and play you the final output. Okay, this generation is complete. Let's listen to Trump with a sorrowful feeling.

[00:10:19]>> You promised we'd have more time. You promised. How am I supposed to walk out of this room without you?

[00:10:26]>> Hmm, that one could have been a bit better. The sorrowful emotion could have been conveyed slightly more effectively, but a quick regeneration would definitely improve it.
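The advice to regenerate two or three times can be automated by producing several takes with different seeds and then picking the best one by ear. In the sketch below, `generate` is a stand-in for whatever function actually produces audio. Its `(prompt, seed)` signature is an assumption and not the notebook's real API.

```python
import random


def generate_takes(generate, prompt, n_takes=3, base_seed=None):
    """Call generate(prompt, seed) n_takes times, each with a distinct seed.

    Returns a list of (seed, output) pairs so a good take can be reproduced.
    Passing base_seed makes the chosen seeds themselves repeatable.
    """
    rng = random.Random(base_seed)
    seeds = rng.sample(range(2**31), n_takes)  # sample() guarantees no repeats
    return [(seed, generate(prompt, seed)) for seed in seeds]


if __name__ == "__main__":
    # Stand-in generator that just returns a file name for each seed.
    def fake_generate(prompt, seed):
        return f"take-{seed}.wav"

    for seed, path in generate_takes(fake_generate, "sorrowful line", n_takes=3):
        print(seed, path)
```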

[00:10:35]>> Let's keep moving. I've provided another prompt, this time in a villainous style with matching sinister emotions. Let's test this out by clicking on generate audio. Okay, the generation is done.

[00:10:46]Let's listen.

[00:10:46]>> Oh, did you really think you were the mastermind here? Adorable, truly. But the game was over before you even rolled the dice.

[00:10:56]>> That villain type voice actually sounds quite good. It's totally acceptable and feels pretty decent. Let's try an intense whisper next, focusing on suspense and tension. I'll hit generate and let you hear how the output sounds.

[00:11:10]All right, our output has arrived. Let's give it a listen.

[00:11:13]>> Stop moving.

[00:11:15]Don't even breathe.

[00:11:16]If that door opens, we run. Do you understand me? On three.

[00:11:21]>> Okay, it definitely lacks that true whisper quality. Just like we noticed in their official demos, the whisper settings aren't very effective. The model struggles to accurately capture that specific emotion during voice cloning because it attempts to clone the whisper mechanics and convert it to match our target character, which ultimately kills that distinct whispering feel. If we tried this exact prompt without voice cloning, we would probably get a much better result. Next, I want to see if it can perform voice cloning to generate a singing voice. Can it actually do it? And if so, how well?

[00:11:54]I'm giving it a prompt with a melodic voice description, an action prompt, and some song lyrics. Let's generate this and see what kind of singing voice it creates for Trump. Okay, our generation is complete. Let's listen to Trump singing through this model.

[00:12:11]>> Shadows dance upon the wall. The summer fades tonight, but I will stand and never fall beneath the fading light.

[00:12:30]>> Wow, that output was quite interesting.

[00:12:33]You get a singing voice that sounds exactly like an amateur singer. The voice cloning quality itself is excellent. It is unmistakably Trump's voice doing the singing, but the actual musical delivery is quite amateurish.

[00:12:46]That's definitely something to note.

[00:12:47]Let's try one more singing generation with a different prompt to see if the results improve slightly. I've altered the lyrics and the instructions just a bit. Let's run it and play the result.

[00:12:58]Okay, listen to the second singing output.

[00:13:01]>> Midnight rain against the pane.

[00:13:07]The music starts to fade, but memories of you remain in every shadow made.

[00:13:25]>> Yeah, as you heard, this is also quite similar to the first result, still sounding like an amateur singer.

[00:13:31]However, the voice cloning quality remains very good, and it gives the genuine feeling of the person singing.

[00:13:37]We got a solid 25-second output from that prompt. Now, let's try pushing some extreme emotions, like talking with extreme laughter. Let's check how the result turns out. I'll show you the output as soon as it generates. Okay, the generation is ready. Give this output a listen.

[00:13:54]>> You thought you could stop it? Hahaha, that is rich.

[00:13:59]>> [laughter and gasps] >> That is the funniest thing I've ever heard.

[00:14:04]>> Wow, that was a very impressive generation. You can get an even better, more improved generation if you just retry it a couple of times. Next, I gave this prompt, choked with tears, heavy sniffling. I'll make a slight change to the prompt here and explicitly set the gender tag to male. Let's generate it now and let you hear the output. All right, our generation is done. Listen to this.

[00:14:29]>> I tried. I tried so hard to fix it. Why wasn't it enough?

[00:14:35]Please, just tell me why it wasn't enough.

[00:14:38]>> Okay, so that was a highly impressive output in Trump's voice. It conveys a very sad feeling, and the generation we received is truly impressive. Let's do one or two more tests in Trump's voice before we move on to voice design. The next prompt is for rage. We want a raw emotion of anger here. Let's generate this, and then I'll play the output for you. All right, our generation is ready.

[00:15:02]Let's listen to the output.

[00:15:03]>> I told you what would happen if you cross that line. You ruined everything.

[00:15:09]Everything.

[00:15:13]>> Okay, this one was also quite impressive. It was a very angry voice and felt highly natural, like a genuine fit of anger. For the next one, I used this crazy prompt: "Shrill, bloodcurdling scream shifting instantly into a frantic, ragged whimpering." I wanted to test some absolute extreme emotions here. Let's generate this, and whatever output comes out, I'll play it for you. Okay, our output has arrived.

[00:15:39]Give this a listen.

[00:15:40]>> Get away from me. No, no, no. Don't touch me. Help. Somebody help me.

[00:15:46]>> Okay, so this could have been better.

[00:15:48]Perhaps if we regenerate it, we could get a cleaner output. Let's do one final try in Trump's voice, and after that, we'll move to voice design. This next prompt involves emotions of crying tears of happiness, which is a super complex, mixed emotion. Let's see what kind of output we get for this. Okay, our output is here. Listen to this.

[00:16:08]>> We did it. Oh my god, we actually did it.

[00:16:12]>> [laughter] >> I can't believe it's finally real. We won.

[00:16:19]>> [laughter] >> Okay, as you heard, this last generation we got was quite impressive, and in my opinion, it was exceptional. The voice cloning of Trump is fantastic. That being said, voice cloning does slightly affect the overall emotions, as they have also stated themselves. So, let's remove voice cloning now and set it to none. I will now generate some voice design style or emotional voices without voice cloning to show you the raw quality we get. I am giving a prompt to design the voice of a cyberpunk smuggler. It includes some text and all the necessary scene descriptions. Let's generate this and hear what kind of voice we get. Here, you can see the scene description was provided, which says "Distant sirens, neon sign buzzing." Now, listen to the output.

[00:17:11]>> [screaming] >> Look, I don't care who you work for. The The is clean, the coordinates are locked.

[00:17:18]And we leave in 2 minutes. You're either in the passenger seat or you're left behind. Decide now. >> It actually starts with a distant siren. Overall, the quality seems fine to me. Let me know what you think in the comments.

[00:17:34]Specifically regarding the prompt we provided with the siren description, you can clearly hear that it started the audio with a siren, which is quite impressive. Let's try some more voice designs. As you can see, this is the next prompt. I wanted to generate a voice resembling an ancient female scholar. Let's see what kind of voice we're going to get. Okay, our voice has been generated. Listen to this.

[00:17:55]>> You have walked a long and broken road to find this place.

[00:18:00]Do not fear the silence.

[00:18:03]The answers you seek have been waiting here since the world was young.

[00:18:10]>> This is a very impressive voice for an ancient female scholar. Now, let's test the voice design quality for a two-speaker setup and see how it is.

[00:18:19]Look here, I have provided the prompt.

[00:18:21]Speaker one has its own separate description and speaker two has its own.

[00:18:25]We have two characters here, the tech chief and the Imperial Commander. Let's see what kind of output we get. Okay, our output has arrived. It is 28 seconds long and I can see a noticeable gap in the middle. Let's listen and see what kind of output we got.

[00:18:40]>> Did you honestly believe our security network wouldn't notice a crude system breach like that?

[00:18:48]You are out of your depth.

[00:18:57]>> Wait. Wait, don't press that button. I wasn't stealing it, okay? I was I was trying to patch a vulnerability. Just listen to me for 1 second.

[00:19:07]>> All right. The two voices are quite similar. Both of those voices sound female, even though I clearly specified a male commander for the second speaker.

[00:19:16]Let's hit regenerate right now and see if a new seed fixes it. All right, the second generation is ready. Let's listen.

[00:19:26]>> Did you honestly believe our security network wouldn't notice a crude system breach like that?

[00:19:35]You are out of your depth.

[00:19:38]Wait. Wait, don't press that button.

[00:19:41]I wasn't stealing it, okay? I was I was trying to patch a vulnerability.

[00:19:47]Just listen to me for 1 second.

[00:19:49]>> Yeah, it still sounds way too much like a female voice instead of a male one. It seems like we might need to modify our descriptive text tags to make the gender distinction way stronger. Let's do one final multi-speaker test right here. We have another prompt, and again, it's a two-speaker scene. This time, we'll reduce the text gap a bit and see what happens. We'll generate it from here.

[00:20:12]Since this is also a two-speaker prompt, let's find out what kind of output we get. Our first speaker is male and the second is non-binary. Let's see what output we get. Okay, our generation is done. Let's listen to this two-speaker output and see how it is.

[00:20:27]>> This is the third time your name popped up on a security feed this week. My patience ran out yesterday. Start talking before I make things difficult.

[00:20:36]Hey. Hey. Come on, let's keep it friendly. I'm just a middleman. A person's got to make a living in this city, right? I don't know anything about the vault. Honest.

[00:20:47]>> All right, the output is fine and acceptable, but again, the two voices are quite similar. Just like the previous two-speaker generation where you might have noticed a heavy similarity, it's present here. The voices should be a bit more distinct from one another. I think there might be a slight issue with my prompting.

[00:21:04]Anyway, you can do further testing yourself since you can run this completely free on Kaggle and test it however you want. I wanted to test its multilingual capabilities and try out even more character emotions, but this video is running incredibly long.

[00:21:17]Overall, my testing shows that in some scenarios, this model is absolutely mind-blowing, while in others, you definitely hit some multi-speaker limitations. You can run all of these tests yourself completely free because I've added this Scenema Audio notebook right to the top of my GitHub repository alongside dozens of my other custom AI notebooks. I'll leave the direct link right down in the description so you can jump in and start experimenting. Test out the different language combinations, play with the XML action tags, and drop a comment letting me know what kind of results you get. If you enjoyed this live deep dive and appreciate the effort that went into building this notebook, make sure to smash that like button, leave a comment, and subscribe to the channel. I will see you all in the next video.

#scenema audio #ltx 2.3 audio #ai voice generator #text to speech #emotional ai voice
