This tutorial demonstrates how to build a custom LTX 2.3 audio-controlled workflow in ComfyUI that injects real audio into the latent space during sampling, enabling precise lip-sync for talking and singing characters. The workflow processes audio through trimming, vocal separation using MelBand RoFormer, and VAE encoding with a zero mask to preserve audio information. Key optimization tips include using the distill LoRA version 1.1 instead of 1.0, setting 15 steps instead of 8 for better facial quality, using the BasicScheduler node for automatic sigma generation, and ensuring video shape and resolution meet LTX model recommendations. Troubleshooting involves adjusting video shape, resolution, step count, and input audio quality to achieve accurate lip-sync results.
Deep Dive
LTX 2.3 Custom Audio Workflow: Perfect Lip Sync + Pro Optimization Tips!
Do you want to build a workflow that gives you full control over the output audio of the LTX 2.3 model? Or have you tried every possible method but still can't get your character's lip movements to sync properly with the music? If the answer is yes, then this video is exactly what you're looking for. Hello everyone, I'm Sotai. Welcome back to my channel. After I released my first video tutorial on the LTX workflow, I was really happy to see so much interest and support from you guys. And while going through the comments, I noticed one question that came up again and again.
How can we customize and control the audio generated for AI videos exactly the way we want, especially when moving into more challenging use cases like creating lip-sync videos or even making music videos? And honestly, for those kinds of tasks, if you only use the LTX image-to-video workflow from the previous video, it's going to be very hard to get the result you actually want because the audio is only described through a prompt. So, in today's video, I'm going to walk you through step by step how to upgrade and build an LTX workflow with fully controllable audio integration.
And that's not all. I'll also give you a few bonus tips and tricks to make this workflow more stable, smoother, and capable of producing the highest quality results possible. All right, as usual, before we get started, smash that like button for me. Subscribe if you haven't already, and let's get straight into it.
All right, first let's start with the environment setup. The good news is, just like the LTX workflow from the previous video, with this workflow, we don't need to install any complicated custom nodes. Basically, there are only two main things you need to do. First, update ComfyUI to the latest version to make sure the default nodes and all related components work smoothly and reliably with LTX 2.3. Next, you need to download the required model files. Once the downloads are complete, make sure you place each file into the correct corresponding folder inside ComfyUI.
This part is extremely important because if the models are placed in the wrong location, once you open the workflow, ComfyUI may show missing file errors or fail to load certain nodes properly. And of course, to save you from having to guess the file paths or waste time digging through every folder, I've also documented the entire setup process in detail in the corresponding Patreon post. All you need to do is open that post and follow the steps one by one.
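Before opening the workflow, a quick script can confirm that each model file sits where ComfyUI will look for it. This is only a minimal sketch: the subfolder names follow ComfyUI's usual `models/` layout, but every file name below is a hypothetical placeholder, since the real names are only listed in the setup post and not in this video.

```python
from pathlib import Path

# Assumption: folder names follow ComfyUI's usual models/ subfolders.
# The file names are placeholders, NOT the real LTX 2.3 file names.
EXPECTED_FILES = {
    "checkpoints": ["ltx_model_placeholder.safetensors"],
    "loras": ["ltx_distill_lora_placeholder.safetensors"],
    "text_encoders": ["text_encoder_placeholder.safetensors"],
}


def find_missing_models(comfy_root, expected=EXPECTED_FILES):
    """Return 'subfolder/name' entries for expected files that are not on disk."""
    models_dir = Path(comfy_root) / "models"
    missing = []
    for subfolder, names in expected.items():
        for name in names:
            if not (models_dir / subfolder / name).is_file():
                missing.append(f"{subfolder}/{name}")
    return missing
```

Calling `find_missing_models` with the path to your own ComfyUI folder and a dictionary of the real file names returns an empty list when everything is in place.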
Once the environment is ready, ComfyUI is updated and all the required model files are placed in the correct locations, we can now move on to the most important part, building the workflow and seeing exactly how this entire system works. All right, now that the environment is ready, let's jump into the workflow. Here you can absolutely use the official workflow version that's already available inside ComfyUI. The process is super simple.
Just click on template, then search for the keyword LTX 2.3. Right away, you'll see the LTX related workflows showing up here. From this list, look for the workflow called image audio to video.
Click on it and the entire official workflow will be loaded directly into your ComfyUI workspace. Basically, this workflow already includes all the essential components needed to generate a video from an image combined with audio. However, similar to the image-to-video workflow from the previous video, in my opinion, the default version is still a bit messy and can be confusing for beginners. More importantly, from what I've tested, the results are not fully optimized yet. So instead of following the official workflow exactly as it is, in this video I'll show you how to optimize the image audio to video workflow using the image-to-video workflow I introduced in the previous video as the foundation.
All right. And right here we have the LTX 2.3 image-to-video workflow that I walked you through in the previous video. As you can see, this workflow is divided into four stages along with a few small adjustments to make the video generation process more stable. In today's video, I won't go through all of that again. Instead, we're going to use this workflow as the foundation and build on top of it. So, if you don't already know how to set up this image-to-video workflow, feel free to pause the video for a moment, click the suggested video on the screen, or check the link in the description, watch the previous tutorial first, and then come back to this video to continue.
All right, now let's get to the main point. To create a video where the character can talk, sing, or move according to the audio you want, the most important thing is that we need to feed that audio information directly into the LTX 2.3 sampling process. It may sound complicated, but the idea is actually pretty simple. In stage two, if you look at the old workflow, we were saving a component called the init audio latent using an empty latent audio node and then passing it into stage three. That means during sampling the audio latent is basically just pure noise and the model mainly relies on the prompt and the image latent to generate the video.
So what we need to do in this new workflow is create the init audio latent from the latent audio encoded from the actual audio file we want to use, instead of using an empty latent audio node. In other words, instead of letting the model imagine the audio from the prompt, we're going to inject real audio information directly into the latent space. This gives the model much clearer data when generating motion, especially for tasks like lip sync, singing characters, or music video generation.
Specifically, here's how I'm going to set it up. First, I'll load the audio file and save it in stage one, similar to how we saved the prompt and the initial parameter values. Then in stage two, we'll retrieve that saved audio and start processing it. The first step is to trim the audio to match the exact duration of the video we want to generate. This is a very important step because if the audio is longer or shorter than the video, the lip sync or music-driven motion can easily go out of rhythm. Next, I'll use the MelBand RoFormer model to separate the vocals from the audio. This step is especially useful if your audio file contains a lot of background noise or if you're working with music that includes both instrumental beats and vocals. Once the vocals are separated more clearly, the model gets a much stronger signal to understand the speech or singing part, which can help improve mouth movement generation. After that, we'll use the audio VAE model that was loaded in stage one to encode this audio segment into latent audio. This latent audio will then be combined with the latent image to create the init latent for the sampling process. However, before saving this latent, there is one very important detail you need to pay attention to.
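The trimming step above comes down to simple arithmetic: the video's frame count and frame rate fix its duration, and the audio has to cover exactly that many samples. Here is a minimal sketch in plain Python, outside ComfyUI. The function names are hypothetical, and padding a short clip with silence is my own assumption, not something the workflow nodes are confirmed to do.

```python
def target_sample_count(num_frames, fps, sample_rate):
    """Number of audio samples that span num_frames of video at the given fps."""
    return round(num_frames / fps * sample_rate)


def fit_audio_to_video(samples, num_frames, fps, sample_rate, start_sec=0.0):
    """Cut a mono sample list so it spans exactly the video duration.

    If the source runs out early, the remainder is padded with silence
    (an assumption made for this sketch).
    """
    start = round(start_sec * sample_rate)
    wanted = target_sample_count(num_frames, fps, sample_rate)
    clip = list(samples[start:start + wanted])
    # Pad with zeros when fewer than `wanted` samples were available.
    clip.extend([0.0] * (wanted - len(clip)))
    return clip
```

The `start_sec` argument is there so you can pick which part of a longer song or recording lines up with the clip.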
We're going to use an additional mask where all values are set to zero. So, why does this mask have a value of zero? And why does its shape cover the entire video? The reason is that our goal is to preserve the encoded audio latent, making sure this audio latent does not get changed during sampling. In simple terms, a zero-value mask helps keep the audio information we injected intact so the final video can follow the original audio more closely. Once we have the complete init latent, this latent will continue through stage three and stage four just like the image-to-video workflow from the previous video. However, in stage three, there is one small change.
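To make the zero-mask idea concrete, here is a conceptual sketch of how a mask typically gates what sampling is allowed to change. It is not ComfyUI's internal code: the blend formula is the common masked-sampling convention, and the values are made-up stand-ins for real latents.

```python
def apply_noise_mask(original, proposed, mask):
    """Blend per element: mask=1 takes the sampler's value, mask=0 keeps the original."""
    return [m * p + (1.0 - m) * o for o, p, m in zip(original, proposed, mask)]


encoded_audio = [0.25, -0.5, 0.75]      # stand-in for the encoded audio latent
sampler_output = [9.0, 9.0, 9.0]        # stand-in for what sampling would write
zero_mask = [0.0] * len(encoded_audio)  # zeros across the whole clip

# With an all-zero mask, every element falls back to the encoded audio value.
preserved = apply_noise_mask(encoded_audio, sampler_output, zero_mask)
```

With a mask of all ones the same function would return the sampler's values instead, which is why the mask has to be zero over the full length of the video.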
Instead of using manual sigma like in the old workflow, I'll replace it with sigma generated automatically from the BasicScheduler node. As you can see here, the reason I'm using BasicScheduler is because it makes things much more convenient for an optimization trick that I'll share in the next part of the video. And with just a few small adjustments like this, you can already transform the original image-to-video workflow into a custom audio LTX 2.3 workflow where you can inject your own desired audio, gain better control over the character's motion, and create videos that are much more suitable for tasks like lip-sync, singing characters, or music videos. Let me give you a quick demo so you can see the real-world power of this workflow. Here I've already prepared my character image and a short audio clip.
>> And this is the result.
>> Yeah, as you can see, the result is honestly amazing, right? My character is lip syncing extremely accurately to each syllable of the input audio. The mouth movement matches the rhythm of the sound perfectly with no noticeable out-of-sync feeling at all. On top of that, the smaller motions like facial expressions and subtle character movements also look very natural. And especially thanks to stage four with the upscaling process, the final video comes out incredibly sharp with details on the character's skin and hair preserved beautifully. Everything runs very smoothly, which shows that our workflow is operating in a really stable way. But hold on. I know that when you start trying this yourself, you may run into a few visual artifacts or the results might not look exactly the way you want on the first few attempts. So, before you jump in and test this workflow on your own, take a few minutes to go through some of the tips and tricks I've learned along the way.
Hopefully, these will help you master the workflow more effectively and create the highest quality videos possible. All right. Now, I'm going to share a few important tricks to optimize this workflow compared to the original ComfyUI core workflow. These adjustments are not too complicated, but based on my testing experience, they have a pretty big impact on the final video quality, especially when it comes to the face, lip movement, and how well the character actually follows the input audio.
First, let's talk about the model and LoRA. As I mentioned in the previous video, we now have LTX 2.3 distill LoRA version 1.1. However, in the original ComfyUI workflow, this part is still using version 1.0. So, the first thing I recommend is to replace distill LoRA version 1.0 with version 1.1. The reason is that in the previous video I tested and compared these two LoRA versions both in terms of image quality and audio handling, and overall version 1.1 gives more stable results, especially in facial details and overall motion.
Also, there's one very important note. Not every LoRA that makes the face look better is suitable for a custom audio workflow. For example, some LoRAs are designed to improve facial quality, making the image look sharper and more polished in a normal image-to-video workflow. But once you move into an audio-conditioned workflow, those same LoRAs can actually hurt the lip sync performance. To put it simply, when you add too much intervention to the face area, the model may generate a better looking face, but the mouth movement may no longer follow the audio accurately.
So, for this custom audio workflow, I recommend that you avoid adding extra face enhancement LoRAs, at least in the beginning when your main goal is to test the character's lip sync ability. Keep the workflow as simple as possible first and only after that, if needed, experiment with additional LoRAs one by one. The second thing I want to talk about is the number of steps. In the original ComfyUI core workflow in stage two, the number of steps is set to only eight steps. As I explained in the previous video, this number does help the render run faster, but it's usually not enough for the model to handle more complex details properly, especially around the face area. As a result, the face can look soft, lack sharpness, or sometimes the lip movement may not be truly stable.
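Since the sigmas now come from a scheduler node instead of hand-entered values, the step count is the only number that has to change between tests. The sketch below shows that idea with a plain linear spacing. This is a stand-in I chose for illustration, not the curve the BasicScheduler node actually produces, and `linear_sigmas` is a hypothetical helper name.

```python
def linear_sigmas(steps, sigma_max=1.0, sigma_min=0.0):
    """Return steps + 1 noise levels, evenly spaced from sigma_max down to sigma_min."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    span = sigma_max - sigma_min
    return [sigma_max - span * i / steps for i in range(steps + 1)]


# Changing only the step count regenerates the whole schedule.
schedules = {steps: linear_sigmas(steps) for steps in (8, 15, 30)}
```

Each schedule has one more entry than its step count, because every step moves from one noise level to the next.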
After testing this many times, I found that around 15 steps gives a much better balance. It still keeps the render speed reasonably fast while noticeably improving facial quality and lip sync stability. That's also why in the workflow section earlier, I replaced the manual sigma with sigma generated automatically through the BasicScheduler node. When using BasicScheduler, you only need to change the number of steps and the node will automatically recalculate the corresponding sigma range. This is much more convenient than entering sigma values manually, especially when you want to quickly test different settings like 8 steps, 15 steps, or 30 steps. So, if you want a relatively stable configuration to start with, I recommend trying 15 steps in stage two first.
The final issue, and this is also something many of you may run into, is that sometimes you run the workflow several times, but the character still doesn't lip-sync or the mouth only moves very slightly and barely follows the audio at all. For example, like the result I'm showing on the screen right here.
>> First of all, you need to understand that this issue partly comes from the current limitations of the LTX 2.3 model. So, the tips I'm about to share won't guarantee a 100% fix in every single case, but they can definitely help you increase the success rate quite a lot. The first thing you should check is the video shape. Make sure the width, height, and number of frames are all within the recommended settings for the LTX model, as I'm showing on the screen.
This is a very easy detail to overlook because many people tend to enter the width, height, or frame count randomly based on their own needs. However, with LTX, using the recommended shape helps the model run much more stably, which can increase the chance of the character lip syncing accurately to the audio.
From my own experience, when using the recommended shapes, the success rate for good results is noticeably higher compared to entering arbitrary settings.
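The recommended values shown on screen are not captured in this transcript, so the following is only a sketch under an assumption: that the shape rules are the ones commonly documented for LTX video models, with width and height divisible by 32 and a frame count of the form 8n + 1. The helper names are hypothetical. Check the numbers against the official LTX recommendations before relying on them.

```python
def snap_to_multiple(value, multiple=32):
    """Round value to the nearest positive multiple of `multiple`."""
    return max(multiple, round(value / multiple) * multiple)


def snap_frame_count(frames, block=8):
    """Round frames to the nearest count of the form block * n + 1 (n >= 1)."""
    n = max(1, round((frames - 1) / block))
    return block * n + 1


def snap_shape(width, height, frames):
    """Return (width, height, frames) adjusted to the assumed LTX shape rules."""
    return snap_to_multiple(width), snap_to_multiple(height), snap_frame_count(frames)
```

A shape that already satisfies the rules passes through unchanged, so it is safe to run every set of values through `snap_shape` before entering them in the workflow.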
Next, if the shape is already correct, but the result still isn't good enough, you should try changing the video resolution, specifically the width and height values. For example, with the same audio clip and the same character image, switching to another resolution that still fits LTX's constraints can sometimes help the model pick up the audio signal better and produce clearer mouth movement. This can feel a little bit like trial and error, but based on my testing, it really does make a difference. Of course, the important thing is that no matter how you change the resolution, you still need to respect the shape constraints we mentioned earlier.
If you've already checked the shape and changed the resolution, but the character still doesn't lip-sync the way you want, the next thing you can try is increasing the number of steps to 30. When you increase the steps, the model has more iterations to refine the video, which may help it follow the audio condition better and generate more accurate movement. Of course, the trade-off is that rendering will take longer. So, I don't recommend using 30 steps by default every single time. Start with 15 steps first and only increase it to 30 when you're dealing with a difficult case or when the character doesn't respond clearly to the audio.
And finally, if you've tried all the methods above and the character still doesn't lip sync, then the method I recommend is to adjust the input audio. You can cut the audio shorter, clean it up, choose a clearer vocal section, or try another audio clip with similar content. In my personal opinion, this is one of the most effective approaches. The reason is that not every audio clip provides a strong signal for the model. Audio with vocals that are too buried, background music that is too loud, too much noise, or unclear speech rhythm can make it difficult for the model to understand what kind of mouth movement it should generate.
So sometimes just switching to a clearer, shorter clip or one with more prominent vocals can improve the result dramatically. And as you can see in this example, after applying the steps above, my character is now able to lip sync accurately to the input audio.
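If you want to keep track of what you have already tried, the order of these tips can be written down as a simple list of settings. This is only a sketch: the function and its argument names are hypothetical, it does not talk to ComfyUI, and going back to 15 steps when switching audio is my own assumption.

```python
def troubleshooting_attempts(base_shape, alternate_sizes, audio_clips):
    """Yield settings dicts in the order the tips above suggest trying them."""
    width, height, frames = base_shape
    main_audio = audio_clips[0]

    def settings(w, h, steps, audio):
        return {"width": w, "height": h, "frames": frames, "steps": steps, "audio": audio}

    # 1. Recommended shape with the 15-step starting point.
    yield settings(width, height, 15, main_audio)
    # 2. Same audio and steps at other resolutions that still fit the shape rules.
    for w, h in alternate_sizes:
        yield settings(w, h, 15, main_audio)
    # 3. Difficult cases only: raise the step count to 30.
    yield settings(width, height, 30, main_audio)
    # 4. Last resort: a shorter or cleaner clip (15 steps here is an assumption).
    for audio in audio_clips[1:]:
        yield settings(width, height, 15, audio)
```

Stop at the first attempt that gives you a good lip sync; the later entries cost more render time or more manual work on the audio.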
>> All right, and that's everything for today's video. I've walked you through in detail how to build a complete custom audio workflow using the LTX 2.3 model along with practical real-world tips to maximize the quality of your final output. I hope you watch the full video so you can catch all the tricks I shared throughout this tutorial. And if you found this video helpful, please hit the like button and share it so more people can discover my channel. Thank you so much for watching. That's it for now.
Goodbye and I'll see you in the next