By using a signal of intent (the request for a pre-signed upload URL) to warm up compute before the work arrives, this strategy avoids cold starts while keeping most of the cost benefits of on-demand infrastructure.
Deep Dive
I thought this feature would be easy to deploy... I was wrong.
One of the main reasons I fell in love with software development is that I love to solve problems. One such problem that I ran into recently involved the audio processing pipeline for Kiru, which, if you're unaware, is a next-gen video editor that I'm building in Rust. Kiru's audio pipeline is responsible for taking the uploaded audio file whenever a user goes to edit a new video and performing a number of different processing steps in order to detect silences, transcribe the audio into words, and ultimately run the detection algorithm that removes any bad takes, silences, and other unwanted segments.
The pipeline itself has a number of different processing steps in order to achieve this. However, there's one key stage that's rather important to the overall accuracy, both when it comes to the detection algorithm and when it comes to the user performing transcription-based editing on the desktop client. This step is called the aligner and is responsible for ensuring that the words of a transcript have the correct timestamps in the audio file, which it achieves through waveform analysis. Whilst this approach is incredibly accurate, it unfortunately comes at a bit of a cost, specifically in CPU usage and processing time. How much time? Well, for every minute of audio, processing takes roughly 20 seconds. This means that for a 30-minute video, which is a very common length when it comes to Kiru, this one processing step will take 10 minutes. Whilst I personally think this is acceptable given the level of accuracy that this step achieves, from a product and user experience perspective, it's unfortunately just far too slow. And therefore, in order to productize this stage, I needed to implement a solution that reduced this time down to something far more feasible whilst still retaining the same level of accuracy.
To achieve this, I decided to use an approach that I think most developers would, which was to split the audio file into smaller chunks that could then be processed in parallel. This is a pretty typical approach when it comes to solving this class of problem. And therefore, I thought it would be pretty simple to implement it in this case as well.
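As a rough illustration of the split, process, and merge idea, here is a minimal Python sketch. It is not Kiru's implementation (which is written in Rust and splits in areas of silence): the fixed 2-minute windows, the `Word` type, and the `align_chunk` stand-in are all assumptions made for the example. The one detail worth noticing is the merge step, which shifts each chunk's relative timestamps back onto the timeline of the full file.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_into_chunks(duration: float, chunk_len: float = 120.0) -> list[tuple[float, float]]:
    """Return (start, end) windows that cover the whole file (fixed size, an assumption)."""
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        chunks.append((start, end))
        start = end
    return chunks


def align_chunk(chunk: tuple[float, float]) -> list[Word]:
    """Hypothetical stand-in for the aligner: one fake word, timed relative to the chunk."""
    start, _end = chunk
    return [Word(text=f"word@{start:.0f}", start=0.5, end=1.0)]


def align_in_parallel(duration: float, workers: int = 4) -> list[Word]:
    chunks = split_into_chunks(duration)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks
        per_chunk = list(pool.map(align_chunk, chunks))
    merged = []
    for (chunk_start, _), words in zip(chunks, per_chunk):
        for w in words:
            # shift chunk-relative timestamps onto the full file's timeline
            merged.append(Word(w.text, w.start + chunk_start, w.end + chunk_start))
    return merged
```

Threads are used here only to keep the sketch short; a CPU-heavy aligner would run in separate worker processes or machines, as described in the rest of the video.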
However, as it turned out, this ended up not being true, and there were quite a few challenges, specifically when it came to deployment. Of course, I didn't encounter these issues initially, as in order to ensure that the approach would actually work, I had decided to test with a local version. So, I went and built a small prototype consisting of three different processes: one to split the audio file into chunks, another to merge those chunks back together, and a third that sat in the middle, which was the actual worker system that processed these chunks from the job queue concurrently. In production, this job queue would be powered by something like SQS. But when it came to local testing, I went ahead and used Postgres, which, yes, can be used as a message queue with a bit of SQL. Not only was this much simpler to get set up with, but by using Postgres, I could also track the relevant state for each job, such as the individual chunks of the audio file, their completion status, and the alignment results.
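The SQL shown on screen is not part of this transcript, so the following is only a sketch of one common way to use Postgres as a queue: claiming a row with FOR UPDATE SKIP LOCKED so that concurrent workers never pick up the same job. The table name `audio_chunks` and its columns are hypothetical, and `conn` is assumed to be any DB-API 2.0 connection to Postgres.

```python
# Hypothetical schema: audio_chunks(id, job_id, chunk_index, status)
CLAIM_NEXT_CHUNK_SQL = """
UPDATE audio_chunks
SET status = 'processing'
WHERE id = (
    SELECT id
    FROM audio_chunks
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED  -- rows locked by other workers are skipped, not waited on
)
RETURNING id, job_id, chunk_index;
"""


def claim_next_chunk(conn):
    """Claim one pending chunk, or return None when the queue is empty.

    `conn` is assumed to be a DB-API 2.0 connection to Postgres.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_NEXT_CHUNK_SQL)
        row = cur.fetchone()
        conn.commit()  # committing makes the 'processing' status visible to other workers
        return row
    finally:
        cur.close()
```

Because the claim and the status change happen in one statement, the same table can double as the record of each chunk's completion status.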
Because I wanted to be able to test the concurrency of this system, I decided to run it on my home lab across multiple nodes. To do this, I used my Postgres provider, Neon, which allowed me to have a remote database that each worker could connect to.
By the way, Neon is the sponsor of today's video, so I'm going to be talking about some of the benefits and features that they provide. One such benefit is their fantastic branching feature, which allows me to fork my production database instantly into a new branch that I can then use for debugging or testing migrations without the risk of breaking my production data. This is a feature that I am constantly using whenever I'm working on a change that modifies my database schema, such as the one in this video. Additionally, thanks to their data anonymization feature, I can use these branches for debugging with an LLM without compromising any of my users' data. This came in handy when debugging my chunking implementation, which I definitely made some mistakes on, and which made heavy use of Postgres to manage its state.
Using Postgres also allowed me to easily track how long the algorithm took to process an entire file, which ended up being dependent on how many workers I had available to process each chunk, with each worker spending around 20 seconds per 2-minute clip. This meant that the more concurrent workers I had running, the faster the audio processing would be.
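This was not only faster end to end, meaning the total time to process the entire audio file, but it also happened to be faster per minute of actual audio (around 10 seconds per minute of audio in 2-minute chunks, compared with around 20 seconds when processing the whole file in one go), basically signaling that the alignment algorithm preferred much smaller chunks.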
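To make the numbers concrete, here is a small back-of-the-envelope calculation using the figures quoted so far: roughly 20 seconds per minute of audio when the file is processed whole, and roughly 20 seconds per 2-minute chunk when it is split. The function names are made up for illustration, and the model ignores splitting, merging, and queueing overhead.

```python
import math


def whole_file_seconds(duration_s: float, seconds_per_audio_minute: float = 20.0) -> float:
    """Time to align the file in one go, at ~20 s per minute of audio."""
    return duration_s / 60.0 * seconds_per_audio_minute


def chunked_seconds(duration_s: float, workers: int,
                    chunk_s: float = 120.0, seconds_per_chunk: float = 20.0) -> float:
    """Time to align the file in 2-minute chunks spread across `workers`."""
    chunks = math.ceil(duration_s / chunk_s)
    waves = math.ceil(chunks / workers)  # each worker handles one chunk per wave
    return waves * seconds_per_chunk


thirty_minutes = 30 * 60
print(whole_file_seconds(thirty_minutes))           # 600.0 (10 minutes)
print(chunked_seconds(thirty_minutes, workers=1))   # 300.0
print(chunked_seconds(thirty_minutes, workers=4))   # 80.0
print(chunked_seconds(thirty_minutes, workers=15))  # 20.0
```

With one worker per chunk, the whole job takes about as long as a single chunk, which is why the number of available workers matters so much.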
Because of this, I felt pretty confident that I could get my processing time down to something much more reasonable when it came to production, in order to give a better user experience. And so I went ahead and began planning how to actually implement this in my production deployment. To do so, I first removed the Neon branch that I had created for testing, followed by applying the SQL migrations to the dev branch of my database. In my case, I like to manage these migrations using SQLx, which is my personal favorite when it comes to working with Rust. Once the migrations had been applied to dev, I then made sure everything looked correct by using the Neon console, which provides the ability to compare the schema of any child branch against its parent.
Another feature that came in handy is Neon's fantastic point-in-time restore, which allows you to restore your database to any point in the past, which is incredibly useful in the event you happen to make a mistake.
Oops. By the way, if you yourself are interested in using Neon for your own Postgres needs, then you can sign up for a free account with rather generous limits by using my link in the description down below. A big thank you to Neon for sponsoring this video.
Okay, so having proved locally that my algorithm could process audio files concurrently, I began moving forward with implementing a production solution, which meant I needed to choose a deployment platform and, more importantly, a deployment strategy. When it comes to deployment strategies for cloud infrastructure, there have typically been two main approaches I've encountered in the past. The first is called always-on capacity, which is where you have a pool of compute resources constantly running, ready to handle any incoming workloads at a given time. Once the pool's utilization exceeds a specific threshold, additional capacity is provisioned in order to handle the increased demand. Always-on capacity is typically used in more traditional applications such as web servers, APIs, databases, and other long-running services that need to remain consistently available in order to handle requests at any given moment.
Whilst always-on capacity is useful for these types of applications, it's not so great when your workload happens to be more bursty. And so that's where a second deployment strategy comes in, called on-demand capacity. This is where infrastructure is provisioned reactively in response to incoming demand rather than being kept running all the time. The benefit of doing this is that it can dramatically reduce infrastructure costs, as you're only provisioning resources when they're actually needed. This, however, does come at the expense of increased latency, due to resources needing to be provisioned on demand. This just-in-time provisioning is much better suited to bursty, asynchronous, or highly variable workloads, where demand is intermittent and infrastructure does not need to remain continuously available. This is exactly the kind of workload produced by the aligner stage of my processing pipeline, especially given that uploads can occur at any time throughout the day, but people aren't continuously uploading.
Therefore, when it came to the production infrastructure for this processing step, I decided to go with the most popular on-demand compute platform, AWS Lambda. The way this implementation worked was actually rather simple. Basically, I had an AWS SQS queue that my audio chunk jobs would be published onto, which would be consumed by a fleet of concurrent Lambda workers, each reading the job specification from my Neon Postgres database and then pulling down its individual audio chunk from AWS S3. This meant that the chunks could be processed in parallel, one per Lambda worker. And due to how AWS scaling works, it would effectively scale up to as many chunks as I needed to be processed at any given moment, up to the configured concurrency limit, which I think by default is something like 1,000.
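For readers who haven't wired SQS to Lambda before, the worker side can be sketched roughly as follows. When SQS triggers a Lambda function, the event contains a list of `Records`, each with the message text in `body`. Everything else here is an assumption for illustration: the message fields `job_id` and `chunk_index`, and the `process_chunk` placeholder that stands in for reading the job from Postgres, downloading the chunk from S3, and running the aligner. The real workers are not written in Python.

```python
import json


def process_chunk(job_id: str, chunk_index: int) -> dict:
    """Placeholder for: load the job spec, download the chunk, run the aligner."""
    return {"job_id": job_id, "chunk_index": chunk_index, "status": "aligned"}


def sqs_handler(event: dict, context) -> dict:
    """Entry point invoked by Lambda with a batch of SQS messages."""
    results = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])  # message payload published by the splitter
        results.append(process_chunk(body["job_id"], body["chunk_index"]))
    return {"processed": len(results), "results": results}
```

Each concurrent Lambda instance receives its own batch, which is what lets the chunks of one file be aligned side by side.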
Because of this, I was pretty confident that I had the right solution. Unfortunately, however, reality didn't meet my expectations. In order to explain why, I thought it would be best to show how long it takes Kiru to actually process an audio file when pointing at this Lambda processing pipeline, which is what I still use when it comes to dev.
If you'll remember back to my investigation when I was doing this locally, I said it took around 20 seconds per 2-minute chunk. And so for a video that's a minute and a half long, I would expect it to take around the same amount of time when considering the other parts of the processing pipeline.
However, when I go ahead and run this, you can see it takes a lot longer, just short of 40 seconds. The reason for this is a certain behavior of AWS Lambda called cold starts.
This is where Lambda has to spin up a fresh environment and load all of the code before it's able to begin handling the request. Cold starts typically occur when a Lambda function hasn't been executed for around 5 minutes or so, and all of the previously running instances have effectively been scaled back down to zero. Whilst cold starts are an expected behavior when using AWS Lambda, the issue in my situation was due to how my alignment algorithm works, as it has to load a large chunk of data into memory before it can begin waveform analysis. If I take a look at the AWS logs, you can see this is around 30 seconds in total. Now, this startup cost occurs no matter whether I run the algorithm locally or on Lambda. However, because Lambda scales down to zero, rather than this being a one-time cost like it is with a long-running service, it instead needs to be paid on every cold start. Speaking of cold starts, Lambda actually has two different startup modes. The first, which we've talked about already, is the cold start.
However, the second is known as a warm start. This is where an existing execution environment is reused, which means none of the initialization needs to take place again. To show the performance of a warm start, if I go ahead and process the exact same file using Kiru again, you'll see this time the processing step takes just short of 12 seconds, thanks to the fact I have warm capacity. This meant that by using AWS Lambda, it was technically possible to hit the target time I had. However, it left me between a rock and a hard place. Whilst the warm capacity was operating at the rate I wanted it to, cold starts were not, which meant that any user who encountered a cold start would effectively have a bad experience. For me, this was unacceptable. And so I needed to find a way to have just-in-time scaling whilst ensuring that none of my users would ever encounter a cold start.
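The difference between the two startup modes can be sketched in a few lines. State held at module level survives for as long as the execution environment is reused, so the expensive load is only paid when the environment is fresh. This is a toy illustration, not Kiru's worker: `load_alignment_data` is a hypothetical stand-in and the short sleep stands in for the roughly 30 seconds of loading seen in the logs.

```python
import time

# Module-level state lives for the lifetime of the execution environment.
_ALIGNMENT_DATA = None
STATS = {"loads": 0}


def load_alignment_data() -> dict:
    """Hypothetical stand-in for loading the aligner's data into memory."""
    STATS["loads"] += 1
    time.sleep(0.01)  # stands in for ~30 s in the real system
    return {"ready": True}


def warm_aware_handler(event: dict, context) -> dict:
    global _ALIGNMENT_DATA
    cold = _ALIGNMENT_DATA is None
    if cold:
        # Cold start: a fresh environment has to pay the load cost.
        _ALIGNMENT_DATA = load_alignment_data()
    # Warm start: the data is already in memory and is simply reused.
    return {"cold_start": cold}
```

Calling the handler twice in the same process reports a cold start only on the first call, which mirrors the gap between the two runs shown in the video.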
In other words, I needed to warm up the workers before the actual job arrived. As it turns out, there is a way to do this through a third scaling strategy, one that's much better suited to this sort of situation. This strategy is called predictive scaling and is used to scale infrastructure ahead of expected demand. The way it works is that whenever a scaling signal comes in, the system uses that signal to begin provisioning capacity before the actual workload arrives. Typically, this signal is derived from something like historical data or usage patterns. For example, if your system knows that traffic spikes every day around 3 p.m., then it can scale up to be available before that time each day, avoiding the latency hit from a scale-out event. This works especially well for applications that take a long time to initialize, where spinning up compute too late can impact performance, which is exactly the problem that I was facing. Therefore, it was pretty clear that predictive scaling was the right approach. However, before I could implement it, I had one question I needed to answer: specifically, what to use for the scaling signal. Now, because my traffic is unpredictable and can occur at any point throughout the day, I couldn't use typical predictive patterns based on historical data.
Instead, I needed to find a signal of intent, something that would inform me whenever work was about to begin.
Fortunately, as it turns out, I had the perfect one. In Kiru, whenever a user creates a new project from a video, the first thing that occurs is the audio track is extracted before then being directly uploaded to S3, which in turn kicks off the processing pipeline.
However, in order to upload this audio to the S3 bucket, the client requires a pre-signed URL, which it obtains from the API server. Because this happens before any of the actual processing, it's a pretty clear signal of intent that a workload is about to begin. And because the API call also includes metadata about the audio file, such as its total size and duration, I could use this to estimate the amount of scale I would potentially need. This meant I had a pretty good signal of intent with which to begin scaling. And so, all that remained was to find a way to actually use this intent in order to provision and warm up some compute.
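In code, using the pre-signed URL request as the scaling signal might look something like the sketch below. The metadata key `duration_seconds`, the `scale_to` callback, and the placeholder URL are all hypothetical; the only figure taken from the video is the target of roughly one worker per 2 minutes of audio.

```python
import math
from typing import Callable


def workers_for(duration_seconds: float, chunk_seconds: float = 120.0) -> int:
    """Aim for roughly one worker per 2 minutes of audio, with at least one."""
    return max(1, math.ceil(duration_seconds / chunk_seconds))


def handle_presign_request(metadata: dict, scale_to: Callable[[int], None]) -> dict:
    """Handle the client's request for an upload URL and fire the scaling signal."""
    wanted = workers_for(metadata["duration_seconds"])
    # The upload has not started yet, so compute gets a head start on warming up.
    scale_to(wanted)
    return {
        "upload_url": "<pre-signed URL would go here>",  # placeholder, not a real URL
        "workers_requested": wanted,
    }
```

Passing the scaling action in as a callback keeps the signal independent of whichever platform ends up providing the compute.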
Initially, I tried doing this using AWS Lambda through a couple of different approaches. The first was what I would consider to be a rather hacky approach, which was basically just sending a bunch of empty messages down the SQS queue in order to pre-warm a number of instances. Whilst this sort of worked, there were a lot of times when it just straight up didn't, mostly whenever the Lambda instances had scaled back down to zero, which could easily occur if, say, the audio file took a long time to upload. Another reason this approach could hit a cold start was the way I was calculating the expected number of workers, which aimed to be around one worker for every 2 minutes of audio. If the audio produced more chunks than that calculation, due to the fact that I tried to split in areas of silence, then this could also cause a cold start to be encountered, which would slow down the entire job. Whilst I could have added a buffer to this number, it just felt rather janky, and so I wanted to use a different approach.
The second approach I looked at was to make use of a feature that AWS Lambda provides called provisioned concurrency. This allows you to keep a number of Lambda instances basically always on, effectively turning them into always-on capacity. However, in my case, rather than leaving these always provisioned and paying for the idle compute, I instead used the signal of intent to scale the provisioned concurrency up to the expected number of workers for the job. Then, by setting the reserved concurrency to that same number, it would force AWS Lambda to only send jobs through provisioned concurrency instances rather than encountering any cold starts, which prevented the issue I was encountering beforehand when my calculations for the number of workers were slightly off. Whilst this worked on paper, when it came to the actual implementation, I ran into yet another issue.
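As a sketch of what that second approach involves, the two settings can be changed through the Lambda API. The example below assumes `lambda_client` is a boto3 Lambda client (created elsewhere with `boto3.client("lambda")`); the function name and alias are placeholders. Note that provisioned concurrency is configured against a published version or alias, whereas reserved concurrency applies to the function as a whole.

```python
def prewarm_lambda(lambda_client, function_name: str, alias: str, workers: int) -> None:
    """Scale provisioned concurrency to `workers` and cap the function at the same number.

    `lambda_client` is assumed to be a boto3 Lambda client.
    """
    # Ask Lambda to initialize `workers` execution environments ahead of time.
    lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,  # provisioned concurrency needs a version or alias
        ProvisionedConcurrentExecutions=workers,
    )
    # Cap total concurrency at the same number so that no extra,
    # cold on-demand instances are started for this function.
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=workers,
    )
```

Both calls return quickly, but the provisioned environments themselves take time to become ready, which is the catch described next.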
The issue was that provisioning these instances was taking longer than the actual cold start itself: on average, around 3 minutes. This basically meant that provisioned concurrency was going to be slower, and I would have just been better off using Lambda with cold starts. However, this experience and implementation wasn't entirely in vain, and it inspired me to come up with a rather similar solution. Rather than using AWS Lambda, I instead decided to use a different AWS product, ECS, or Elastic Container Service, which is a managed container orchestration platform that allows you to run and scale Docker containers without needing to manage the underlying infrastructure yourself. By coupling this with AWS Fargate, which is Amazon's serverless compute engine for containers, I could send an API request to the ECS cluster whenever the signal of intent came in, increasing the number of running tasks before the actual workload arrived.
There was, however, just one question. How long would it take for an ECS container running on Fargate to deploy completely from scratch? As it turned out, this ended up being the exact same amount of time as provisioned concurrency in Lambda. However, I wasn't cooked, because unlike Lambda, there was actually room to improve. In order to figure out how to speed up this startup time, I began by first measuring where time was actually being spent. Through this, I found that 40 seconds was being spent on container startup, with 10 seconds going to provisioning the Fargate resources and another 30 seconds to loading the data into memory, which was expected. What was unexpected, however, was where the rest of the time was being spent, as it was going towards pulling the container image down from the container registry. The reason for this is that when using AWS Fargate, there's no Docker image cache available, due to it being a fully managed serverless compute environment where each task may be scheduled onto fresh infrastructure.
Therefore, in order to speed up the ECS deployment, I needed to find a way to reduce this image pull time as much as possible.
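Putting the measurements quoted above side by side shows why the image pull was the thing to attack. This is just arithmetic on the approximate figures from the video (around 3 minutes in total, 10 seconds for Fargate resources, 30 seconds for loading data); the function name is made up for the example.

```python
def image_pull_seconds(total_startup_s: float,
                       fargate_s: float = 10.0,
                       data_load_s: float = 30.0) -> float:
    """Whatever is left of the startup time after the fixed costs is image pull."""
    return total_startup_s - fargate_s - data_load_s


total = 180.0  # "around 3 minutes"
pull = image_pull_seconds(total)
print(pull)                       # 140.0
print(round(pull / total * 100))  # 78 (percent of the total startup time)
```

The 40 seconds of fixed cost is the floor that no amount of image optimization can get below.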
Fortunately, it seems like a lot of people have run into this issue as AWS actually provides some rather comprehensive documentation on how to achieve this. If we take a quick look at this document, you can see that they provide a few key steps. The first of which is to reduce the image size down.
In my case, I had started with an image that was about 1.4 GB in total, which, yeah, was far too large. This was the first thing I needed to improve. And so, I went through the process of slimming this container image down using a couple of different techniques.
The first of these was to make use of multi-stage Docker builds, which allow you to separate the build stage from the actual deployment stage, meaning that you don't need to bundle all of the build resources in the final image. The second technique was to make use of aggressive image pruning in order to remove as much as I possibly could before things started to break. By doing both of these things, which I now like to call the Ozempic method, I managed to get the container image size down to a rather impressive 503 megabytes, which was as low as I could go before the lights started to flicker. Upon testing this new slim image, I had managed to cut the startup time down to a cool one and a half minutes, which was a great result, although it wasn't good enough, as I wanted to try and get this as close to the baseline of 40 seconds as I possibly could.
Therefore, I went back to the AWS documentation, with the next step it recommended being to apply compression, either through the use of gzip or zstd. In my testing, I found that zstd compression performed the best out of the two, shaving off around another 15 seconds from the total image pull time. Despite this, however, I ultimately ended up using gzip because it worked with the next optimization step that was recommended. This step was an approach called SOCI, aka Seekable OCI, which is an AWS-backed image acceleration format that lets Fargate start a container before the entire image has been fully downloaded, speeding up the total startup time. This was exactly what I needed. Although, of course, there was a catch. The SOCI optimization tool is only available for Linux and not macOS, which is what I'm now using as my primary development machine. Honestly, it's my own fault for just trying to simplify my digital life.
Fortunately, I was able to work around this constraint by creating a simple workflow in GitHub Actions, which would execute on a Linux runner and would be invoked whenever this container image was built through tofu apply. Whilst this wasn't the most elegant solution, it ended up being a rather decent workaround. And through using SOCI and gzip on my post-GLP-1 container image, I managed to get the image pull time down to just short of 20 seconds, which meant the total startup time was now around 1 minute. Whilst this wasn't the 40 seconds that I wanted, which would have been impossible anyway, it was still a rather decent result. And given that I had the lead time of uploading, transcribing, and some of the earlier stages in the pipeline, this one minute was quick enough for the majority of audio files that Kiru would encounter.
However, it wasn't quick enough for every audio file. And because I didn't want any of my users to experience an unnecessary delay, I needed to make one final change. This change was to have some always-on capacity provisioned, enough to handle any audio files that would take less than 1 minute to reach the alignment stage. Through testing, the magic number for this ended up being around three instances of the aligner running at all times, which, when using Fargate Spot, only came to around $30 a month.
Now, whilst that's more than I initially wanted to pay for a single stage in my processing pipeline, it does provide the best user experience, which to be honest, when building a product is perhaps the most important thing.
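The final arrangement, a small always-on floor plus scaling up on the signal of intent, can be sketched as a single call to the ECS API. This assumes `ecs_client` is a boto3 ECS client (created elsewhere with `boto3.client("ecs")`); the cluster and service names are placeholders, and combining the floor and the estimate with `max` is an assumption about how the two fit together rather than something stated in the video.

```python
ALWAYS_ON_TASKS = 3  # the "magic number" of aligner instances kept running


def scale_aligner_service(ecs_client, cluster: str, service: str,
                          estimated_workers: int,
                          floor: int = ALWAYS_ON_TASKS) -> int:
    """Raise the service's desired task count ahead of the workload.

    `ecs_client` is assumed to be a boto3 ECS client.
    """
    # Never drop below the always-on floor that covers short files.
    desired = max(floor, estimated_workers)
    ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired,
    )
    return desired
```

Scaling back down to the floor once a job has finished would use the same call with `estimated_workers=0`.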
Speaking of which, I want to give a big shout out to Neon for sponsoring this video. I've been using Neon now for just over 2 years and they're absolutely fantastic when it comes to meeting not only all of my Postgres needs but providing other features that just allow me to sleep better at night when it comes to managing production data. Not only this, but thanks to the way that their architecture works, they also work really well when it comes to these sorts of serverless workflows and I don't have to worry about overloading the server with multiple connections even when I'm performing a huge amount of concurrency.
Best of all, however, is that all of this is available on their free plan.
And so if you're looking to find a new Postgres provider for all of your projects and you don't want to pay for an always on RDS instance, then head on over to their website using my link in the description down below. Otherwise, that's all from me, but I want to give a big thank you to you for watching and I'll see you on the next one.