Serverless computing scales automatically and rapidly to absorb unpredictable traffic spikes. In Lambda's execution environment model, the first invocation creates a sandbox that is then reused for subsequent requests, and a function can scale out to thousands of concurrent execution environments. For steady-state workloads, provisioned concurrency pre-initializes capacity to eliminate cold starts and optimize costs, while requests exceeding provisioned capacity automatically spill over to on-demand scaling, so availability does not drop during unexpected traffic spikes.
Deep Dive
Beyond the Basics: Production Serverless Patterns for Extreme Scale • Janak Agarwal • GOTO 2025
I'm Janak.
Janak Agarwal. At AWS, I lead product management for the service called AWS Lambda.
By way of a quick show of hands, just to get a quick sense: how many of you have heard about Lambda?
Cool. Maybe use Lambda in an application?
Really cool. Lots of cool people here.
You know, heartwarming to see.
We're going to talk about a really interesting topic in this session today.
It's scale.
Specifically, how serverless makes it easy for you to scale.
Those of you who've been building applications for a while, maybe operating applications for a while, you know how difficult scaling is. And serverless makes it easy. I'm going to try to cover two patterns, but we'll see if we have the time.
In this session, we're lucky that we have a good topic,
even if it's a hard topic, because otherwise, in this after-lunch session, with my monotonous voice, I promise you I could have put you to sleep. So, to make it more interesting, I'm going to ask you a series of questions in the middle of the presentation. Don't let me down. Work with me. This is to make sure you're all paying attention, okay?
So, let us start.
Okay, let's dive into what type of scale we're talking about today.
So, bear with me, you know, close your eyes for a second. I'll not steal your wallets and backpacks, promise.
Close your eyes, okay. You're in the first row, okay. Imagine your favorite artist has announced a concert within driving distance of your home.
You've been wanting to see this artist's concert uh for a while. The presale goes live at 9:00 p.m. What do you do?
If you're anything like 5 million people in the US recently, you would go at 8:30 and hit F5 on your browser.
Try to refresh it. That is some level of scale.
And then, moving on, there are big sports events, like, let's say, the Cricket World Cup: India-Australia, that will bring back bad memories, or India-South Africa.
500 million people log in at around the same time to view that stream. So, that is the kind of scale we want to target today: how to solve for it with serverless, and how serverless makes it easy.
>> [clears throat] >> Uh let's get some terminology in order.
So, this is what a consistent workload looks like.
From the start time, 7:25, to the end time, 7:31, there's a gradual increase in traffic, sure.
There's a good amount of peaks and troughs, and that is also fine, but it's consistent by and large.
The peak-to-mean ratio is fairly low.
Okay, most workloads fit this category.
No problems there. They work well. We like them.
On the other hand, there's your F5, maybe when your kids' exam results are due and you're also F5ing with them, or your concert, or your Cricket World Cup when Sky caught that catch, first ball, last over. It spikes. Those spikes may be unpredictable or predictable, but they spike, and we want to discuss how to solve for these kinds of workloads.
So, how do we solve for them?
A more traditional approach would be: you obviously have some compute, you're managing the VM, you're patching it, you're deploying your application, and so on. You're maybe using some sort of an orchestrator system, maybe Kubernetes and containers.
You're trying to apply auto scaling behaviors, fine-tuning them to reduce your costs.
But you pay a price in a good amount of idle costs.
In the picture, if you provision your compute for, let's say, around the orange line, then when the sudden spikes come, the auto scaler is not able to keep up.
And oftentimes, it results in an outage, which means your end user is not able to purchase tickets or view their kids' exam results. Maybe that's a good thing, but you don't want that end user to experience that kind of behavior.
On the other hand, if you provision for the absolute peak, you get some complaints.
Those complaints come from your wallet. It gets lighter, because overprovisioning, leading to idle capacity, could be expensive. It depends, but yeah, it could be seriously expensive.
So, you know, we don't want to do that.
>> [clears throat] >> Now, how does a serverless approach help you?
First, serverless services like Lambda will automatically patch your host, your micro VM, and also your runtime. When I say patch, it's both for security and for performance. Any patches that come in open source, we apply them seamlessly for your application, and life is good.
And then, they scale to zero. What that means is, if your end users are not sending you a single request, then you're not paying for anything. Idle capacity is never charged. It scales to zero in that sense, and again, your wallet says thank you.
But at the same time, it auto scales up and down really really fast.
And that is what we'll see next.
In response to incoming events, serverless services typically scale the fastest among all forms of compute.
With me so far?
Let's dive a bit deeper. This is where the questions come.
So, to see how Lambda helps you with scaling, we're first going to learn how concurrency works.
So, the way Lambda works is, events trigger a function invocation.
On receiving the function invocation, Lambda creates an execution environment.
We call it a sandbox or an execution environment. You'll hear me use those terms interchangeably, so bear with me.
We create an execution environment the first time a request comes in.
That is the cold start period. You might have heard of a cold start in serverless parlance.
That's what a cold start is. And then we process the event as fast as possible and give the result back to you.
Okay? This is how it works in practice.
An invoke comes in, and an execution environment is created. This is invoke number one.
After the cold start, the request is processed and the result is returned to you.
And each execution environment supports one request at a time.
Now, we are at the time of the red line.
Let's say the request is done processing. We do not take this environment back.
We keep it around for a while.
Any guesses why?
Perfect. So, this is to give you an interactive experience. We don't want to take it away. If another invoke comes within a short time, you skip the cold start, and life is good.
There's no extra latency.
Now, let's say that at this time (time is always moving, not constant) the second invoke comes in.
The same execution environment is reused. We don't create a new one, because it was right there.
It's, again, optimizing for latency.
Now, the good thing is you're not paying for this gap, this idle period between one and two.
This is free for you. This is the cost that we bear in order to give you a good customer service.
Now, let us say that an invoke three comes, and this is what the situation is. The second invoke is going on.
Who can tell me what happens? Is a new sandbox created, or not?
Okay, how about this? Who believes a new sandbox is created?
And the rest are paying attention or no?
Uh so, let us see.
You're right, because invoke number two is still going on. When the next invoke comes in, your application needs to scale, and we automatically do it for you.
Let us make things a little more interesting.
Here, let's say an invoke four comes.
What happens?
Do we create a new one?
You guys are going to make me finish in like 15 minutes.
Uh okay, good. You are right. You see, you know, we're using the first one.
Time moves on, doesn't stop for anyone.
What happens if invokes five and six come in at this time?
Do we create new?
Fantastic.
I want to see some other people also, like, you know, speak. It's the same five, six people speaking.
I wish I knew names, and I could cold call.
You're right, it's five and six.
Now, time again moves on. We have three more invokes now.
Fantastic.
And now it's the most interesting of all.
Here, four more invokes come.
What happens?
And you get it. I think you should see four more, and there you go.
So, this is how Lambda's concurrency works. Pictorially, in the idle times, you will see that the graph is at zero. There's no invoke in progress, so you're not paying for anything. And then, as you scale up, you get more concurrency, up to the limit of your account, and life is good.
Application scales automatically.
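The walkthrough above (one request per execution environment, reuse of an idle environment, a new environment only when all existing ones are busy) can be sketched as a small simulation. This is an illustrative model and not Lambda's actual placement logic; the `Invoke` type, the `simulate` function, and the `keep_warm` window of 300 seconds are assumptions made up for the example, since the talk only says an idle environment is kept around "for a while".

```python
from dataclasses import dataclass


@dataclass
class Invoke:
    start: float     # arrival time in seconds
    duration: float  # processing time in seconds


def simulate(invokes, keep_warm=300.0):
    """Return (cold_starts, billed_seconds) for a list of invokes.

    Each environment serves one request at a time. An idle environment is
    reused if the next invoke arrives within keep_warm seconds (a hypothetical
    value) of it becoming free; otherwise a new environment is created.
    """
    free_at = []          # per environment: time its current request finishes
    cold_starts = 0
    billed_seconds = 0.0  # only time spent processing is counted, never idle gaps
    for inv in sorted(invokes, key=lambda i: i.start):
        reusable = [k for k, t in enumerate(free_at)
                    if t <= inv.start <= t + keep_warm]
        if reusable:
            k = reusable[0]        # warm environment: no cold start
        else:
            free_at.append(0.0)    # all busy (or expired): create a new one
            k = len(free_at) - 1
            cold_starts += 1
        free_at[k] = inv.start + inv.duration
        billed_seconds += inv.duration
    return cold_starts, billed_seconds


# Invokes 1 to 4 roughly as in the walkthrough: 2 reuses the first
# environment, 3 arrives while 2 is running, 4 reuses the first again.
demo = [Invoke(0, 2), Invoke(5, 4), Invoke(6, 2), Invoke(10, 1)]
cold, billed = simulate(demo)
```

In this sketch only two environments are ever created for the four invokes, and the idle gap between invokes one and two adds nothing to the billed time.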
Now, the reason why serverless, or Lambda, helps you scale really easily and quickly is here.
You get a thousand concurrency. Those one, two, three were small numbers we were playing with.
Imagine this: you get a thousand execution environments right off the bat.
And then, you get a thousand more every 10 seconds.
And this limit is per function.
So, let's say in your account you have two functions; we'll call them two workloads. One is authenticating a new user when they join that World Cup stream.
And the second is, after you authenticate, you have to retrieve user preferences, for example what ads to show. You need some preferences for that user in order to target them with the right ads.
So, both those workloads scale independently of each other.
Okay?
With me so far? This is important.
This is another graph. In a little over 3 minutes, you can get 16 to 18K, not requests, but execution environments.
And depending upon the length of your requests, this can easily translate to tens of thousands of requests per second, instantly.
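The scaling rate quoted above (a thousand environments immediately, a thousand more every 10 seconds, per function) can be turned into back-of-the-envelope arithmetic. The sketch below is an idealized ceiling, so a measured ramp such as the 16 to 18K figure on the graph can sit below it; the function names and the `account_limit` parameter are assumptions for illustration.

```python
def max_environments(seconds, initial=1000, step=1000, interval=10,
                     account_limit=None):
    """Idealized upper bound on execution environments after `seconds`."""
    cap = initial + step * (int(seconds) // interval)
    return cap if account_limit is None else min(cap, account_limit)


def requests_per_second(environments, avg_duration_s):
    """Throughput if every environment is kept busy, one request at a time."""
    return environments / avg_duration_s


ceiling_after_one_minute = max_environments(60)                        # 7000
throughput = requests_per_second(ceiling_after_one_minute, 0.5)        # 14000.0
```

With requests that take half a second each, the one-minute ceiling already corresponds to 14,000 requests per second in this model, which is how the environment count translates into "tens of thousands of requests per second" for short requests.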
So, let's come back to our workload.
You saw what we did in a serverful, or traditional, setup. What do you think you have to do if you've built an application on serverless? What do you have to do to scale it?
It's a one-word answer.
>> [sighs] >> You really have to do nothing, because you saw the scaling rate. It automatically scales up and scales down. You're not paying for any idle time, and life is good.
Put on your sunglasses and then we go forward.
So, what happens after the big event?
You know, your costs are not going away.
The application still has to stay.
Uh that is what we'll quickly see next.
So, quieter periods here. The quieter period is, let's say, you know, 5:18 to 7:13, you know, first half of 7:13 in the graph here.
How do you optimize for costs there?
While serverless does offer scale to zero, it's not really a reality for most applications. In most production applications, there's always someone watching cricket even if there's no World Cup. Not every show or every concert is a sellout, and so on. So, there's always some traffic.
So, how to optimize for that quiet period is what we're going to see next.
So, Lambda has this feature called provisioned concurrency.
It is really useful for steady-state workloads, or workloads where you see these quiet periods, or workloads with low peak-to-mean ratios.
With provisioned concurrency, there are no cold starts. We keep a certain amount of capacity pre-initialized for you to serve the invokes that come in.
So, literally zero cold starts.
Then, there are two types of auto scaling we support there, scheduled and target tracking.
And we'll see how it works.
The pricing structure is important to understand here. In the on-demand function model, you have two dimensions, requests and duration. In the provisioned concurrency model, there are three dimensions. Requests and duration still exist, and there's a third dimension, provisioned concurrency. This is a nominal charge for the amount of capacity you configure as your base.
Now, the important thing to note is that there's a 16% discount on duration if you're driving good utilization in the provisioned concurrency model.
And there's ways to do it. We will cover that in auto scaling in a second.
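To see why utilization matters with the three pricing dimensions, here is a minimal cost model. Every rate in it is a hypothetical placeholder in arbitrary units, not an AWS price, and the `Rates` type and function names are invented for the sketch. It only shows the shape of the trade-off: a lower duration rate plus a flat capacity charge pays off once enough of the configured capacity is actually busy.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    # All rates are hypothetical placeholders, not AWS prices.
    per_request: float
    on_demand_gb_s: float    # duration charge in the on-demand model
    provisioned_gb_s: float  # duration charge on provisioned capacity
    capacity_gb_s: float     # charge for capacity kept configured


def on_demand_cost(r, requests, busy_gb_s):
    # Two dimensions: requests and duration.
    return requests * r.per_request + busy_gb_s * r.on_demand_gb_s


def provisioned_cost(r, requests, busy_gb_s, configured_gb_s):
    # Three dimensions: requests, duration, and configured capacity.
    return (requests * r.per_request
            + busy_gb_s * r.provisioned_gb_s
            + configured_gb_s * r.capacity_gb_s)


def break_even_utilization(r):
    # Utilization u = busy / configured at which both models cost the same:
    # u * on_demand = u * provisioned + capacity
    return r.capacity_gb_s / (r.on_demand_gb_s - r.provisioned_gb_s)


rates = Rates(per_request=0, on_demand_gb_s=10, provisioned_gb_s=6, capacity_gb_s=3)
full = provisioned_cost(rates, 0, 100, 100)  # fully utilized: 900 vs 1000 on-demand
half = provisioned_cost(rates, 0, 50, 100)   # half utilized: 600 vs 500 on-demand
```

With these made-up rates the break-even point is 75% utilization; below it the idle configured capacity costs more than the duration discount saves.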
So, scheduled auto scaling.
The name really gives away what it does. If you have a workload where, at the top of the hour, you have to do some data processing work, like a cron job, it will scale up in advance. Let's say the job needs to start at 9:00; you can set it to scale up at 8:55 or 8:57. And after the top-of-the-hour processing, it automatically scales down.
This is also pretty good for things like deal of the day. For this kind of workload, you can predict when you want to give out that deal: 9:00 a.m., 10:00 a.m., and so on.
So, it's fairly simple.
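As a rough sketch of the 8:55 scale-up described above, the snippet below builds the request parameters for two scheduled actions using the Application Auto Scaling `put_scheduled_action` operation as exposed by boto3. The function name, alias, capacities, and action names are hypothetical, the scale-down time of 9:15 is an assumption, and the alias would first need to be registered as a scalable target (`register_scalable_target`), which is not shown. Schedules are interpreted in UTC unless a timezone is given.

```python
def scheduled_actions(function_name, alias, peak, baseline, up_cron, down_cron):
    """Build kwargs for two put_scheduled_action calls (scale up, scale down)."""
    common = {
        "ServiceNamespace": "lambda",
        "ResourceId": f"function:{function_name}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
    }
    return [
        {**common, "ScheduledActionName": "scale-up", "Schedule": up_cron,
         "ScalableTargetAction": {"MinCapacity": peak, "MaxCapacity": peak}},
        {**common, "ScheduledActionName": "scale-down", "Schedule": down_cron,
         "ScalableTargetAction": {"MinCapacity": baseline, "MaxCapacity": baseline}},
    ]


def apply_actions(client, actions):
    # `client` is expected to be boto3.client("application-autoscaling").
    for action in actions:
        client.put_scheduled_action(**action)


# Hypothetical job that starts at 9:00: scale up at 8:55, back down at 9:15.
actions = scheduled_actions(
    "process-hourly", "live", peak=200, baseline=10,
    up_cron="cron(55 8 * * ? *)", down_cron="cron(15 9 * * ? *)",
)
```

Keeping the parameter construction separate from the API call makes the schedule easy to review and to test without touching an AWS account.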
And then, target tracking is a little interesting.
Really interesting. Don't go by what I understand the feeling. But on the left, you see static provisioned concurrency allocation. This is really to drive home the difference with what target tracking is.
On the left, you just set it up like, here is the level I want.
And there's this concept that beyond provisioned concurrency, if there are more requests, you spill over to on-demand, the previous model, the default model.
So, any requests that require more capacity than what you've configured go to the on-demand model.
Capacity is provisioned at a fixed level here. When you have fewer requests than that, you're paying for some idle capacity.
On the right... Actually, you're right. Yeah, you're right. So, on the right, you have dynamic provisioned concurrency allocation.
This allows you to set up targets. You can pick some metric, like bytes transferred or CPU utilization.
You set an ideal value, and then, reactively, the auto scaler will try to meet that ideal. It will spin up new capacity or spin it down to try to maintain the target that you think is right for your workload.
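The idea behind target tracking can be sketched with one line of arithmetic: pick the capacity at which the observed load would sit at the target utilization, then clamp it to configured bounds. This is a simplified model and not the actual auto scaler algorithm (which also reacts to alarms over time); the function name and the choice of "environments in use" as the tracked metric are assumptions for the example.

```python
def desired_capacity(in_use, target_pct, min_capacity, max_capacity):
    """Capacity at which `in_use` busy environments equal target_pct utilization.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    wanted = (in_use * 100 + target_pct - 1) // target_pct
    return max(min_capacity, min(max_capacity, wanted))


# With a 70% target: 70 busy environments call for 100 provisioned,
# 140 busy call for 200, and an idle function falls back to the minimum.
steady = desired_capacity(70, 70, min_capacity=10, max_capacity=500)
busier = desired_capacity(140, 70, min_capacity=10, max_capacity=500)
idle = desired_capacity(0, 70, min_capacity=10, max_capacity=500)
```

A lower target leaves more headroom before requests spill over to on-demand, at the price of more idle provisioned capacity.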
There's some other cost optimizations that are useful to consider.
A common practice that we've found, for reasons good or bad, is that some folks prefer to write two functions with the same sort of business logic. One is for test, one is for production.
Now, at least when I used to be a developer, I didn't wrap int i = 0 in a try-catch.
That thing is just supposed to work.
So, you don't have to test provisioned concurrency scaling beyond a point. Maybe the first time, when you want to try out the feature.
So, on test workloads, actually, turn it off. It's one of those things that you just have to do to save costs.
The next is power tuning.
So, power tuning is a very underrated tool. It's an open-source tool that helps you configure the right amount of memory for your function.
And why is that important?
The more memory you configure to the function, the more CPU you get.
So, let's say you have a complex data processing job; then you really want more CPU, and you should configure higher memory. But if you're just moving a record that you read from S3 to another S3 location, you don't need too much CPU or too much memory.
So, power tuning will help you understand the right level of memory to set for your function without sacrificing any performance. And that's the key piece.
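The trade-off that power tuning explores can be illustrated with a few lines of arithmetic: more memory costs more per second but can shorten the run, so the cheapest setting is not always the smallest one. This is not the open-source tool itself, only a sketch of the comparison it automates; the measurements and the unit price below are made-up numbers, and `cheapest_config` is a hypothetical helper.

```python
def cheapest_config(measurements, price_per_gb_s, max_duration_ms=None):
    """Pick the memory size with the lowest cost per invocation.

    measurements maps memory in MB to average duration in ms. Settings slower
    than max_duration_ms (if given) are skipped so performance is not sacrificed.
    Returns (memory_mb, cost) or None if nothing qualifies.
    """
    best = None
    for memory_mb, duration_ms in measurements.items():
        if max_duration_ms is not None and duration_ms > max_duration_ms:
            continue
        cost = (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_s
        if best is None or cost < best[1]:
            best = (memory_mb, cost)
    return best


# Hypothetical measurements for a CPU-bound function.
measured = {128: 4000, 512: 800, 1024: 450, 2048: 400}
cheapest = cheapest_config(measured, price_per_gb_s=1.0)
fast_enough = cheapest_config(measured, price_per_gb_s=1.0, max_duration_ms=500)
```

In this made-up data set, 512 MB is the cheapest setting overall, while 1024 MB is the cheapest one that keeps the average duration under 500 ms.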
And then, we also support Graviton, or Arm-based, instances. They're priced competitively at a 20% discount, leading to an overall 34% better price performance.
So, do it. There's no real excuse not to.
And then, for large-scale one-off events, AWS has a mechanism called Countdown.
A lot of you have account teams here. Coordinate with your account team, and they will help you coordinate with every AWS service your application touches, so that you have a fully seamless experience.
So, remember this: Countdown.
Now, let's say you've successfully set up provisioned concurrency, and you still get an unexpected traffic spike. After all, you don't know when Virat hits a six or someone takes a cool catch. You just don't know.
So, how do you do it? What do you do?
Let us see.
How would you handle it? Let's say you've applied your provisioned concurrency for the quiet period, and there's a spike.
What would you do as a serverless developer? It's again a one-word answer.
Yeah.
Nothing. Because, remember, we spill over to on-demand, and on-demand again has that fantastic scaling rate, a thousand every 10 seconds. So, you really have to do nothing.
When requests exceed provisioned concurrency, there's no availability drop.
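The spillover behavior can be summarized in a tiny model: concurrent requests fill provisioned capacity first, and anything beyond it is served by on-demand environments instead of being rejected. The function below is an illustrative sketch with an invented name and ignores the on-demand scaling rate and account limits.

```python
def split_capacity(concurrent_requests, provisioned):
    """Return (served_by_provisioned, served_by_on_demand)."""
    on_provisioned = min(concurrent_requests, provisioned)
    on_demand = concurrent_requests - on_provisioned
    return on_provisioned, on_demand


quiet = split_capacity(50, provisioned=100)   # everything fits in provisioned
spike = split_capacity(250, provisioned=100)  # the excess spills over to on-demand
```

During the quiet period all 50 requests stay on provisioned capacity; during the spike, 100 stay there and the remaining 150 are handled on-demand.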
So, long story short, I've tried to simplify it, but scaling in general is hard. It's a really hard engineering problem to solve. Serverless helps you simplify it and abstract it away, with pretty large limits.
Just a quick recap.
Automatic scaling: that's what we do. That's what serverless specializes in. Really fast scaling.
Fast capacity to you.
The fastest in the compute portfolio.
And then provisioned concurrency: use it to handle your quieter periods. And remember those gotchas: make sure to turn it off for test workloads, and use power tuning to get the best outcome.
Provisioned concurrency spills over to on-demand, so you're still covered for spikes. No problems there.
And then, with serverless, the promise of no host security patching and no infrastructure management still holds true.
And it should always remain true.
With that, that was the talk.
>> [applause]