Building a platform layer with managed workers, standardized configurations, and automated infrastructure management transforms durable execution from a tool usable by few teams into a production-ready platform that enables rapid adoption across many teams, reducing time-to-production from weeks to days while maintaining security, observability, and operational control.
Deep Dive
OpenAI @ Replay 2026 | From Adoption to Production at Scale
I'd like to welcome to the stage Shat Vado. He's a member of the technical staff at OpenAI. If you were here for the keynote, you'll note that Vincod alluded to this talk, where Chhatise will be giving us a deeper look into how OpenAI leverages Temporal for critical orchestration workflows at massive scale. Everyone, Shhat.
>> Thank you, Alex, for the introduction. Hi everyone, I'm Shat. I'm a member of technical staff at OpenAI, working in the Applied Infrastructure org. I was one of the first two engineers who introduced Temporal Cloud to our product teams when we saw the need for infrastructure to support long-running agentic workflows and durability. Today's talk focuses mostly on how we adopted Temporal Cloud at OpenAI, not just as a tool that a few teams can use, but as a platform built around it so that many teams can trust the system. My talk is organized into three parts. First, I'll go into the use cases OpenAI uses Temporal for and why these workloads are a good fit for durable execution. Second, I'll talk about the platform journey: how we started the adoption, the lessons we learned, and how we grew into the paved path after that.
Cool. So, how does OpenAI use Temporal today? You already saw in the keynote that we discussed the different workloads we have. They mainly fall under agentic workflows, infrastructure control planes, data connectors and retrieval, and business processes. These workflows are not the same application, but they share the same needs and the same shape. They need to be long-running.
They have code dependencies that can fail. They need state to be durable, and they cross ownership boundaries: a workflow can call into a model, a storage engine, and whatnot. All of these have different latency needs. Some can be fast, some can be slow, some can fail, and you need retries to make sure your agentic workflow succeeds.
One concrete use case we run on Temporal is image generation in ChatGPT.
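The retry behavior mentioned here can be sketched in plain Python. This is only an illustration of retrying a flaky dependency with exponential backoff, not Temporal's implementation; the helper `call_with_retries` and its parameters are hypothetical names chosen for this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    initial_interval: float = 0.1,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts is used up (hypothetical helper)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)      # wait before the next attempt
            delay *= backoff  # grow the wait exponentially
```

In a durable execution system the attempt count and the pending wait are persisted, so a retry survives a worker restart; this in-process sketch does not have that property.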
This is a very good use case because it was not something we started on Temporal. When it went live, sometime in spring last year, it was still a synchronous request-response flow. The user would just submit a request in ChatGPT, like "create an image for me," and wait for the image to come back. But as you all know, it went viral in two or three days. Everyone in the world was creating Ghibli images, and our whole infra was running hot. We were unable to support the amount of load coming to us, and we started running into rate limits and rejecting requests, which was not a good user experience. So while some of the infra teams were making sure the current infra could handle the load, the part of infra where I was worked on moving the synchronous flow into Temporal, so that we could absorb all the back pressure from the underlying infra and still serve all the requests users were sending us. Today we are creating about 1 billion images per week, and everything runs through Temporal.
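The difference between the two flows can be illustrated with a toy simulation. This is a sketch under simple assumptions (fixed backend capacity per tick, no timeouts), not a model of the actual system; `sync_path` and `queued_path` are hypothetical names.

```python
from collections import deque


def sync_path(requests_per_tick: list[int], capacity: int) -> tuple[int, int]:
    """Synchronous flow: anything above capacity in a tick is rejected."""
    served = rejected = 0
    for arriving in requests_per_tick:
        served += min(arriving, capacity)
        rejected += max(arriving - capacity, 0)
    return served, rejected


def queued_path(requests_per_tick: list[int], capacity: int) -> tuple[int, int]:
    """Queued flow: every request is accepted and the backlog drains at capacity."""
    backlog: deque[int] = deque()
    served = 0
    next_id = 0
    for arriving in requests_per_tick:
        for _ in range(arriving):
            backlog.append(next_id)  # accept the request instead of rejecting it
            next_id += 1
        for _ in range(min(capacity, len(backlog))):
            backlog.popleft()        # the backend works through the backlog
            served += 1
    return served, len(backlog)
```

With a burst of 10 requests against a capacity of 4 per tick, the synchronous path drops the excess, while the queued path serves everything a few ticks later.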
So usage at OpenAI has exploded in the last year. We started early, in Q1 of last year, then it grew 20x in Q2, and then another 3x in H2. That's about 60x growth in one year. We have about 650 namespaces. That was two weeks ago when I checked; last night I checked, and it's already 700-plus. There are 100-plus workers running and about a thousand different types of workflows running on Temporal Cloud.
So how did we get here? I'm now going to talk through the platform journey: how we started, how we put this into the hands of a few people, and how we ended up with so much growth. To begin with, we focused on a safe starting point. That meant starting narrow and making the risky parts explicit before making the path self-serve. We included a very simple SDK wrapper to make sure users provide safe inputs and get back safe outputs. We also contributed to Temporal's Terraform repository so that we could spin up namespaces, accounts, users, and whatnot. And we provided payload encryption natively inside the workers. We explicitly deferred providing a way to spin up workers, key rotation, and how to scale those things. The reason we went with this scope was that we first wanted to test what the boundaries and security risks were before making it available to everyone.
Naturally, once a few teams started adopting it, more teams saw the benefit of using Temporal. They started shaping their own agentic workflows around Temporal, and more and more use cases started coming to us. We started seeing developer friction arising from the many infra pieces we had to bootstrap for them, and we were basically slowing them down from launching products.
This slide captures some of the learnings we had and the kinds of developer friction we ran into. Since we didn't provide a way to bootstrap workers, users had to set up the same infra again and again; product folks had to write infrastructure code before they could write any meaningful Temporal workflow. Operational issues were hard to triage. As you all know, there's a lot going on in Temporal: you have workflow history, worker metrics, Temporal Cloud metrics, SDK metrics, and so on. It's very hard to stitch everything together, and that made incident triage slow for developers. And workflows change daily. Users don't always pay attention to what code they are changing, so they run into nondeterminism issues, and we had to work with them to make replay testing available and safe to use. So the takeaway here is that the platform had to make Temporal's safe path obvious, not just possible.
The next part of the platform journey was questioning ourselves: now that we have seen the friction users face and the support requests coming in, what should we do? How can we automate a bunch of things so that the user's time to first workflow is minimized?
So we started working on this paved road. The first part of that paved road was managed workers. We introduced a simple tool for product developers, or anyone who wants to use Temporal. It's called gen temporal manage worker.
When they run it, it asks for a few inputs: name, owner, tier, namespace, and retention. Basically, the name of your agentic workflow, any logical namespace you want in Temporal, and the retention you want. When they hit enter, all of the generated specs for the worker service, SDK packages, golden images, the build files, and worker configs are automatically generated for them.
So the tool made the infrastructure path repeatable for them: they didn't have to think about how to write worker Kubernetes specs every time they wanted to write a new workflow.
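A generator of this kind can be sketched as a function from the few inputs to a set of generated files. The file paths, field names, and the `WorkerRequest` and `generate_worker_files` names below are assumptions made for illustration; the talk does not show the real tool's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerRequest:
    """The inputs the talk says the tool asks for."""
    name: str
    owner: str
    tier: str
    namespace: str
    retention_days: int


def generate_worker_files(req: WorkerRequest) -> dict[str, dict]:
    """Return hypothetical file paths mapped to generated contents."""
    base = f"workers/{req.name}"
    return {
        f"{base}/namespace.json": {
            "namespace": req.namespace,
            "retention_days": req.retention_days,
            "owner": req.owner,
        },
        f"{base}/worker.json": {
            "name": req.name,
            "tier": req.tier,
            "namespace": req.namespace,
            "owner": req.owner,  # later usable for alert routing
        },
    }
```

Because every worker comes from the same template, a platform team can change the template once and regenerate the fleet, which is the property the speaker returns to later.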
The other part of the paved road was managed workflows. Today, if someone has to write a new workflow, they need to remember to register that workflow in their worker at bootstrap. Oftentimes people miss that. Then: where do I deploy my workers? Where do I have capacity? What queues should I choose? How do I separate those things out? Should I be running workflows and activities on the same queue? There are many issues around that, and as you run them in production you have to think about many different things to make them scale. So we wanted to establish a clear boundary here: the product team only thinks about the application code. They are only responsible for writing the workflows and activities, and the rest is with the platform team. Because we are a Python shop at OpenAI, we made use of the decorators already provided by the Temporal Python SDK. We extended them to let users define their task queues right there, along with the namespaces and the workers they want to run in, so they can write everything in the code itself. Once that code is committed, we generate the config, which splits out the workflow specs and the worker specs for them. That ties down what the workflow name is, what queue it runs on, and, for that queue, which worker is responsible. So when you deploy your worker, these configs are generated to bootstrap the worker code. It ties down all the queues and all the workflows, so you don't need to worry about those things. It automatically runs for you.
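The decorator-to-config idea can be sketched without the Temporal SDK. The decorator `managed_workflow`, the registry, and the spec shapes below are hypothetical stand-ins; the real extension wraps the Temporal Python SDK's own decorators, which this sketch does not reproduce.

```python
from collections import defaultdict

_REGISTRY: list[dict] = []


def managed_workflow(*, namespace: str, task_queue: str, worker: str):
    """Hypothetical decorator: records where a workflow class should run."""
    def wrap(cls):
        _REGISTRY.append({
            "workflow": cls.__name__,
            "namespace": namespace,
            "task_queue": task_queue,
            "worker": worker,
        })
        return cls  # the class itself is left unchanged
    return wrap


def generate_configs(registry=None):
    """Split registrations into a workflow spec and a worker spec."""
    entries = _REGISTRY if registry is None else registry
    workflow_spec = {
        e["workflow"]: {"namespace": e["namespace"], "task_queue": e["task_queue"]}
        for e in entries
    }
    queues_by_worker = defaultdict(set)
    for e in entries:
        queues_by_worker[e["worker"]].add(e["task_queue"])
    worker_spec = {w: sorted(q) for w, q in queues_by_worker.items()}
    return workflow_spec, worker_spec


@managed_workflow(namespace="image-gen", task_queue="image-gen-workflows",
                  worker="image-gen-worker")
class GenerateImageWorkflow:
    """Placeholder for application code; business logic would live here."""
```

Calling `generate_configs()` after import yields one spec keyed by workflow name and one keyed by worker, so registration cannot be forgotten at worker bootstrap.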
This is the architecture that made the paved road possible. On the left are the product services: the workflow code, the signal handling, all the business logic they need to run ChatGPT. On the right you see Temporal Cloud, which has all the namespaces, task queues, durability, event history, and everything. And in the middle is the platform layer, which our team provided.
I already talked about the managed worker and workflow configs. We also wrote a proxy. It's written in Go, and it is the entry point for all requests and responses before we communicate with Temporal Cloud. The reason we added this proxy in between was to help us add more authentication and make sure routing is done. Your workers could be running in the EU or the US, you may want data residency, and many other things, and we don't want the application code or the worker code to think about that. So when a request comes in, we figure out which service and which cluster it's coming from, work out the routing for you, and put it in the right namespace in Temporal Cloud.
Then we have the Temporal operator, which I'll discuss in the next slide. The workers and autoscaling are also part of that platform layer. As I said, we make sure the worker specs are generated for you, and we help them scale based on metrics.
Since we are generating the worker specs, it's operationally easy for us to keep them common across the whole fleet and to introduce new things or remove things as needed.
And there's of course the shared support layer across the product services, platform layer, and cloud: secrets, encryption, and observability, including the Temporal UI, dashboards, traces, and everything.
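The routing decision can be sketched as a lookup from caller identity to a physical namespace. The real proxy is written in Go; Python is used here only for illustration, and the tables, the naming scheme, and `resolve_namespace` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    service: str
    cluster: str


# Hypothetical lookup tables; a real proxy would load these from config.
CLUSTER_INFO = {
    "prod-us-1": ("prod", "us"),
    "prod-eu-1": ("prod", "eu"),
    "staging-us-1": ("staging", "us"),
}
SERVICE_NAMESPACES = {"image-service": "image-gen"}


def resolve_namespace(caller: Caller) -> str:
    """Pick the physical namespace from who sent the request and from where."""
    if caller.service not in SERVICE_NAMESPACES:
        raise PermissionError(f"service {caller.service!r} has no namespace access")
    if caller.cluster not in CLUSTER_INFO:
        raise LookupError(f"unknown cluster {caller.cluster!r}")
    env, region = CLUSTER_INFO[caller.cluster]
    return f"{SERVICE_NAMESPACES[caller.service]}-{env}-{region}"
```

Keeping this decision in one place means application and worker code never hard-code a region or environment.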
So this is the Temporal operator I briefly mentioned in the previous slide. That's the control loop for us, which takes in the configs I talked about in the previous few slides: the namespace config, the workflow config, and the worker config. Once these configs are merged into the repo, they get deployed as ConfigMaps into Kubernetes. Once deployed, the Temporal operators sitting in all the clusters reconcile the state, watching for config changes and producing the artifacts, which are the actual resources needed to run your workloads. The Temporal operator looks at the namespace config and creates the namespaces in Temporal Cloud, or, if the namespace is already there and somebody changed the retention, it updates that for you as well. It also establishes the identity and permissions to run your workflows and workers. As I mentioned in the previous slide, our proxy is able to enforce identity and auth for you. So somebody can write extra RBAC roles, our own custom RBAC roles, which we introduced to make sure that teams are not able to see someone else's workflows or namespaces, or to control who within a team is able to decode the data in the UI. All of that is enforced through our proxy, and those are the identity and permissions that the Temporal operator establishes for you. And of course there are the Kubernetes resources, like the worker pods and which clusters you want to run in. That's also defined in the config, and the operator takes care of creating the service accounts for you.
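The reconcile step for namespaces can be sketched as a diff between desired and actual state. This is a simplified illustration, not the operator's code: `plan_namespace_changes` is a hypothetical name, state is reduced to a retention value per namespace, and the delete action reflects the config-driven cleanup the speaker describes later in the Q&A.

```python
def plan_namespace_changes(
    desired: dict[str, int], actual: dict[str, int]
) -> list[tuple[str, str]]:
    """Compare desired retention (days) per namespace with actual state.

    Returns a list of (action, namespace) pairs for the operator to apply.
    """
    actions = []
    for name, retention in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != retention:
            actions.append(("update_retention", name))
    # Anything that exists but is no longer in the config is drift to clean up.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions
```

Running this repeatedly is safe: once actual state matches the config, the plan is empty.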
All of this could actually be done in Terraform, but at our scale Terraform was making us slow, and we had to think about how to move away from it so that product folks don't need to think about infrastructure and it is handled only by the platform team. The only way possible for us at that point was for the product team to think only about the application code, and for us to work out how to translate that into Kubernetes resources and many other things.
This architecture had a good payoff: it established a clear boundary for us. There were enforced controls at the platform layer, as I talked about: identity, routing, security, access, payloads, and ownership. If we didn't enforce all of these at the platform layer, every consumer would have to think about them themselves, and there would be no easy way to have good security around what the workflows are doing and who is starting them. As you all know, in agentic workflows it's easy for one task to interact with another agent, and so on, and we wanted better enforcement when that happens.
The benefit here was that no worker was able to connect directly to Temporal Cloud. Even to submit a workflow, somebody has to go through the proxy. We even added our own wrapper over the CLI so that all identity and authentication happens through the proxy for CLI requests going to Temporal Cloud.
So now that we have automated a bunch of things, are we ready to launch this in production? We covered day zero: the user has written the workflow and things are set up. But are we ready for production?
There are a bunch of other things we need to take care of here, covering observability, launch readiness, dashboards, and so on. The first part is observability. The Temporal UI itself is really good at giving you a deep history of workflows: how your activities are doing, whether they are failing, whether child workflows are failing, whether you are missing heartbeats, and so on. But what we saw, and I called this out in a previous slide too, was that we didn't have a way to tag along the metrics, logs, and traces from there. Even though we were using Temporal Cloud, we were self-hosting the UI ourselves. We didn't let application developers access the Temporal Cloud UI directly. Self-hosting the UI gave us better SSO controls and better RBAC policies, our own RBAC policies according to OpenAI standards. And since we were already self-hosting at this point, we were able to patch the UI directly. As you can see in the workflow history, we have Datadog links and trace links directly in it. So from the workflow page you can just click on a link that takes you directly to the traces, where you can see what the long-running activities and workflows are doing, and even the history of logs in exact sequence. All of these were instrumented by the SDK wrappers and patched directly into the Temporal UI.
Now that you have the UI, what about the metrics? As a platform team, we provided dashboards to monitor the health of workflows and activities. This is a screenshot of our internal dashboard. It shows the health of your workflows and activities and whether you are running into any errors. It might seem like your worker is healthy, but your workflow can still be stuck. Your namespace is healthy, but your task queue backlog might still be growing. So how can we dig deeper into these things? We provided a dashboard so that someone can filter down to their namespace or their workflow type and drill down into exactly what's happening in the system. And since we were generating a bunch of configs for them, it was easy for us to create automatic alerts and tag their on-call directly. As you saw when I described generating the config, we ask for the owning team. That's basically a hook for us to know whom to alert when something fails in that namespace. So it was easy for us to automatically create alerts for them and page them.
The third operational part is worker scaling. It's very easy and tempting to increase the number of worker replicas if you see the task queue growing, but increasing replicas is not always what you want. It might be that the workers are running with too few slots, or that you have not tuned the workers correctly, or there might be CPU starvation that you want to profile. So in the autogenerated worker specs we provided a way to scale your workers based on these signals. We instrumented the pollers and the worker slots as a way to know whether your worker is starving or not, and recently we also started using the poller autoscaling feature that Temporal Cloud natively provides. It was very easy for us to integrate that into our ecosystem across 100-plus workers, just because we were generating these specs for everyone: we could change one of the templates we generate from, and it was adopted for everyone.
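The point that a growing backlog does not always call for more replicas can be sketched as a small decision function. The signal names, the threshold, and the mapping from signals to actions are assumptions for illustration; the talk does not spell out the actual rules.

```python
def scaling_advice(
    backlog: int,
    free_slots: int,
    cpu_utilization: float,
    cpu_hot: float = 0.85,
) -> str:
    """Suggest a next step from three worker signals (hypothetical heuristic)."""
    if backlog <= 0:
        return "no_action"
    cpu_is_hot = cpu_utilization >= cpu_hot
    if free_slots == 0:
        # All slots busy: more replicas only help if the CPU is the limit;
        # otherwise each worker can simply be given more slots.
        return "add_replicas" if cpu_is_hot else "increase_slots"
    # Slots are free yet work is waiting: either the CPU is starved or
    # the worker is not polling fast enough.
    return "profile_cpu" if cpu_is_hot else "increase_pollers"
```

For example, a backlog with no free slots and low CPU suggests tuning slot counts rather than paying for more replicas.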
Now that you have all the observability, are you ready for launch? At OpenAI, any time a new product needs to launch, we have launch readiness huddles where all the different infra teams sit together and ask, for every piece of infra: are we scaled enough? Do we need anything else? For Temporal, what that specifically means is: are the namespaces we are launching this new workload or use case on scaled to the expected APS or RPS? Do we have the expected capacity to scale workers? Do we have enough namespace headroom? Are the workers spread across different regions and clusters? And what about the underlying dependencies, such as GPUs: do you have enough throttling? What happens if a GPU or a storage engine fails? Is your workflow code written correctly, with enough retries to make sure your workflow eventually succeeds? So for big product launches, we end up doing a load test a few weeks before launch. That helps us account for the APS and RPS limits we need to get increased. And we work closely with the Temporal engineering team when we are doing those load tests so that they can also watch metrics on their side, make sure they're not seeing anything different from what we are seeing, and move us across different cells as needed.
So, talking about the developer outcome we wanted here: when we started the platform journey, it used to take one to two weeks to adopt Temporal, or even to start writing a workflow.
Users had to write Terraform, configure secrets and encryption, deploy workers, scale them, and monitor their own signals. Now, with managed workers and workflows and the platform boundary in place, it's very easy, and everything happens within a day. Users just decorate their workflows and activities as part of writing the business logic. Once that's merged, the configs are generated, our operator picks them up and spins up the resources for you, and you have your workflow deployed and running in production within a day. This is how we are able to handle the scale I showed you previously, the 60x growth. Without all of this, it would not have been possible to get to this state and launch product use cases.
Cool. So the main lesson I want to leave you with is that durable execution rewards platform discipline.
The more critical the workflows become, the more important those platform pieces are. You end up having different teams with different levels of Temporal expertise in your company, and they have different expectations. What you really want is a platform that users can trust and depend on, without worrying about those infra pieces themselves. So if I were to summarize this whole journey in one sentence, it would be: Temporal makes durable execution possible, but the platform path makes it repeatable. Thank you.
All right. Thank you very much.
We'll take questions now. If you are on this side of the room, I ask that you please raise your hand and I'll come find you. If you're on that side of the room, please line up behind the mic in the aisle and we'll alternate. So, any questions here?
>> I'll go. First off, very interesting talk on managed workflows. I have a two-part question. First, do you use on-demand or provisioned capacity for Temporal? And second, what's the peak burst usage you see in APS, and do you ever hit the limits?
>> We don't use on-demand. We specifically switch it off for many of our namespaces. Once a namespace grows beyond an APS limit, we don't want it to keep growing, because it's also expensive for us, and we want to make sure users are not running backfill jobs that pollute other workflows in that namespace. We actually enforce rate limits through our proxy for such cases too. So we do use provisioned capacity, but not beyond a limit. That's when we look at our own dashboards and alerting, and rely on those to tell us when we should ask for more capacity and how much more we want. In terms of the burst, I can't really disclose those numbers, but yes, we do see those bursts happening all the time.
>> All right thank you. Thank you.
>> All right we'll take a question over here next.
>> Great talk, thank you. I'm curious how many engineers it took to manage such a platform. You started in early 2025, and the growth was obviously huge, over a thousand workflows now. What does it take?
>> We started with two engineers, including myself, and today we only have a team of four managing everything.
>> All right. Do we have questions over here next?
>> Yeah, kind of a more specific question, but I noticed on your slide you said you had like 700 namespaces but only about 100 workers.
>> Are you doing something like one worker for multiple namespaces?
>> Yes and no. What happens is that all of these namespaces span staging and production. When a user has to create a namespace, they only provide a logical name to us, and we create multiple actual physical namespaces from it: one for every region the workers need to be in, for both the staging and production environments. So it just expands in that way, and that's why you only see 100 workers.
>> Do you work with the teams to provide guidance on when they should be spinning up new workers?
>> We always recommend using a new worker for any new use case they come up with, and likewise a new namespace whenever they need to spin up a new use case, because it's very simple on our side to spin these things up. Before, when it used to take two weeks, we would say, okay, maybe for testing you can still use the existing one, but then you would run into issues in production.
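The expansion from one logical name to several physical namespaces can be sketched as a cross product of environments and regions. The naming scheme and the function `physical_namespaces` are assumptions for illustration; the talk does not give the real naming convention.

```python
from itertools import product


def physical_namespaces(
    logical: str, environments: list[str], regions: list[str]
) -> list[str]:
    """Expand one logical namespace into one physical namespace per (env, region)."""
    return [
        f"{logical}-{env}-{region}"
        for env, region in product(environments, regions)
    ]
```

With two environments and two regions, each logical name becomes four physical namespaces, which is how the namespace count can run well ahead of the worker count.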
All right, next question over here.
>> Hi. Really impressive progress. I wonder, are there any hot spots or issues you still notice in your system, and how do you think the system will evolve in the next six months?
>> With the growth we have had, we had to continuously keep scaling our proxy to meet the demand: run across more clusters and keep scaling based on the different request patterns we are seeing. As for the next six months, I expect the 60x I mentioned to maybe double. It's continuously growing. Right now, once a team sees that something was a fit for Temporal, they want their next agentic workflow on Temporal too. So it's continuous growth. Every day we keep seeing more and more workflows and namespaces spinning up, and I only see it doubling or maybe tripling.
>> Okay, next question from the aisle.
>> Hi, thank you for the great talk. I have a question around security. Do you do anything to prevent sensitive data from flowing over the network, like encryption, or how do you handle that?
>> Our proxy does the encryption. Any time a request goes through the proxy, it encrypts all the sensitive payloads before they go to Temporal Cloud, and similarly on the response side it decrypts them before sending them back to the worker.
>> Got it. And so you do this for every message, event, heartbeat, any kind of...
>> Any payload field that is part of the request or response in the Temporal gRPC proto protocol. I believe that excludes search attributes and workflow IDs.
>> Got it. That makes sense. Thank you.
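The field selection described in this exchange can be sketched as follows. The request shape and `transform_request` are hypothetical, and `base64` is used purely as a reversible placeholder: it is NOT encryption, and a real proxy would plug in an authenticated cipher with managed keys.

```python
import base64
from typing import Callable


def transform_request(request: dict, encode: Callable[[bytes], bytes]) -> dict:
    """Return a copy of the request with only the payload fields transformed.

    workflow_id and search_attributes are copied through unchanged, matching
    the speaker's note that those are excluded.
    """
    out = dict(request)
    out["payloads"] = [encode(p) for p in request.get("payloads", [])]
    return out


# Placeholder transforms for the sketch only; base64 provides no secrecy.
placeholder_encrypt = base64.b64encode
placeholder_decrypt = base64.b64decode
```

The same function applied with the decrypt transform on the response path restores the original payloads before they reach the worker.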
>> All right, we'll take next question in the back here.
>> Thanks, great talk. Ned from Instabase here. I'm curious what the thought process was behind going with Temporal Cloud versus self-hosting. Or did you start self-hosting and then move to Temporal Cloud?
>> When we were starting, we were evaluating both self-hosting and cloud, but we saw big growth coming toward us, with a lot of agentic workflows just waiting to launch on Temporal. Given the engineering bandwidth on our side at that point, and good feedback from customers, we went with Temporal Cloud. We did talk with different Temporal Cloud customers to make sure it actually scales well for their workloads, and that feedback informed the choice. This decision has been really good for us.
>> All right. Next question from the aisle.
>> Hi. Uh thank you for the great talk. Uh the most outstanding aspect for me was when you said just now that you are a team of four people. So can you tell me like what are the decisions in the past uh of this entire journey which you think were the most pivotal from an operation standpoint and a technical standpoint like I find like you you talked about creating your own wrapper over the CLI. you talked about creating uh you know decorators based workflows rather I mean the developer workflow rather than going terraform. So I think these are really good uh you know things of wisdom we can take back but can you can you point them out for us? Thank you. Uh like basically are you asking like why we move from one to the other or >> or what were those things not not why is pretty clear but what were those things uh which you which you found like you could support like such a big uh use case uh you know domain with just four people. So yeah, this like the managed workflows and workers were like the pivoting point for us in that like whole journey like like we were still using Terraform and like deploying worker specs manually like our team like at that point maybe three people were like helping like embedding themselves into the product team making their workers bootstrapped like making sure the launch succeeds like we were there in the launch room with them and it was just not scaling like with so many like just three people like there's so much overhead on the platform team itself and that's what led to us like rethinking like what are the common support problems like we are seeing from our customers how can we like make that automated and that's what how we went into this managed like we had this idea about managed workflows and workers and we proceeded with that >> cool thank you >> all right next question here in the back >> hi um Michael from Unifi here thanks for the great talk um I have kind of a multi-art question one on scalability and the next on sort of 
operations. So it seems like you made a lot of great strides in terms of dev velocity uh with managed workflows but uh as workloads change and people edited the workflow code for example how did you guys think about tuning your workers and sort of what was the process to figure out what lever to pull in certain situations? Um that's part one and part two what did the sort of operational handoff look like between the product and the platform team when it came to those scalability issues? Uh okay. So for the first part uh like so when we were establishing workers for them like uh we wanted to make sure they don't like worry about scale at all like like that's something we can handle like make sure uh like we are able to like tune the name spaces like the task use like we provide a lot of guidance like automated code reviews based on AI like we have our own uh developer like doc like what are the best practices to use for temporal workflow like those are embedded into our peer views like and uh like if somebody's like trying to put both workflow and activity code on the same task view it will go pinpoint oh you need to split those things so naturally at the point when they are like already just writing the workflow code to begin with we'll just stop them there and make sure that they are following the best practices before they see this issue in uh production but of course there are few people like still like go ahead merge that code and run into issues at that point we have to like continuously monitor today like how are the workers doing uh like we monitor based on backlog are they like uh scaling well is there a backlog growing are they running hot on CPU so once we see those issues we'll actively or proactively reach out to the customers that they're on call work closely with them to figure out what's the issue resolve their uh things in code and make sure things uh yeah they're able to scale >> all right uh thank you for their very insightful talk and also Oh uh 
it looks very impressive that the adoption time went from one to two weeks down to one day. That's great. But I have a related question. It means more and more people can easily create workers and Temporal workflows, which sounds great. On the other hand, as more and more products are released, at some point some of these workflows won't be needed anymore. In that case, do you provide any cleanup tools, or does each product team need to own those workers and clean them up later?
>> So it all happens naturally through the configs. As I mentioned, once you remove the references, the configs are regenerated to say that you don't need these workflows, task queues, and namespaces anymore, and our Temporal operator catches that drift and does the cleanup for you.
>> Oh okay thank you very much.
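The config-driven cleanup described in that answer follows the usual reconcile pattern: compare what the generated config declares with what actually exists, and delete the difference. The following is a minimal sketch of that comparison only. The `Resource` type, the resource names, and `plan_reconcile` are hypothetical and are not taken from the talk or from any Temporal API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A managed resource. `kind` is e.g. 'namespace' or 'task_queue' (hypothetical)."""
    kind: str
    name: str


def plan_reconcile(desired: set[Resource], actual: set[Resource]) -> dict[str, list[Resource]]:
    """Return what to create and what to delete so that `actual` matches `desired`."""
    def order(r: Resource) -> tuple[str, str]:
        return (r.kind, r.name)

    return {
        "create": sorted(desired - actual, key=order),  # declared but not running
        "delete": sorted(actual - desired, key=order),  # running but no longer declared
    }


if __name__ == "__main__":
    desired = {
        Resource("namespace", "payments"),
        Resource("task_queue", "payments-activities"),
    }
    # A task queue whose config references were removed still exists in the cluster.
    actual = desired | {Resource("task_queue", "legacy-export")}
    for resource in plan_reconcile(desired, actual)["delete"]:
        print(f"would delete {resource.kind} {resource.name}")
```

A real operator would run this comparison in a loop and apply the plan through the cluster and Temporal APIs; the sketch stops at computing the plan.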
All right, next question over here.
>> Yeah, thanks for a great talk. I'm very interested in the observability part that you mentioned.
We have very similar needs, things like showing how big the backlog is and what the average activity time is. Can you go deeper? Do you query Temporal's Postgres DB to get that information, or do you maintain it separately?
>> So yeah, Temporal natively provides these metrics. They have two sets of metrics. One is the worker SDK metrics, which come embedded in the SDK and are ready to use. The other is the cloud metrics, which you have to actually scrape. They provide a Prometheus endpoint that you can scrape, and those metrics include actions per second, requests per second, and many other things that are only visible to Temporal, not to us. So we have a service that continuously polls that endpoint, scrapes it, and pushes those metrics to our observability platform.
>> So do you think this could be open sourced or contributed back?
>> Temporal actually open sources sample code for this. That's how we ended up adopting it.
>> Perfect. Okay. Thanks.
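The scrape-and-forward service mentioned in that answer can be sketched roughly as below. This is not the sample code the speaker refers to. It assumes the endpoint returns the Prometheus text exposition format, which the talk does not confirm. The metric names are made up, and `fetch` and `push` are hypothetical hooks standing in for the HTTP client and the observability platform.

```python
from __future__ import annotations

import re
from typing import Callable, Iterator, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: dict
    value: float


# metric_name{label="value",...} 12.5   (an optional trailing timestamp is ignored)
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> Iterator[Sample]:
    """Yield one Sample per metric line, skipping comments and blank lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # not a metric line; ignored in this sketch
        name, raw_labels, value = match.groups()
        yield Sample(name, dict(_LABEL.findall(raw_labels or "")), float(value))


def scrape_once(fetch: Callable[[], str], push: Callable[[list[Sample]], None]) -> int:
    """Fetch one scrape, forward the parsed samples, and return how many were sent."""
    samples = list(parse_exposition(fetch()))
    push(samples)
    return len(samples)


if __name__ == "__main__":
    # Made-up payload; real metric names and labels will differ.
    payload = (
        "# HELP example_actions_per_second hypothetical metric\n"
        'example_actions_per_second{namespace="payments"} 12.5\n'
        "example_requests_per_second 40\n"
    )
    sent = scrape_once(lambda: payload, lambda samples: print(samples))
    print(f"forwarded {sent} samples")
```

In a real service, `fetch` would be an authenticated HTTP call on a timer and `push` would write to whatever metrics backend is in use.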
>> All right. Next question from the
>> Hi, thank you very much. Nice talk. My question is related to the operations monitoring health dashboard that you showed us.
Would you use that neural logs monitoring, or do you have built-in logs there?
>> Built-in logs as in?
>> Built-in, self-hosted monitoring, or
>> That I can't speak to.
I can't disclose things on that front. Sorry.
>> All right. Next.
>> Thank you. Next question over here.
>> Hi. Great talk. You started by saying the problem statement was that the infra was not able to scale because image generation went viral.
I wanted to understand how the scaling journey has been on Temporal Cloud. You are solving the dev velocity problem here, but the clusters on Temporal Cloud still need to scale.
>> So we work very closely with the Temporal engineers. We have weekly meetings with them, and we have 24/7 support from them if needed. They're really good at handling these issues. For image gen, when we started we were not on Temporal, and when we decided to move to Temporal it was a Saturday or Sunday, early morning around 3 or 4, because we were handling all the infra. That's when we actually paged Temporal: can we get capacity to move all the image gen workloads we are seeing to Temporal? The Temporal team offered help and worked with us to move those workloads. And not just at that point. Since then we have been scaling well. We actually had two more peaks bigger than what we saw during the launch, and nobody saw any flakiness on the infra. Temporal Cloud has been working well and scaling well for us.
>> Thank you.
>> All right, we'll take the next question from the aisle.
>> Hi, great talk. I had a question about the Troxy. With the introduction of Troxy, all the starters are now talking to your proxy, right? Not talking directly to Temporal, correct?
>> Via client.
>> So what kind of scalability challenges have you seen with the introduction of an additional layer like that? And another question: are you planning to open source the Temporal operator or anything else?
>> Yeah, for the Troxy: when we didn't have the proxy, we were actually seeing issues, because we use Python workers and Python uses a lot of compute. You have to spin up a lot of workers, and we were making a lot of connections to Temporal Cloud, at which point Temporal Cloud was not able to support that many connections, and they had to keep scaling their network layer for us to keep increasing workers. So the introduction of the proxy actually helped us scale and keep adding more and more workers without worrying about those limits on the Temporal Cloud side. We were able to do connection pooling and manage those things at the proxy. The proxy is written in Go, which is pretty efficient, so we don't really need to run that many replicas of it per cluster. I would say the max we run is about 100 replicas, not more than that.
And you had the question about the operator.
We haven't thought about open sourcing anything yet for that.
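The talk gives no implementation details for the proxy beyond connection pooling and the fact that it is written in Go. As a rough illustration of the fan-in idea only, the sketch below funnels requests from many workers through a small fixed set of upstream connections, chosen round robin. `FakeUpstream`, `FanInProxy`, the pool size, and the selection policy are all assumptions made for this example.

```python
from __future__ import annotations

import itertools
from typing import Callable


class FakeUpstream:
    """Stand-in for one long-lived connection to the upstream service (hypothetical)."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.sent = 0

    def send(self, request: str) -> str:
        self.sent += 1
        return f"upstream-{self.index} handled {request}"


class FanInProxy:
    """Share a fixed pool of upstream connections across any number of callers.

    Single-threaded sketch: a real proxy would need concurrency control,
    health checks, and reconnection logic.
    """

    def __init__(self, connect: Callable[[int], FakeUpstream], pool_size: int) -> None:
        # Upstream connection count is fixed here, independent of worker count.
        self.upstreams = [connect(i) for i in range(pool_size)]
        self._next = itertools.cycle(self.upstreams)

    def forward(self, request: str) -> str:
        # Round robin over the pool instead of opening a connection per worker.
        return next(self._next).send(request)


if __name__ == "__main__":
    proxy = FanInProxy(FakeUpstream, pool_size=4)
    for worker_id in range(1000):
        proxy.forward(f"poll-from-worker-{worker_id}")
    print(len(proxy.upstreams), [u.sent for u in proxy.upstreams])
```

The point of the design is that adding workers increases load on the pooled connections but does not increase the number of connections the upstream has to accept.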
>> All right. Thanks. All right. We've got one more in the back here.
>> Hello. Thank you for the great talk.
My question is about workflow versioning. As you know, versioning and determinism are a big problem when maintaining existing workflows, and as a platform you want to make versioning easy. Could you provide more details on how you're handling that? Do you provide anything to users to make it easy and to make sure workflows don't fail across different versions?
>> So we don't use the worker versioning feature that Temporal natively provides, mostly because it doesn't work well with the deploy system we use at OpenAI. But we try to add more guardrails. We spin up replay testing for teams: as part of spinning up new workflows, we continuously download existing histories and run replay tests as part of their CI checks. That shows the workflows are backwards compatible. On the other front, users can change inputs and outputs as well.
And with Python it's easy to just remove a field from your object and cause Pydantic issues, backwards-incompatible issues. So what our team did there was introduce golden schema files. Whenever somebody has to write a new input or output, they write it at a specific location and we generate the schema files for them. If they make a change, the CI catches it there and then, saying you made a backwards-incompatible change and you need to fix it before you can merge the code.
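A golden schema check of the kind described can be sketched as a comparison between the committed schema and the one generated from the current code. The schema shape used here, a mapping from field name to `{"type": ..., "required": ...}`, and the three rules are assumptions for illustration; the talk does not describe the actual file format or rule set.

```python
from __future__ import annotations


def breaking_changes(golden: dict, current: dict) -> list[str]:
    """List changes that would break payloads recorded under the golden schema."""
    problems = []
    for name, spec in golden.items():
        if name not in current:
            # Old histories still carry this field; strict models may reject it.
            problems.append(f"removed field: {name}")
        elif current[name]["type"] != spec["type"]:
            problems.append(
                f"changed type of {name}: {spec['type']} -> {current[name]['type']}"
            )
    for name, spec in current.items():
        if name not in golden and spec.get("required", False):
            # Old payloads do not contain the field, so they fail validation.
            problems.append(f"new required field: {name}")
    return problems


if __name__ == "__main__":
    golden = {
        "user_id": {"type": "string", "required": True},
        "prompt": {"type": "string", "required": True},
    }
    current = {
        "user_id": {"type": "integer", "required": True},   # type changed
        "style": {"type": "string", "required": True},      # new required field
        "seed": {"type": "integer", "required": False},     # optional: compatible
    }
    for problem in breaking_changes(golden, current):
        print(problem)
```

A CI job would fail the build when the returned list is non-empty. Adding an optional field passes, which matches the usual rule that old payloads must still validate against the new model.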
>> All right we'll take one last question here from the aisle.
>> Hi, thanks for the great session. My question is more related to the telemetry. Once you release your deployments for consumption, how do you persist that adoption-based consumption history? Where are you persisting it? It is a huge amount of data that is going to be generated once it is released. So what type of persistent storage mechanisms are you leveraging?
>> We are not using any different observability platform. We're just using whatever is available in the market. But I can't disclose which observability platform it is.
>> Any kind of real-time decisions and analytics to track the worker metrics?
>> We don't do much analytics yet. It's kind of manual on our front, for our team. Or not manual, I would say: we have alerting based on the different signals we want to catch, and based on that we are able to proactively catch issues. We don't yet have any analytics on how different workflows are doing over time, but we do plan to extend that as the team grows.
>> Okay. Yeah. Thank you.
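The signals named in the Q&A, a growing backlog and workers running hot on CPU, lend themselves to simple rule-based alerting. The sketch below is one possible shape for such rules. The thresholds, field names, and the "strictly increasing over the last N samples" definition of growth are all assumptions, not details from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WorkerSnapshot:
    task_queue: str
    backlog: list[int]        # recent backlog samples, oldest first
    cpu_utilization: float    # 0.0 to 1.0, averaged across the worker pool


def find_alerts(
    snapshot: WorkerSnapshot,
    cpu_limit: float = 0.85,    # hypothetical threshold
    growth_samples: int = 3,    # hypothetical window
) -> list[str]:
    """Return alert messages for one task queue; an empty list means healthy."""
    alerts = []
    recent = snapshot.backlog[-growth_samples:]
    growing = len(recent) == growth_samples and all(
        earlier < later for earlier, later in zip(recent, recent[1:])
    )
    if growing:
        alerts.append(f"{snapshot.task_queue}: backlog growing {recent}")
    if snapshot.cpu_utilization > cpu_limit:
        alerts.append(
            f"{snapshot.task_queue}: workers running hot on CPU "
            f"({snapshot.cpu_utilization:.0%})"
        )
    return alerts


if __name__ == "__main__":
    snapshot = WorkerSnapshot("image-gen", backlog=[10, 40, 90], cpu_utilization=0.92)
    for alert in find_alerts(snapshot):
        print(alert)
```

Alerts like these would be routed to the platform on-call, who then contacts the owning team, which is the handoff described earlier in the Q&A.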
>> All right. Can we get another round of applause for Shat?