Platform Engineering meet-up - May 9 2026
Added: 2026-05-11

431 views112:00:14HasGeekTVOriginal Release: 2026-05-10

A scalable distributed job scheduling system can be built using a shared-nothing architecture with HBase for durable storage, ZooKeeper for leader election and coordination, and RabbitMQ for asynchronous messaging. Jobs are partitioned and each partition is assigned to a single instance so that only one instance picks up a given job. Delivery is at-least-once, and becomes effectively-once when the downstream client deduplicates with an idempotency key.

[00:13:48]It stores all the keys in sorted order, and it is a wide-column database. So we use HBase, RabbitMQ and ZooKeeper as the building blocks for Clockwork.

[00:14:10]Let's look at the architecture from a very high-level perspective. On the left you will see all the services that interact with Clockwork.

[00:14:19]We have HBase as our durable store, RabbitMQ as our messaging layer for any kind of asynchronous publisher-subscriber operation, and ZooKeeper as the brain, or coordination layer. I will cover each aspect of the architecture.

[00:14:36]But this is what the high-level picture looks like.

[00:14:42]Let's look at the first module, the job acceptor. This is a lightweight module. It is responsible for accepting an incoming job request, storing that job in HBase, which is our durable store, and returning an acknowledgement to the client. This is a very fast operation, generally completed with single-digit millisecond latency, because all we need to do is store the job details in HBase and give an acknowledgement.

[00:15:14]Second is our leader elector. This is where things get tricky, because once you write a job to HBase, somebody needs to figure out which instance is responsible for that job. Assuming there are multiple instances of Clockwork running, we want one instance to pick up that job, not all of them. Why? Because if multiple instances pick up the same job, it can lead to multiple callbacks to the downstream, which is not the requirement. We want effectively-once. So we follow a shared-nothing architecture. By shared-nothing I mean we logically divide the job range into subsets: we partition the job range by assigning partitions to a client. Since Clockwork is multi-tenant, partitions are assigned per client. Let's say a merchant client comes up; during onboarding we say, hey merchant, you will be getting 16 partitions. Each job is then mapped to one partition by simply doing a hash modulo the partition count. These partitions are purely logical; they have nothing to do with any database partitioning.

[00:16:31]It is just a column in HBase saying which partition the job is mapped to.
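The hash-modulo mapping described above can be illustrated with a minimal Python sketch. The talk does not say which hash function Clockwork uses, so SHA-256 is an assumption here, as are the names `partition_for`, `PARTITIONS` and the job fields.

```python
import hashlib


def partition_for(job_id: str, partition_count: int) -> int:
    """Map a job to a logical partition with a stable hash modulo."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer so the result is the
    # same in every process (Python's built-in hash() is salted per run).
    return int.from_bytes(digest[:8], "big") % partition_count


# Hypothetical onboarding data: the "merchant" client was given 16 partitions.
PARTITIONS = {"merchant": 16}

job = {"client_id": "merchant", "job_id": "job-12345"}
# The partition is stored alongside the job, like the column mentioned in the talk.
job["partition_id"] = partition_for(job["job_id"], PARTITIONS[job["client_id"]])
```

Because the mapping depends only on the job ID and the partition count, any instance can recompute it without coordination.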

[00:16:37]Now, for each partition we choose an instance as a leader using ZooKeeper. ZooKeeper, as I mentioned earlier, is our coordination layer. It brings coordination, or consensus, to our distributed system by deciding which instance is responsible for which partition. Let's say we have three instances running. Every instance will try to write a znode, that is, a path in ZooKeeper, saying "I want to be the leader for, let's say, partition one", but only one of them can succeed. The remaining ones, which have failed, set up a watch: if, let's say, instance one dies, they get a notification saying, hey, something has changed, do you want to reacquire leadership? That's how we ensure only one instance becomes the leader for a partition.
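The "first writer wins, the others watch" pattern can be sketched without a ZooKeeper server. The code below is an in-memory stand-in, not a ZooKeeper client: `FakeCoordinator`, `campaign` and the path layout are all hypothetical names chosen for illustration, and the order in which watchers fire is deterministic here, unlike in a real cluster.

```python
from typing import Callable, Dict, List, Optional


class FakeCoordinator:
    """In-memory stand-in for ZooKeeper's ephemeral-node behaviour (illustration only)."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callable[[str], None]]] = {}

    def try_create(self, path: str, instance: str) -> bool:
        """Succeed only for the first instance that creates the path."""
        if path in self._owners:
            return False
        self._owners[path] = instance
        return True

    def watch(self, path: str, callback: Callable[[str], None]) -> None:
        """Register a one-shot callback fired when the path is deleted."""
        self._watchers.setdefault(path, []).append(callback)

    def owner(self, path: str) -> Optional[str]:
        return self._owners.get(path)

    def session_lost(self, instance: str) -> None:
        """Simulate an instance dying: its nodes vanish and watchers fire."""
        for path in [p for p, o in self._owners.items() if o == instance]:
            del self._owners[path]
            for callback in self._watchers.pop(path, []):
                callback(path)


def campaign(zk: FakeCoordinator, path: str, instance: str) -> None:
    """Try to become leader; on failure, watch and retry when the node goes away."""
    if not zk.try_create(path, instance):
        zk.watch(path, lambda p: campaign(zk, p, instance))


zk = FakeCoordinator()
path = "/clockwork/payments/partition-1/leader"  # hypothetical path layout
for name in ("instance-1", "instance-2", "instance-3"):
    campaign(zk, path, name)

first_leader = zk.owner(path)   # instance-1 created the node first
zk.session_lost("instance-1")   # the leader dies, watchers are notified
second_leader = zk.owner(path)  # exactly one of the watchers takes over
```

At every point in the simulation at most one instance owns the path, which is the property the leader election relies on.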

[00:17:35]Once a leader is chosen for a partition, that leader is responsible for assigning the remaining partitions to the other instances. For example, let's say payments has three partitions. Once a leader has been chosen, it checks how many instances are available and, based on that, assigns all the partitions to the instances in a round-robin manner. In this example we have three instances and three partitions, and instance A is the leader. So it chose A for partition one, B for two and C for three.

[00:18:19]So each partition gets mapped to one instance. Here you can see the leader itself is also a worker: the leader's responsibility is not just assignment, it also gets mapped to a partition. So that's that.
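As a small illustration of the round-robin assignment, here is a Python sketch. The function name and the instance labels are illustrative; the talk only states that the leader distributes partitions across live instances in round-robin order.

```python
from typing import Dict, List


def assign_round_robin(partitions: List[int], instances: List[str]) -> Dict[int, str]:
    """Assign each partition to exactly one instance, cycling through the instances."""
    if not instances:
        raise ValueError("no live instances to assign partitions to")
    return {p: instances[i % len(instances)] for i, p in enumerate(partitions)}


# The example from the talk: three partitions, three instances (A is the leader
# and also receives a partition, because the leader is itself a worker).
assignment = assign_round_robin([1, 2, 3], ["A", "B", "C"])

# With more partitions than instances, the assignment wraps around.
wrapped = assign_round_robin([1, 2, 3, 4, 5], ["A", "B"])
```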

[00:18:34]Once this data has been written to ZooKeeper, each instance checks whether it is responsible for any partition. If yes, it activates its scans. I will cover what a scan is, but basically the instance gets notified that, say, "I am responsible for partition one", and it becomes solely responsible for scanning partition one. In this case A will scan partition one, B will scan partition two and C partition three. It's a shared-nothing architecture. They won't interfere with each other's scans. That's how we scale.

[00:19:12]Now, when I talk about scanning, how does scanning work? HBase stores the keys in an ordered manner, and that's a key benefit we get out of the box, because Clockwork, as you know, is a scheduler. We want to schedule a job that is due at, let's say, 5:00 p.m. before a job that is due at 6:00 p.m. So we naturally get an ordering, and HBase helps with that. Here you can see the schema that I have mentioned on the right side.

[00:19:46]The schema is client ID, partition ID and timestamp. Each instance is responsible for a partition for a client. So it makes a scan saying, hey, give me all the jobs whose row key is less than or equal to client ID, partition ID and current time. This way we get all the jobs that are eligible for execution.
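To make the row-key scan concrete, here is a sketch that imitates a sorted key space with a Python list and `bisect`. The exact key encoding is not given in the talk, so the `|` separator, the zero-padded widths and the trailing job ID are assumptions, and the scan is given an explicit lower bound (the client and partition prefix) so that it stays inside one partition. Job IDs are assumed to be plain ASCII.

```python
import bisect
from typing import List


def row_key(client_id: str, partition_id: int, due_ms: int, job_id: str) -> str:
    # Zero-padding keeps lexicographic order identical to numeric order.
    return f"{client_id}|{partition_id:04d}|{due_ms:013d}|{job_id}"


def scan_due(sorted_keys: List[str], client_id: str, partition_id: int, now_ms: int) -> List[str]:
    """Return the keys of one client partition that are due at or before now_ms."""
    prefix = f"{client_id}|{partition_id:04d}|"
    start = bisect.bisect_left(sorted_keys, prefix)
    # "~" sorts after letters, digits, "-" and "|", so this upper bound
    # still includes every job whose due time equals now_ms.
    stop = bisect.bisect_right(sorted_keys, f"{prefix}{now_ms:013d}|~")
    return sorted_keys[start:stop]


keys = sorted([
    row_key("payments", 1, 1000, "job-a"),
    row_key("payments", 1, 2000, "job-b"),
    row_key("payments", 1, 9000, "job-c"),  # not due yet
    row_key("payments", 2, 1000, "job-d"),  # another partition
    row_key("merchant", 1, 1000, "job-e"),  # another client
])

due = scan_due(keys, "payments", 1, now_ms=2000)
due_job_ids = [k.rsplit("|", 1)[1] for k in due]
```

Because the timestamp follows the client and partition in the key, the due jobs of one partition form a contiguous, time-ordered slice, which is why each instance can scan independently.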

[00:20:10]Since each instance is responsible for only a subset of the range, instances won't interfere with each other and will only scan the jobs they are responsible for. Since these are all independent scans, this keeps the design scalable. If the traffic increases, all we need to do is increase the partition count and let more workers scan.

[00:20:42]Now, once a scan is completed, what happens next? We push that job to RabbitMQ, get an acknowledgement, and once the acknowledgement is received, we delete the job details from HBase. Why do we delete? Because if we don't, the same job will be scanned again on the next scan. To avoid that, we delete the job once we get an acknowledgement from RabbitMQ. Why do we push to RabbitMQ?

[00:21:08]Because this is a classic pub/sub problem. We want to do our scans as fast as possible and let the actual callbacks happen in a different thread. We don't want the scanner thread to block, because callbacks are asynchronous and require an HTTP call to the downstream, which can be down, in which case we have to perform retries, sidelining and everything else the retry policy requires. We don't want the scans to block on that. So we let one thread scan quickly and push to RabbitMQ; once an acknowledgement is received we delete the job details from HBase and move on. This loop repeats every 5 seconds. Everything is customizable. We keep repeating this scan to find eligible jobs without much delay.
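The publish-then-delete ordering is what keeps a job from being lost between HBase and RabbitMQ. A minimal sketch of one pass of that loop follows; a plain dictionary stands in for the HBase table and a callable stands in for the broker publish with confirmation, so `drain_due_jobs`, `publish` and the data shapes are illustrative assumptions rather than Clockwork's actual code.

```python
from typing import Callable, Dict, List


def drain_due_jobs(store: Dict[str, dict], due_keys: List[str],
                   publish: Callable[[dict], bool]) -> int:
    """Publish each due job and delete it from the store only after the broker acks."""
    handed_off = 0
    for key in due_keys:
        job = store[key]
        if not publish(job):
            # No ack from the broker: keep the row so the next scan retries it.
            break
        del store[key]
        handed_off += 1
    return handed_off


# Happy path: the broker acknowledges every publish.
queue: List[dict] = []
store = {"k1": {"id": 1}, "k2": {"id": 2}}


def ok_publish(job: dict) -> bool:
    queue.append(job)
    return True


published = drain_due_jobs(store, sorted(store), ok_publish)

# Broker down: nothing is acknowledged, so nothing is deleted.
stuck_store = {"k3": {"id": 3}}
stuck = drain_due_jobs(stuck_store, sorted(stuck_store), lambda job: False)
```

In the talk this pass is repeated on a timer (every 5 seconds by default). If the process dies after the ack but before the delete, the job is published again on the next scan, which is one reason delivery is at-least-once.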

[00:21:54]Now the last piece of the puzzle is the RabbitMQ consumer. RabbitMQ is based on a push model, so when a job is pushed to RabbitMQ, it pushes that detail to the consumer. We have Clockwork instances running that act as RabbitMQ consumers. They get those details, make a callback to the downstream based on the job details, and once an acknowledgement is received they ack the message. RabbitMQ has ack and nack: if you ack, the message is deleted; if you nack, then based on the retry policy it can requeue the message. This helps when, let's say, the Clockwork instance goes down after making a callback but before acking: RabbitMQ will requeue the message if an ack is not received. That's how we ensure an at-least-once guarantee. If we receive a nack from the client, or let's say the client is down, then based on their retry policy, whether they have set up retries or a drop policy to discard the message, we either sideline the message, that is, push it to a sideline queue, or drop it. So that covers the entire flow.
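The consumer-side decision can be summarised as a small function. This is a sketch under assumptions: the policy fields `max_retries` and `on_exhausted`, the `Action` names and the attempt counter are hypothetical, and the real consumer would translate the result into RabbitMQ ack, nack or a publish to a sideline queue.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    ACK = "ack"            # callback delivered, message can be deleted
    REQUEUE = "requeue"    # nack and let the broker redeliver
    SIDELINE = "sideline"  # park the message in a sideline queue
    DROP = "drop"          # discard, as requested by the client's policy


def handle_delivery(job: dict, callback: Callable[[dict], bool], attempt: int) -> Action:
    """Decide what to do with a message after trying the downstream callback."""
    try:
        delivered = callback(job)
    except Exception:
        # A crashed or unreachable downstream is treated like a failed callback.
        delivered = False
    if delivered:
        return Action.ACK
    if attempt < job.get("max_retries", 0):
        return Action.REQUEUE
    return Action.DROP if job.get("on_exhausted") == "drop" else Action.SIDELINE


def failing_callback(job: dict) -> bool:
    raise ConnectionError("downstream is down")


retry_job = {"id": "j1", "max_retries": 3}
drop_job = {"id": "j2", "max_retries": 0, "on_exhausted": "drop"}

first_try = handle_delivery(retry_job, failing_callback, attempt=0)
exhausted = handle_delivery(retry_job, failing_callback, attempt=3)
dropped = handle_delivery(drop_job, failing_callback, attempt=0)
success = handle_delivery(retry_job, lambda job: True, attempt=0)
```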

[00:23:16]Just to recap: payments, like all our clients, makes a quick call to Clockwork with a schedule request. What we do is make a simple entry in HBase and give them an acknowledgement; that covers the write flow. Asynchronously, we use ZooKeeper to divide our entire range into subsets and follow a shared-nothing architecture.

[00:23:38]Each job is mapped to a partition. Each partition has a leader. That leader is responsible for assigning all the partitions to the instances so that each partition is picked up by exactly one instance. Once an instance scans a job, it pushes it to RabbitMQ, gets an acknowledgement and deletes the job.

[00:24:01]Then there are asynchronous consumers that get notified by RabbitMQ when an eligible job is pushed to their queue. They make the callback, and once the callback response is received they give an ack, which deletes the message from RabbitMQ. Since we don't need any kind of replayability, we don't use Kafka. RabbitMQ works just fine out of the box for us. So that covers the entire flow.

[00:24:25]Let's look into the challenges.

[00:24:29]First of all, our traffic: it's heavily bursty for many of the clients.

[00:24:35]What happens is they try to schedule jobs at midnight. For example, let's say there are a million merchants; the upstream client just says, hey, give me all the callbacks for a million merchants at the same time, let's say midnight. So what happens? Because our scanners are very fast, they just scan, push the eligible jobs into RabbitMQ and move on. They keep on publishing; it's a decoupled publisher-subscriber, so they don't need an acknowledgement from the consumer. But on the RabbitMQ side, a burst of a million messages gets enqueued, which degrades the cluster. We have seen instances where our RabbitMQ cluster went out of memory at night because someone scheduled millions of jobs for the same time. What did we do? We moved to quorum queues. Quorum queues are a different type of queue recently introduced in RabbitMQ. They are based on the Raft protocol, which enables high throughput without sacrificing durability or availability guarantees. Just by moving to quorum queues, we got around 1.5x more throughput over the previous default lazy queue. We heavily benchmarked our RabbitMQ cluster against our production scenarios to find our breaking point. Based on that benchmarking, we set up a distributed rate limiter, which I will cover, to protect our database from going down.
We enabled back pressure because we can't control our clients' traffic. We can't ask them, hey, don't schedule at midnight. We have to make our system resilient enough that if, let's say, a billion jobs happen to be scheduled at the same time, the system does not fail. That's where resiliency in the platform comes into place. So we benchmarked our RabbitMQ cluster, figured out our breaking point, and based on that we set up a distributed rate limiter.

[00:26:48]Let's look at some real-world performance numbers. This is based on the benchmarking that I did. On the top you can see the lazy queue type and on the bottom the quorum queue type. For the lazy queue, on the same three-node cluster with nothing changed, the publish rate was barely touching 40k (40,000 RPS) and the consumption rate was around 20k. You'll notice that the publish and consumption rates are quite flaky. This introduces lag: the publisher is producing at a faster rate and the consumer is not able to catch up. But in the case of the quorum queue, you can see the publish and consumer ack rates are constant at 60k. This clearly demonstrates that the quorum queue handles the scale much better without introducing lag, and nothing else has changed.

[00:27:42]Same cluster, same durability guarantees, same availability guarantees. So yes, configuration does matter.

[00:27:49]Now let's look at the distributed rate limiter that we implemented to prevent our database cluster from going down.

[00:27:57]Again based on the same shared-nothing design principle, we created a very simple and effective rate limiter. Since we use ZooKeeper, we already know the total number of Clockwork instances that are available. And based on our benchmarking we know the breaking point.

[00:28:18]So, assuming a uniform distribution of ingestion per node, we can simply divide uniformly. Let's say our ingestion limit is 45k, keeping a safe number; we divide it by the number of nodes and get a safe limit per node. Then we can use a Guava in-memory rate limiter so that a node does not publish more than that rate. The Guava rate limiter has the property that if you try to publish more than, let's say, 10,000 messages, which is our rate limit, it will automatically sleep. This naturally creates back pressure, because the same thread that is trying to emit a message to RabbitMQ gets blocked for that duration, and the scanner blocks too. If you go back to the previous slide, you will notice that we are pushing to RabbitMQ in the scanner thread itself.

[00:29:17]So if the thread that is pushing to RabbitMQ gets blocked for a certain time because the rate limit has been exceeded, the scanner is also blocked, because it's the same thread. This introduces a very natural back pressure without us doing any complicated callbacks or anything like that, and gives us back-pressure propagation. It's a very simple and effective strategy that does not need a separate distributed database to keep track of the rate limit.
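The arithmetic and the blocking behaviour can be sketched as follows. The talk uses Guava's in-memory rate limiter on the JVM; the `BlockingRateLimiter` class below is a simplified Python stand-in written for illustration, not Guava's implementation, and the node count of 5 is a made-up example. A fake clock is injected so the example runs without real sleeping.

```python
import time
from typing import Callable


def per_node_limit(cluster_safe_rps: int, node_count: int) -> int:
    """Split the benchmarked cluster-wide safe rate evenly across live nodes."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return cluster_safe_rps // node_count


class BlockingRateLimiter:
    """Minimal smooth limiter: acquire() sleeps until the next permit is due."""

    def __init__(self, permits_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        if permits_per_second <= 0:
            raise ValueError("permits_per_second must be positive")
        self._interval = 1.0 / permits_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_free = clock()

    def acquire(self) -> float:
        """Block the calling thread until a permit is free; return seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_free - now)
        if wait > 0:
            # The scanner runs on this same thread, so sleeping here is the back pressure.
            self._sleep(wait)
        self._next_free = max(now, self._next_free) + self._interval
        return wait


class FakeClock:
    """Test clock whose sleep() just advances time."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


limit = per_node_limit(45_000, 5)  # 45k from the talk; 5 nodes is hypothetical

clock = FakeClock()
limiter = BlockingRateLimiter(10, clock=clock.time, sleep=clock.sleep)
waits = [limiter.acquire() for _ in range(3)]  # publishing faster than 10 per second
```

No shared state is needed: each node derives its own limit from the live instance count, which in the talk comes from ZooKeeper.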

[00:29:47]The second thing about how we scaled RabbitMQ: we found that RabbitMQ queues are very CPU-bound. After a certain RPS, each queue was effectively mapped to a single core, because RabbitMQ is based on Erlang, and the way it works is that each queue is a process and that process is pinned to a core. So each queue has a fundamental limit on what we can push through it. After about 10k RPS per queue we were not able to get much more, because the queue saturated. If you look at htop and check the CPU utilization, we saw only one core at 100% while all the other cores were idle. Why? Because that queue was pinned to that core, and since the core itself was saturated, RabbitMQ could not scale even though there was spare capacity. So we sharded our RabbitMQ queues. This is not a layer that RabbitMQ provides out of the box; we did the sharding at the application layer.

[00:31:02]Sharding just means we created multiple queues for the same client. Because we are a multi-tenant system, we create queues per tenant: let's say one queue for merchant and one queue for payments. For each of those we shard it, so that there are, let's say, five or ten queues, and the application randomly chooses one queue while pushing. This is just application-level sharding, and it gave us the ability to scale without saturating a CPU core.
Now let's look at how we scaled HBase. First, a one-minute high-level overview of HBase. HBase stores data in something known as region servers. When you create a table, HBase assigns that table to a region server, and if the data is present in only one region server, that region server becomes a hotspot. To avoid that, we pre-split tables at creation time to avoid hotspots during scans. On the left you can see that for client one, table one was mapped to region server one, which is a physical shard, and client two was mapped to table two. Let's say client two's RPS is 100 and client one has 10k RPS, which is 100 times more. The two tables are mapped to different region servers. This is naturally not going to scale, because almost the entire traffic goes to one region server whereas the other remains idle. So we pre-split each table into various independent ranges, similar to the partition policy that we talked about earlier. This allowed us to distribute the data across region servers and get better IOPS without wasting bandwidth.
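The application-level queue sharding described earlier in this segment can be sketched in a few lines. The queue naming scheme and the function names are assumptions for illustration; the talk only says that each tenant gets several queues and that the publisher picks one at random.

```python
import random
from typing import List, Optional


def shard_queues(tenant: str, shard_count: int) -> List[str]:
    """Return the names of the sharded queues for one tenant (naming is hypothetical)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return [f"{tenant}.callbacks.{i}" for i in range(shard_count)]


def pick_queue(queues: List[str], rng: Optional[random.Random] = None) -> str:
    """Choose a shard uniformly at random so load spreads across queue processes."""
    return (rng or random).choice(queues)


payment_queues = shard_queues("payments", 5)
rng = random.Random(42)  # seeded only to make the example repeatable
picks = [pick_queue(payment_queues, rng) for _ in range(1000)]
```

Since each queue is handled by its own Erlang process, spreading publishes over several queues lets the broker use several cores for one tenant. Consumers then need to subscribe to every shard.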

[00:33:12]Let's look at some of the metrics. Here you can see our peak has reached around 100k RPS, and the traffic is highly bursty. This is probably at midnight, when the traffic reached around 100k; the average is between 20k and 50k RPS throughout the day.

[00:33:34]This is the lag metric for our top client, payments. Here you can see our P99 lag is under 5 seconds; it is hovering between 2 and 3 seconds in this chart. This lag is there because, as you may have noted, there is some delay between scans, and naturally we can't go below a certain number. We are currently targeting a P99 lag under 5 seconds. We process around 3 billion callbacks daily. This is a stale draft: the original article says 2 billion, which was last year when I initially submitted this draft for Rootconf. Within around one year our traffic has increased by 50%, so from 2 billion it's now 3 billion. You can also notice that only a few clients cause most of that traffic. It's the 80/20 rule: 80% of the traffic comes from 20% of the clients.

[00:34:42]Same thing here: you can see the top three clients contribute around 2 billion jobs, and the remaining 30, 100 clients contribute the remaining 20% of the jobs. So yeah, that's it from my side. Happy to answer any questions. I have linked the tech blog here; if you want, you can go to the tech blog. That's all I had in mind.

[00:35:17]Yes. Yes.

[00:35:21]>> Yeah, definitely. We started this project in 2017. Back then there was no such thing as Temporal. That is one thing. The other is our scale and our traffic. Temporal works when you have a multi-step workflow where you want to orchestrate, let's say, a DAG. Our case is very lightweight: we just make an HTTP callback. Our requirement is to scale for bursty traffic and make an HTTP callback with durability guarantees. So for our use case, even today Temporal is not the right fit, because we don't want workflows; we just want HTTP callbacks. Temporal has many more capabilities than what we need. Yes.

[00:36:24]>> Yes.

[00:36:26]>> Yes.

[00:36:37]Okay. So you're asking what happens if the partition count has changed, right?

[00:36:43]>> Not the partition count. Let's say one gets removed and one gets added. Let's say earlier an instance ID was mapped to a partition.

[00:36:49]Okay.

[00:36:49]>> But let's say in future partition two got removed and its place was taken by a new partition.

[00:36:56]>> Okay.

[00:36:56]>> Named as partition something like that.

[00:36:58]>> Oh okay.

[00:37:00]All right, fair question. The thing is that the partition count is tied to the table at the time of client creation, so that is fixed. We can't reduce the partition count; we can only increase it. Now, your question: let's say there are 10 partitions, and at one time an instance was mapped to a partition, and then during rebalancing, let's say that node goes down and some new instance becomes the leader for that partition. There can be a time when the same job could be picked up by two instances, because there is always lag. We can't say that the deactivation of a partition on one instance and the activation of that partition on another instance happen in a consistent manner. So there can be a time when the same job is picked up by two instances. That is why we also need support from the client to get effectively-once: each job can carry some sort of identifier, an idempotency key, so that when the client receives multiple callbacks for the same job they can dedupe on their side. Doing exactly-once in a distributed system at 100k RPS is not easy. That is why we guarantee at-least-once, and if the downstream client provides an idempotency key then it becomes effectively-once. I hope that answers the question.
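On the client side, turning at-least-once delivery into effectively-once comes down to remembering which idempotency keys have already been applied. The sketch below is an assumption-laden illustration: `IdempotentReceiver` and the `idempotency_key` field are hypothetical names, and the in-memory set would be a durable store with an atomic check in a real service.

```python
from typing import Callable, List, Set


class IdempotentReceiver:
    """Apply each callback at most once per idempotency key (in-memory sketch)."""

    def __init__(self, apply: Callable[[dict], None]) -> None:
        self._apply = apply
        self._seen: Set[str] = set()

    def on_callback(self, payload: dict) -> bool:
        """Return True if the payload was applied, False if it was a duplicate."""
        key = payload["idempotency_key"]
        if key in self._seen:
            return False
        self._apply(payload)
        # Record the key only after the side effect succeeded, so a failed
        # attempt can be retried by a later redelivery.
        self._seen.add(key)
        return True


applied: List[str] = []
receiver = IdempotentReceiver(lambda p: applied.append(p["txn"]))

# The same job delivered twice, for example after a partition rebalance.
results = [
    receiver.on_callback({"idempotency_key": "txn-1:recon", "txn": "txn-1"}),
    receiver.on_callback({"idempotency_key": "txn-1:recon", "txn": "txn-1"}),
    receiver.on_callback({"idempotency_key": "txn-2:recon", "txn": "txn-2"}),
]
```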

[00:38:28]>> Yeah another question I have.

[00:38:30]>> Okay. So currently having consumers that... >> So we keep a static count of consumers.

[00:38:46]We do have an autoscaler service; that is a best practice that we follow. If our CPU utilization goes below a threshold we scale down our consumers, and if it increases we scale up. But as you know, an autoscaler does not kick in immediately, so we tend to slightly overprovision, by 20-30%, so that we remain warm in case there are peaks, because we can't anticipate the future. So we keep it slightly overprovisioned to handle spikes. Currently not more than 20 to 30 consumers are running. It's a very minimal stack: around 20 instances that are responsible for scanning, pushing and consuming. All the responsibilities are handled by the same instance. Because it's just 20 boxes, 20 virtual machines, 20 app instances basically, and the infra footprint is so low, we have not decoupled the architecture by creating, say, a dedicated publisher and a dedicated subscriber. >> Okay. >> put in the >> Yes. >> that for if that goes down.

[00:40:12]>> Yes.

[00:40:14]>> Do you have durability already, since you're managing the hosts yourself, right? You're not using anything managed.

[00:40:21]>> Yes, absolutely. Yes.

[00:40:22]>> So, just wanted to check how you are adding the durability and >> So first of all, yeah. So the question is how do we add durability in case RabbitMQ goes down, right? Okay. So the answer is HBase is our source of truth.

[00:40:38]RabbitMQ is just a message broker that links the publisher and subscriber. If, let's say, RabbitMQ is down, then the publish will fail, and since the publish has failed we won't delete the message from HBase. So the data is still present in HBase, and the same rows will keep getting scanned in the same loop. New jobs will not be scanned, so there will be head-of-line blocking: we have to get through the current block first. And if RabbitMQ is down then yes, it's basically all hands on deck. So be >> huh.

[00:41:21]>> Oh okay. Okay. And the consumer part you >> Yes, yes, definitely. So I understood. Basically you're saying the publisher got an acknowledgement and then deleted the job, and the consumer has not received an acknowledgement. Okay. So that has happened. That is the scenario I mentioned where the cluster went out of memory because our ingestion rate into RabbitMQ was so high. In that case, since we use quorum queues, durability guarantees are provided by default. We use a three-node replicated cluster, so each message is replicated three times; it's not on just one node. So we can handle two node failures. That is the first thing. The second thing is, if that node is restarted or something, RabbitMQ ensures the durability. The only thing we have to ensure is that we get an acknowledgement back from RabbitMQ. Once we get an ack, it means the message has been replicated to three nodes by Raft; RabbitMQ quorum queues use Raft for replication, and that handles the consensus for us.

[00:42:28]>> I missed that you are managing this whole thing.

[00:42:31]>> Okay sure thanks. I just have one question.

[00:42:36]Just want to understand.

[00:42:40]>> Okay.

[00:43:02]Yes.

[00:43:07]>> Yeah, definitely. The use cases mentioned here are mostly asynchronous, but on the payments side Clockwork is also heavily used for reconciliation. See, 90% of the time the payment succeeds with no issues, but in those 1% of cases when the payment has failed or is stuck, you might have seen the loop going on in the app. This is where Clockwork helps with reconciliation. When you trigger a payment, the payment service at the same time schedules a callback in Clockwork for this transaction after 1 minute. After 1 minute Clockwork gives them a callback for reconciliation so that they can check the status of the payment. Is it completed? Has it failed? Is it stuck?

[00:43:52]And then they can trigger the reconciliation process. If there was no Clockwork in the payment reconciliation flow, payments would have to do something manually on their side, or in a batch: they would have to scan the entire database to figure out which transactions have timed out, which is expensive and would not work at scale. So to handle real-time reconciliation, payments uses Clockwork in that path. use some kind of status already.

[00:44:25]>> We don't know. That is what we don't know, because when you make a payment it goes through a multi-hop process: multiple banks, multiple intermediaries. We don't know at which point the payment can get stuck, and we want to ensure reconciliation as fast as possible. It's an optimistic process. 90% of our payments go through, but it's the cases where a payment is stuck that matter; there we want to ensure the reconciliation happens fast, because we don't want the customers to get stuck.

[00:44:57]So that is the case.

[00:45:00]>> Yes.

[00:45:03]>> Yes. A payment does not happen synchronously. It is an asynchronous, multi-hop, multi-state process. So we don't know.

[00:45:12]this question like you have.

[00:45:26]>> Yes. No.

[00:45:39]>> Yes.

[00:45:44]Yes. If you look at the lag I mentioned earlier, our P99 lag is under 3 seconds. So yeah, callbacks are frequent, and it is a production issue: we have seen production issues happen when callbacks to payments were delayed. If the reconciliation is late, if they receive late callbacks, then the actual customers suffer because they keep seeing the payment as stuck, so we want to avoid that. We want the status to be reflected as soon as possible. Yes. Yes.

[00:46:19]>> Okay.

[00:46:20]>> Okay.

[00:46:25]>> Yes.

[00:46:28]>> No, these are all the transactions, because Clockwork does not know at the time of scheduling whether a transaction has failed. Payments has scheduled, let's say, a callback saying give me a callback after 1 minute. So these are all the payment transactions that have happened.

[00:46:42]Let's say over the day 1 billion jobs or 1 million transactions have happened.

[00:46:46]So this each billion is attributed to one job here.

[00:46:55]>> Absolutely. Yes.

[00:46:57]>> Yes. Yes. End of day is things like daily merchant settlement, coupon invalidation and all those things.

[00:47:03]>> Yes.

[00:47:05]saying that you know that >> yes >> slots.

[00:47:19]>> Okay.

[00:47:23]>> So we are a platform, right? Our upstreams are various applications or various internal teams, and each would have their own SLA to maintain. Yes.

[00:47:39]>> So the question is like do you provide visibility in terms of the cluster is occupied this much. This is the time available.

[00:47:46]>> Oh okay. Okay. No, no. I understand your point: do we have some analytics to show how much the cluster is utilized, something like that, so that people can plan around it. But we are not at that stage yet. Our requirement is to schedule the job as soon as possible. If someone says they want to schedule this transaction or this settlement at 3 p.m.

[00:48:12]any time, we want to schedule it as fast as possible.

[00:48:16]>> You don't have a categorization of workload.

[00:48:18]>> Yes, it's an HTTP callback. We don't do any kind of long-running jobs like execution of a JAR or a Docker file. It's an HTTP callback, so it's a short-running job.

[00:48:34]>> Exactly.

[00:48:37]>> Yeah. So the question is whether we have any kind of idempotency support from the platform itself. The answer is that as a platform we can't provide idempotency, because we are there to make an HTTP callback. While we make the HTTP callback, the client can provide certain parameters, like POST params or some >> Absolutely. While scheduling a job they can say, hey, use this key or this ID while making the callback. We will pass the same thing back to them during the callback, and then they can maintain their own state.

[00:49:15]>> Yeah.

[00:49:17]>> Okay.

[00:49:35]Yes. At PhonePe we use a common stack. We ship everything as Docker containers and deploy them on virtual machines. The machines and everything else are self-managed; we have our own data centers. For orchestration we don't use Kubernetes. There is actually a talk happening by our chief architect on this: we have our own in-house Drove, which is our container orchestration platform for deploying Docker containers at scale. It is very similar to Kubernetes, but it's an in-house platform that we developed for deploying Docker images across our data centers. I want to discuss the cost. Yes.

[00:50:25]>> Yes.

[00:50:35]>> Yes. But I have seen, you know, up to 90% cost savings the moment you start. Is there any plan to start such >> Kubernetes? No, no. Currently we have our own in-house Docker container orchestration platform.

[00:50:57]So currently there are no such plans.

[00:51:02]>> Yes.

[00:51:05]Okay.

[00:51:18]>> Yes.

[00:51:23]Okay. Yes.

[00:51:25]Okay. Sure. So, when I mentioned CPU, that is just one of the metrics. We do have a lot of metrics, because observability for a platform is not just a single metric.

[00:51:36]As I mentioned here, we keep track of callback delays, and we keep track of the number of callbacks that have been processed. We maintain the historical analytics as well. Basically, we have an in-house anomaly detection platform, and any time there is any kind of deviation from the expected metrics, we get alerts. But the autoscaler can also be triggered on this, because this is a metric, right? Sorry.

[00:52:08]>> Yes. So the autoscaler can be triggered on lag also. It's not a limitation of the platform that it is enabled only on CPU, but CPU is a very good metric for us. See, one thing is that if we enable autoscaling on various metrics, what can happen is noise. Let's say there is noise in a certain metric and instances scale up. Then rebalancing can trigger, and rebalancing is an expensive operation because it creates lag: there are moments when the leader instance goes down, because it will be deactivated, and a new instance will be chosen as leader. So we don't want autoscaling to happen too frequently. It should be kind of an exceptional event when autoscaling happens, unless it's a rolling upgrade or some kind of restart process that we are trying to trigger. >> Do you think consumer lag is?

[00:53:09]>> Yes, it is. That is where we have anomaly detection alerts, so that the on-call gets paged to see whether consumer lag is a real issue or not. Sometimes, let's say, the consumer is down: if payment is down due to some issues, or merchant is down, the lag will be there, and in that case we would scale our instances unnecessarily. So that is why one metric alone does not always make sense to scale on.
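
One common way to keep a noisy metric from triggering a scale-up (and the rebalancing that follows) is to require the threshold to be breached for several consecutive samples. The sketch below illustrates that idea only; the function name, the 0.75 threshold and the three-sample window are assumptions, not the autoscaler configuration used by the speaker's platform.

```python
from typing import Sequence


def should_scale_up(cpu_samples: Sequence[float],
                    threshold: float = 0.75,
                    sustained: int = 3) -> bool:
    """Return True only if the last `sustained` samples all exceed `threshold`."""
    if len(cpu_samples) < sustained:
        return False
    return all(sample > threshold for sample in cpu_samples[-sustained:])


# A single spike followed by a dip does not trigger scaling.
spiky = should_scale_up([0.5, 0.9, 0.6, 0.8])
# Three consecutive high samples do.
sustained_load = should_scale_up([0.5, 0.8, 0.9, 0.85])
```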

[00:53:39]>> Okay, um, is the consumer not the platform? The platform is also a consumer? >> No, no. The question is whether the consumers are also related to the platform. So the answer is no. Consumers are separate applications, merchant, payment, which is like a separate ecosystem. >> And how is there >> Oh, okay, sorry, I misinterpreted the question. So when you say consumer, this is the, basically, this one, right? So RabbitMQ, this is part of Clockwork itself. Do you mean this consumer, the RabbitMQ consumer? >> Yeah. So this is Clockwork itself. When a job is pushed to RabbitMQ, these RabbitMQ consumers are part of Clockwork itself; they get the job and then make a callback to the downstream. So this is part of the application stack itself. >> And what's your >> Yeah. So it depends upon the client, right? In the last line you can see that, per tenant, we have specified a retry policy.

[00:54:54]So if the client, let's say, specifies that after three retries the message should be dead-lettered, we create a separate queue in RabbitMQ and push that failed message to the dead-letter queue.

[00:55:05]>> And it's configurable.

[00:55:06]>> Absolutely it's configurable. So the client can specify drop as well. Sure.
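
The per-tenant retry policy described above can be sketched as follows. This is an in-memory illustration, not RabbitMQ code: `RetryPolicy`, `deliver`, and the `"dead_letter"` / `"drop"` values are hypothetical names, and a plain list stands in for the dead-letter queue.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetryPolicy:
    max_retries: int
    on_exhausted: str  # hypothetical values: "dead_letter" or "drop"


def deliver(job_id: str,
            callback: Callable[[str], bool],
            policy: RetryPolicy,
            dead_letter_queue: List[str]) -> str:
    # One initial attempt plus up to max_retries retries.
    for _ in range(policy.max_retries + 1):
        if callback(job_id):
            return "delivered"
    if policy.on_exhausted == "dead_letter":
        dead_letter_queue.append(job_id)
        return "dead_lettered"
    return "dropped"


attempts: List[str] = []


def failing_callback(job_id: str) -> bool:
    """Stand-in for a downstream service that is currently down."""
    attempts.append(job_id)
    return False


dlq: List[str] = []
result_a = deliver("job-1", failing_callback, RetryPolicy(3, "dead_letter"), dlq)
result_b = deliver("job-2", failing_callback, RetryPolicy(3, "drop"), dlq)
```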

[00:55:12]>> Yes.

[00:55:26]Now we have second.

[00:55:35]>> Yeah. Yeah.

[00:56:03]participants.

[00:56:14]Yeah.

[00:56:27]this way.

[00:56:40]>> Yeah. Thanks.

[01:05:52]Thank you.

[01:06:01]>> You will need the mic, right?

[01:06:07]>> So now we have our second talk, which is going to cover multi-agent systems.

[01:06:39]Thank you.

[01:06:40]>> Yeah, thank you. So hey everyone. Again, my name is Manish Druidi. I'm an architect at IDfy. The title of my topic is designing multi-agent systems at scale. Now if you look at AI agents in general, individually, in isolation, they tend to perform very well, right? If you look at all the demos online, all the POCs, they perform very well, because if you give an agent some set of instructions, some tasks, they work very well.

[01:07:09]The problem arises when you try to glue isolated, specialized agents together into a workflow. That is when things start getting messy. So what I want to talk about is some of my experiences while building this system and some of the learnings. Before I start, a quick show of hands: how many of you have actually built agent workflows?

[01:07:31]I just want to understand from you some of the challenges that you have faced, especially when you actually deploy this in production for enterprise use cases.

[01:07:43]What are the challenges that you guys have faced?

[01:07:48]>> Isolation. Yeah that's a big one.

[01:07:52]Anything else?

[01:07:57]>> Mhm.

[01:08:04]>> And the Murphy's law when it fails, it fails. Yeah.

[01:08:06]>> Yeah.

[01:08:17]These are the few.

[01:08:20]>> Got it. Anything else? Any other patterns?

[01:08:28]>> Sorry.

[01:08:31]>> Okay. It will be obsolete let's say in a few months or so. Okay. Yeah. I think you >> understood. Uh anything else?

[01:08:59]Yeah.

[01:09:18]>> Yeah.

[01:09:22]Correct.

[01:09:25]>> Yeah, I can I can actually ask a programming question on the chatbot.

[01:09:28]Yeah. Right.

[01:09:31]>> Yeah. So, I think I'm seeing a common pattern here: hallucination, and determinism of the output or the response. I think these are some of the challenges that we also faced. We had isolated AI agents in our workflows already. When we started gluing them together, we faced a few challenges, and what we ended up doing eventually is building our own framework around this. We abstracted a few layers, and I just want to talk about those aspects in this session.

[01:10:06]Before I proceed, I just want to quickly talk about what we do at IDfy.

[01:10:10]So, IDfy is Asia's leading trust tech platform. Basically, what we do is help our clients answer one fundamental question about their users, and that is: can we trust them? So, for example, if I'm a bank and I want to sanction a personal loan to a user, I would want to know, let's say, whether there are any court cases against that user or not. I would also want to perform some regulatory checks against that user. So IDfy provides all of those offerings, all of those product solutions, to our clients, so that they can completely trust their users for different use cases. We have three core platforms: the onboard IQ platform, the Wondrous platform and the preview platform. Each of these platforms has different products and use cases that we have built underneath them, and all of them are powered by our API platform and various AI-powered systems.

[01:11:07]We operate across different industries, and our presence is in different geographical regions. So I just wanted to quickly talk about where we are coming from.

[01:11:18]I'll quickly talk about the problem statement. Most of the use cases or products that we have at IDfy follow a similar architecture. Any product that we see at IDfy has four different planes. The first plane is the data acquisition plane, which deals with fetching data from different data sources. These data sources could be publicly available websites, or third-party data vendors that we have partnered with, or the data could be derived from internal data that we have stored over the years. So the data acquisition plane deals with everything related to fetching the data from different sources and collating it. The second plane is the data ingestion plane. Here, depending on the use case, the data that we have collected and the type of data, we build custom ETL pipelines. For example, if search is my use case, I would maybe use Elasticsearch for storing my data. If semantic queries are my use case, I would use a vector DB. If I want to make sense of different entities in my knowledge base and the relationships between those entities, I might use a knowledge graph. So depending on the type of data, the use case and the product that I'm building, my mode of ingestion would differ and my ETL pipelines would differ. The third plane is the orchestration and intelligence plane, but that is not part of this scope. The fourth plane is the consumption plane.
So this is how our users are able to avail the offerings that we give them. There are usually two modes of consumption. One is the APIs that we have built; we have about 170-odd APIs that we give to our clients. The other is the UI portals, the different UI/UX products that we have built. All of them are powered by our search layer, which sits on top of the data that we have stored in different data stores.

[01:13:33]Now, even though over the years we have built these workflows multiple times for different products, it is still time-consuming whenever we want to go live with a new product or a new use case. There are multiple data sources and different types of data; the data could be raw, and we would want to write very complex extraction pipelines. A lot of engineering effort is spent in building this entire thing together. So these are some of the challenges. And it's not just tech bandwidth: for us to go live with a new use case or a new product, all the stakeholders across the company have to work for weeks or months just to get one product live. And this is before we can even make money out of it.

[01:14:21]So product, tech, the designers, everyone is involved for months just to get one use case live.

[01:14:28]We had built these kinds of systems multiple times over the years before we realized that there should be an alternate approach to building similar systems, so that we can go live really quickly. We would still have to do all of this, but we wanted a way to quickly go live with a new product or use case, get some market share, and then figure out whether it is worth spending a lot more time or not. So the solution that we ended up building, and this is the crux of the topic that I'll be talking about, is an in-house multi-agentic framework, built from scratch. I'll talk about why we ended up building this from scratch, and did not use any of the existing frameworks, in one of the later slides, but this is what we ended up building. Now, in the first point, especially in the bold parts, you will see a lot of technical jargon: orchestrated, tool-augmented, multi-agent execution framework, and so on. But each of these terms is actually very important from a design point of view. When we were designing this framework, we consciously took some decisions which helped us go live faster for our use case. The first one is orchestrated. A lot of you talked about how, in agentic frameworks, the response or the output would not be either structured or deterministic. So for us it was very important that we control the entire execution of how those agents are executed.

[01:16:04]So that is one. We built an orchestration engine, an orchestrator service, that deals with the coordination of how agents talk to each other and how they execute in a given workflow. We also have a lot of agent tools that we have built, which agents leverage to do their tasks. And the second point is also very important: we never wanted a fully autonomous multi-agentic system, for the same reasons that you guys talked about. For my business use cases, I cannot rely on an AI agent that can take its call on its own, have free will, and then generate some output for the users to see. I wanted full control over how my workflow is being orchestrated. I wanted full control over how each agent in my system is actually working and what guardrails each agent has.

[01:17:01]So we wanted control over all aspects of the workflow, which is why we built a semi-autonomous agent, so to speak. In production systems, predictability matters more to me than autonomy, which is why we ended up making the decisions that we took. I'll quickly touch upon the architecture of the system. There are a lot of components here, so I'll talk about some of the important ones. The first one is the agent skills component. Agent skills is where all the agent definitions reside. For each of the AI agents that we have, we define the role of that agent, the instructions that we want to give to that agent, and the guardrails that the agent should operate under. We also define rules: if, let's say, there are 10 different agents, we define that agent A can only talk to B and C and cannot talk to anything else. We also define which tools the agent has access to. So all of those guardrails, the custom instructions, the roles, everything is defined in the agent skills, and agent skills in our system are nothing but markdown files. So that is another aspect. The knowledge repository is the component where we store all of our global context and global instructions.

[01:18:33]So any custom instructions related to a particular use case for example uh domain knowledge uh any knowledge that you know can be shared across the different agents that we have are stored in the knowledge repository.
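
To make the "agent skills are markdown files" idea concrete, here is a small sketch of what such a file and a loader could look like. The file layout (a title line, `key: value` lines, and an `## Instructions` section) is entirely an assumption for illustration; the talk does not show the actual template, and `parse_skill` is a hypothetical helper.

```python
SKILL_MARKDOWN = """\
# court-case-checker
role: Checks public court records for a given person
allowed_agents: report-writer, risk-scorer
allowed_tools: search_records, fetch_document

## Instructions
Return only cases that match the full name and date of birth.
"""


def parse_skill(markdown_text: str) -> dict:
    """Parse the hypothetical skill layout above into a plain dict."""
    skill = {"name": "", "role": "", "allowed_agents": [],
             "allowed_tools": [], "instructions": ""}
    lines = markdown_text.splitlines()
    body_start = len(lines)
    for index, line in enumerate(lines):
        if line.startswith("## Instructions"):
            body_start = index + 1
            break
        if line.startswith("# "):
            skill["name"] = line[2:].strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            if key in ("allowed_agents", "allowed_tools"):
                skill[key] = [item.strip() for item in value.split(",") if item.strip()]
            elif key == "role":
                skill["role"] = value.strip()
    # Everything after the Instructions heading is free-form text for the agent.
    skill["instructions"] = "\n".join(lines[body_start:]).strip()
    return skill


skill = parse_skill(SKILL_MARKDOWN)
```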

[01:18:46]The orchestrator is again a very important component in our case. It decides how a particular multi-agent workflow is run or executed. The analogy that I like to use relates AI agents to humans. Let's say there is a CEO of a company. The CEO in this case is my orchestrator, and the employees of different verticals in the company are AI agents. Each lead of each vertical is an expert in their own domain; they can do one task very well, and the job of the CEO, the orchestrator, is to figure out when to delegate what to which agent. So what we have done is decouple the intelligence part of the system from the orchestration part. We don't allow agents to reason about how to orchestrate within a workflow; we do that ourselves, and each agent is responsible for reasoning within its own purview, within its own guardrails.
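
The decoupling of orchestration from intelligence can be sketched as a fixed, code-defined sequence in which agents only produce outputs and never choose the next step. The agent functions here are trivial stand-ins for LLM-backed agents, and all names (`fetch_agent`, `summarise_agent`, `run_workflow`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fetch_agent(context: dict) -> dict:
    """Stand-in for an agent that gathers records."""
    return {"records": ["record-1", "record-2"]}


def summarise_agent(context: dict) -> dict:
    """Stand-in for an agent that summarises what earlier agents produced."""
    return {"summary": f"{len(context['records'])} records found"}


AGENTS: Dict[str, Callable[[dict], dict]] = {
    "fetch": fetch_agent,
    "summarise": summarise_agent,
}

# The order is decided by the orchestrator's rules, never by an agent.
WORKFLOW: List[str] = ["fetch", "summarise"]


def run_workflow(workflow: List[str],
                 agents: Dict[str, Callable[[dict], dict]]) -> Tuple[dict, List[str]]:
    context: dict = {}
    trace: List[str] = []
    for agent_name in workflow:
        output = agents[agent_name](context)
        context.update(output)
        trace.append(agent_name)
    return context, trace


final_context, trace = run_workflow(WORKFLOW, AGENTS)
```

Because the sequence lives in ordinary code, the invocation order can be asserted in a test, independently of what any model returns.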

[01:19:54]And then the execution. The way memory is shared across the different agents in my workflow is through temporary artifacts that each agent creates. For example, if agent A wants to pass some information to agent B, then at the end of its execution agent A creates a temporary artifact file, let's say a JSON file or anything else, and stores it temporarily, let's say within the same container. The orchestrator then decides that agent B should now be invoked, and agent B can take the artifact that agent A had created. That is how memory is shared across the different agents. We store state in Redis, so that is another aspect. We use CLI command templates for executing our agents: since our agent skills are nothing but markdown files, the orchestrator invokes an agent by running the CLI commands.

[01:20:51]And we use either Claude or Gemini, depending on the use case. Agent tools is again a very important component here. This is where we abstract the business logic. Any time we want an agent to do some post-processing, or to interact with internal or external systems, we abstract that entire logic inside callback functions and expose those functions as agent tools for our agents to use. And in agent skills, as part of our guardrails, we specify which agents can use which tools. Any questions here? I can answer questions at the end as well, but I just want you guys to go through this one more time. Any questions I can answer at this point? >> What is >> So you have developed your own >> Our own, yeah. It's a lightweight orchestration service. Basically, each use case has a separate workflow, and each workflow has some orchestration rules that this service defines. >> So you have not used some common open-source >> Yeah, I'll come to that, I think, in the third-last slide.
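
A minimal sketch of the artifact handoff described above: one agent writes a temporary JSON file and the next agent reads it. The function names, the file name and the payload are made up for illustration, and a temporary directory stands in for the shared container filesystem.

```python
import json
import tempfile
from pathlib import Path


def agent_a(artifact_dir: Path) -> Path:
    """Write agent A's result as a temporary JSON artifact and return its path."""
    artifact_path = artifact_dir / "agent_a_output.json"
    artifact_path.write_text(json.dumps({"entities": ["Acme Ltd", "J. Doe"]}))
    return artifact_path


def agent_b(artifact_path: Path) -> int:
    """Read the artifact that agent A left behind and work from it."""
    payload = json.loads(artifact_path.read_text())
    return len(payload["entities"])


# The orchestrator decides the order; the directory is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp:
    handoff = agent_a(Path(tmp))
    entity_count = agent_b(handoff)
```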

[01:22:04]>> Orchestrator is a deterministic service.

[01:22:07]>> Correct.

[01:22:10]>> Correct. So it's like, if I'm a CEO, I know that the design lead will handle this part of the project, the technical lead will do this, and so on. And then those leads are free to exercise their free will to do anything they want within the guardrails that the CEO has set.

[01:22:34]Yeah.

[01:22:39]>> So this is the framework that we have built for all use cases. Yeah.

[01:22:45]>> No. So I think the rest of the slides will probably answer your questions, and then I can take them. But any questions related to this architecture at this point?

[01:22:55]>> Yeah.

[01:23:07]No. So there are two things here. The guardrails controller that you see, which is part of agent skills, is where we define the guardrails that each agent has. The knowledge repository is where we have the custom instructions for each use case and some shared knowledge that each agent could use. For example, in a given workflow maybe report generation is something that we need to do, so the report template could go under the knowledge repository. Yeah. >> It's still a boundary, but it's a boundary that I've set for all agents and not particular to one agent. So it's still a guardrail, you can say, but the agent-specific guardrails still reside in agent skills. >> And what exactly >> Things like: which tools the agent can use, number one; which other agents an agent can communicate with; what the output structure of the response that each agent generates is; if the agent is leveraging an agent tool, what the retry mechanism is; what the execution timeout is; and so on.
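
The per-agent guardrails listed above (allowed tools, allowed peer agents, output structure, retries, timeout) could be modelled as a small immutable record that the orchestrator checks before and after each agent call. The `Guardrails` class and its field names are hypothetical; the talk only names the categories.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Guardrails:
    allowed_tools: FrozenSet[str]
    allowed_peers: FrozenSet[str]
    required_output_keys: FrozenSet[str]
    max_tool_retries: int
    timeout_seconds: float

    def can_use_tool(self, tool_name: str) -> bool:
        return tool_name in self.allowed_tools

    def can_talk_to(self, agent_name: str) -> bool:
        return agent_name in self.allowed_peers

    def output_is_valid(self, output: dict) -> bool:
        # Only checks that the required keys are present, not their values.
        return self.required_output_keys.issubset(output.keys())


agent_a_rules = Guardrails(
    allowed_tools=frozenset({"search_records"}),
    allowed_peers=frozenset({"agent_b", "agent_c"}),
    required_output_keys=frozenset({"status", "findings"}),
    max_tool_retries=2,
    timeout_seconds=30.0,
)
```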

[01:24:18]>> So we are not using any external framework. All of this is built in house. And then obviously for our tech team, we have abstracted this out by building this framework.

[01:24:29]>> Mhm.

[01:24:36]>> Okay.

[01:24:42]Yeah.

[01:24:45]>> Yeah. I think the slide where I talk about build versus buy, or rather why we ended up building our own in-house framework, should answer that. We'll come to that. Yeah.

[01:24:55]>> One more question.

[01:24:55]>> Yeah.

[01:24:59]>> These are basically third-party LLM providers that we are using. That's it. So for example, if you're using Claude or Gemini, that is what it does.

[01:25:07]So this is again a very lightweight service. It only interacts with, let's say, your Claude SDK or Gemini SDK, and so on, or CLIs.

[01:25:15]>> Are you using any service anywhere?

[01:25:17]>> No. Right now no in some other systems yes but not in this framework.

[01:25:21]>> No.

[01:25:40]So all of that is taken care by the consumption plane that we have. So the APIs or basically the request validators all of that is taken care by that. Yeah.

[01:25:54]>> By the orchestrator.

[01:25:56]So the orchestrator always decides if you know the workflow at any given point in time is working as expected or not or if the agents are producing the same or the expected responses or not.

[01:26:15]>> Yeah.

[01:26:17]>> So there are two aspects. Each agent skill has the guardrail of what the agent should return as part of the response, and then the orchestrator actually validates the response structure, whether it contains, let's say, any PII or not, and so on.

[01:26:33]>> What is the role of response validator?

[01:26:35]>> What the orchestrator does is orchestrate between the different agents and collate the responses. The response validator then reassembles the final response once it has been validated and sends it back to, let's say, the downstream service.

[01:26:51]>> Any questions, shall we take them after the presentation? >> Works, works. >> context we have >> Yeah. So some of the things that we had to handle as part of platform engineering. One is that all of these workflows, since there are multiple agents working together, are usually long-running. In this case, what happens when any touch point in my workflow fails? What we do is, at each touch point, after the successful or failed execution of each agent, we store the state in Redis. So in this example, if checkpoint 1 and checkpoint 2 completed successfully, we store that state in Redis, and if checkpoint 3 fails, we know where to start from when we resume the workflow. The entire workflow won't be invoked again from, let's say, checkpoint A; we'll resume from checkpoint 3, or C.
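
The checkpoint-and-resume behaviour can be sketched as below. A plain dict stands in for the Redis state store mentioned in the talk, and the function and key names are hypothetical. The sketch records only the index of the next step to run; the real system may store richer state, including failures.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_with_checkpoints(workflow_id: str,
                         steps: List[Step],
                         state_store: Dict[str, int]) -> bool:
    """Run steps in order, skipping those already recorded as completed."""
    start_index = state_store.get(workflow_id, 0)
    for index in range(start_index, len(steps)):
        _name, step = steps[index]
        try:
            step()
        except Exception:
            return False  # the checkpoint still points at the failed step
        state_store[workflow_id] = index + 1
    return True


calls: List[str] = []
failures_left = {"c": 1}  # step C fails once, then succeeds


def step_c() -> None:
    if failures_left["c"] > 0:
        failures_left["c"] -= 1
        raise RuntimeError("transient failure")
    calls.append("c")


steps: List[Step] = [
    ("a", lambda: calls.append("a")),
    ("b", lambda: calls.append("b")),
    ("c", step_c),
]
store: Dict[str, int] = {}
first_run = run_with_checkpoints("wf-1", steps, store)
second_run = run_with_checkpoints("wf-1", steps, store)
```

On the second run, steps A and B are not executed again; execution picks up at step C.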

[01:27:54]Again, I think this was something that a lot of you guys were also talking about.

[01:27:58]How do we make sure that the workflow generates the same response given the same set of instructions or data, right?

[01:28:04]So a few things. One is the agent skills that we create, the guardrails that we set for each agent: we define the output structure of each agent, and that makes the responses deterministic. Secondly, the orchestrator is eventually responsible for deciding whether to proceed with the execution in the workflow at each touch point, number one, and what to send as the response. [inaudible] taken care of by the guardrails controller that I spoke about. So each agent knows how many times it can retry, let's say, an agent tool call and what the expected response is, and the orchestrator also decides the execution timeout for each agent. From this we are able to ensure that any time an agent loops, the execution is timed out and we are able to handle that response gracefully.
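
An execution timeout around an agent call can be sketched with the standard library as follows. `run_agent_with_timeout` and the two stand-in agents are hypothetical. Note the limitation: a Python thread cannot be forcibly stopped, so the sketch only stops waiting for it; a production orchestrator that launches agents as CLI processes could terminate the process instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def run_agent_with_timeout(agent: Callable[[], dict], timeout_seconds: float) -> dict:
    """Return the agent output, or a timeout marker if it runs too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(agent)
    try:
        return {"status": "ok", "output": future.result(timeout=timeout_seconds)}
    except FutureTimeout:
        return {"status": "timeout", "output": None}
    finally:
        # Do not block on the worker; it may still be running in the background.
        executor.shutdown(wait=False)


def quick_agent() -> dict:
    return {"answer": "done"}


def looping_agent() -> dict:
    time.sleep(0.3)  # stands in for an agent stuck in a loop
    return {"answer": "too late"}


quick_result = run_agent_with_timeout(quick_agent, timeout_seconds=1.0)
slow_result = run_agent_with_timeout(looping_agent, timeout_seconds=0.05)
```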

[01:30:10]Uh similarly with API calls I think for each tool call or for each API call we have a retry mechanism depending on the use case depending on the agent and depending on the tools that we are using.

[01:30:22]For cost, again, we have set all the guardrails, and at the same time we also instrument everything. At IDfy each product is instrumentation-first. By this I mean we log everything. Each touch point is logged and instrumented, and we have alerts in place, triggered depending on the alert type and the threshold. For example, if we know that for a given workflow this should be the output token size, or this should be the number of failure scenarios, then if that threshold is breached, alerts are triggered automatically so that we know when to intervene from a developer perspective. The guardrails that we have set also ensure that our costs never balloon, so all of that is under control. And eventually, any time anything fails in the workflow, the failure scenario is again handled by the orchestrator, so that the consumption plane, the interface that our users see, always gets a deterministic output. It could be either a failure response or a success response.
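
A threshold alert of the kind described above reduces to comparing observed per-workflow metrics with configured limits. The metric names and limits below are made-up examples, and `breached_thresholds` is a hypothetical helper, not part of any alerting product.

```python
from typing import Dict, List


def breached_thresholds(metrics: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose observed value exceeds its limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


observed = {"output_tokens": 5200, "failed_runs": 1}
limits = {"output_tokens": 4000, "failed_runs": 3}
alerts = breached_thresholds(observed, limits)
```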

[01:31:37]Uh there are two main execution models.

[01:31:39]So from a scaling perspective it's again a very simple architecture. Each workflow in our system has a dedicated queue, number one.

[01:31:49]And we use KEDA for autoscaling based on queue depth. So that is one aspect. Since these are long-running tasks, we use queues so that the other systems that we have in place are not affected. Then we have cron-based execution, and there are two aspects to it. In some cases we have workflows that need to be triggered periodically; for example, reports for a few use cases need to be generated, let's say, every 12 or 24 hours. In such cases we use periodic cron jobs. That is one. The second part, and something very interesting that we have built, is self-healing agents. What we have identified is that in some of our workflows there are certain patterns because of which the workflows fail. In such cases we have created self-healing agents, which look at the failed jobs or failed workflows.

[01:32:49]Let's say for failures in the last 1 hour or so, the self-healing agent gets triggered.

[01:32:54]It autonomously figures out what the issue is from the code base, the use case and the failure logs. It creates an RCA, and then also creates a fix for that bug. It then uses GitLab to create a merge request and shares the link, along with the RCA, via notifications to developers. All of that is done autonomously. This is not implemented across all workflows, but for certain workflows where we know there are common failure scenarios, it works very well.
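
For the queue-based autoscaling mentioned earlier in this segment, the core arithmetic is usually "replicas needed to drain the queue at a target backlog per replica, clamped to a range". The sketch below shows that generic calculation only; it is not the configuration of any particular autoscaler, and the parameter names and numbers are assumptions.

```python
import math


def desired_replicas(queue_depth: int,
                     target_per_replica: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Replicas needed for the backlog, clamped to [min_replicas, max_replicas]."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


normal_load = desired_replicas(45, target_per_replica=10, min_replicas=1, max_replicas=8)
idle = desired_replicas(0, target_per_replica=10, min_replicas=1, max_replicas=8)
burst = desired_replicas(500, target_per_replica=10, min_replicas=1, max_replicas=8)
```

With one queue per workflow, each workflow gets its own `queue_depth` and can scale independently of the others.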

[01:33:36]Yeah. So this is a question that we get asked a lot: why did we not use something like CrewAI, LangGraph or other multi-agentic frameworks to build this? Why did we build this in-house? The simple answer is that we never intended to build this in-house. We had these long-running workflows already in place. Then we started figuring out that some of the touch points in these workflows could be automated by isolated AI agents, and we started building those agents. Those agents performed very well in isolation, because they were deterministic and we were only dealing with single agents. Then we realized: why don't we just glue them all together, create a multi-agent framework, and also use it for other new use cases? That is how, iteratively, the single-agent paradigm got converted into this multi-agentic framework. And once we started adding more and more layers of abstraction, we never felt the need to migrate the entire logic to an existing framework. So that is one aspect.

[01:34:47]The second aspect is that all of the agent definitions that we have are currently defined in markdown files. For a new use case, we don't need any tech dependency for an agent to work. Usually what happens is that someone from a product team, let's say a product manager who has the domain knowledge and wants to create a new use case, simply creates a markdown file in the template that we have defined. In some cases they can leverage the existing tools that we have, or we can create custom ones. Then they just have to plug those markdown files into our framework, and everything else works seamlessly. So building this in-house has reduced tech dependency, and we are able to play around with new use cases very quickly. It also allows us full control over execution semantics. By this I mean I have full control over how a workflow needs to be executed, what the output should be at each stage, what the SLA for a workflow should be, and so on. Then domain-optimized performance: we have added the abstractions that we absolutely need, and anything that cannot be abstracted is taken care of by the tech team. And then reduced lock-in. Like I think you were saying, if I pick a framework right now, let's say CrewAI, and tomorrow they release a new version which is a more stable one, I would have to migrate my existing code to the new version, number one, and secondly there will always be a learning curve when it comes to adopting a new framework. We don't have to do that if we have built this in-house. Just to quickly summarize everything that I've covered, the core principles that we abide by are these. Reliability: we have checkpoints and state persistence in Redis, which allows us
to also resume from failure points, so basically we have implemented durable execution. Determinism: because of the strong guardrails that we have set for each agent and across the workflow, and because our orchestrator takes control of the execution by decoupling intelligence from orchestration. Scalability: queue-based execution, queue-based autoscaling, and isolated queues per workflow. And cost control: because of all the guardrails and alerts that we have set in place, and so on. And just to summarize everything, this was a tweet that I read just last night, and I think the last paragraph summarizes very well all of the learnings that we have had so far: the companies that will win with AI agents will not be the ones with the smartest model. They'll be the ones where engineers treat agents like junior employees with bad memory and worse judgment, and build supervision around them accordingly. So this approach of building a multi-agentic workflow has worked very well for us in production.

[01:37:58]Uh thank you and happy to answer any questions.

[01:38:02]>> Yeah.

[01:38:09]Yeah.

[01:38:19]>> So basically the evaluation strategy, or let's say the evals of the system, right?

[01:38:22]Yeah.

[01:38:24]>> Yeah.

[01:38:26]So basically there are two aspects to this. One evaluation strategy is on the performance of the entire workflow, or the orchestrator itself. We first want to ensure that the orchestrator is working as expected. By that I mean: given a use case and given an input, is the right sequence of agents being invoked? That is one. Secondly, we also test for things like how many times an API tool or an agent tool is invoked, and what the latency of the entire workflow is. All of that is again platform engineering, apart from the LLM. So that is one aspect. The second part is when we actually want to determine how an agent is performing, or let's say how an LLM provider is performing. In those cases we track all of our costs, that is one aspect, and then we also benchmark different LLM models for a given use case. For example, if I execute an AI agent against Gemini versus Claude, I want to benchmark how the responses are for both of them, and then I choose the best one for my use case. So that is the second, or rather the third, evaluation strategy. And then obviously we also have human QAs, or rather QA engineers, who have created test cases and execute them against a certain test set, which also helps us figure out whether the workflow is performing as expected or not.
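The orchestrator-level checks mentioned here (agent sequence, tool invocation counts, end-to-end latency) could be expressed as a small trace check. This is a sketch under assumptions: the `TraceEvent` shape, the field names, and the thresholds are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TraceEvent:
    agent: str         # which agent ran
    tool_calls: int    # how many tool invocations it made
    latency_s: float   # how long it took, in seconds


def evaluate_trace(
    trace: List[TraceEvent],
    expected_agents: List[str],
    max_tool_calls: Dict[str, int],
    max_latency_s: float,
) -> List[str]:
    """Return a list of human-readable failures; empty means the run passed."""
    failures: List[str] = []
    actual = [event.agent for event in trace]
    if actual != expected_agents:
        failures.append(f"agent sequence {actual} != expected {expected_agents}")
    for event in trace:
        limit = max_tool_calls.get(event.agent)
        if limit is not None and event.tool_calls > limit:
            failures.append(f"{event.agent}: {event.tool_calls} tool calls > {limit}")
    total = sum(event.latency_s for event in trace)
    if total > max_latency_s:
        failures.append(f"workflow latency {total:.1f}s > {max_latency_s:.1f}s")
    return failures
```

None of these checks needs an LLM judgment, which matches the point that this part of evaluation is platform engineering.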

[01:39:58]>> Correct. Yes.

[01:40:03]Yeah.

[01:40:06]>> Sorry.

[01:40:07]>> Human. Yeah.

[01:40:10]>> Yeah.

[01:40:25]>> The problem statement.

[01:40:28]>> Yeah.

[01:40:46]Yeah, >> correct.

[01:40:54]>> Yeah.

[01:40:57]>> Mhm.

[01:41:16]Mhm.

[01:41:25]>> Correct.

[01:41:28]The context that I'm sending is restricted and determined. Yeah.

[01:41:35]>> Correct.

[01:41:39]>> Yeah.

[01:41:49]>> Yeah. So I think the way that we benefited and were able to solve this problem: the major problem was the last line, right? Because of building all of this from scratch, our go-live time was delayed. That was the fundamental business problem that we were facing. By building this framework, what we were able to do is that we were able to >> Yeah. Yeah.

[01:42:13]Which step was the most data?

[01:42:17]>> Data ingestion for the most part is usually the slowest.

[01:42:21]>> Yeah. So writing ETL pipelines I think is still, you know, under our control.

[01:42:25]We have automated aspects of it. Data ingestion, making sense of that data, figuring out how to store it in our data stores: that used to take a lot of time, so we have built agent skills around that. That is one. Secondly, in order for me to go live with a new use case or to launch a new product, I don't have to collect, let's say, crores of records or data points before I can make sense of the data and abstract a product out of it. What I can do is build a multi-agent framework.

[01:42:57]So basically, for me to power, let's say, my consumption plane, I'll have to have some data stored and readily available, I'll have to build a search layer on top of it, then build custom ETL pipelines, and so on. What I can do instead with this framework is that, given a request, I can create a deep research, business-oriented workflow that can fetch all of the details that I need on demand. So that is one aspect. I don't have to wait, let's say, 8 or 9 months just to launch a product. I can simply create a workflow.

[01:43:29]I can fetch details or data on demand, let's say from each of these data sources, write custom ETL pipelines on the fly, and be done with it.

[01:43:38]>> Right. So that's it. Yeah.

[01:43:43]>> Correct.

[01:43:45]>> No. So what we do is we have an ETL framework in place already. We know how to write ETL pipelines. Code generation is always the easiest part in my view.

[01:43:53]Right? So given let's say the data that I'm dealing with and it's collecting on the fly, it automatically creates an ETL pipeline and gets done with it.

[01:44:03]>> It's an agent. Yeah. Yeah. Yeah.

[01:44:10]>> Correct. Just to give you one more example, let's say I want to fetch some data from a publicly available website, for compliance or whatever. In such cases we have defined the navigation and flow of that website, and the limits that we don't want to reach, as part of my knowledge repository, and then my agent leverages all of that along with, let's say, Playwright MCP to get the details on the fly.

[01:44:36]>> Yeah.

[01:44:40]>> Okay. How do you ensure CLI commands are executed securely? >> By giving control of the execution to the orchestrator. Anytime I want to make, let's say, an important decision in my workflow, I delegate it to the orchestrator and not to an LLM agent. Basically, the business logic, the sequence of the workflow, or when to invoke what is controlled by the orchestrator, so that I don't give free will to an AI agent, so to speak.

[01:45:17]>> Yeah.

[01:45:19]>> I mean, you have some statistics in terms >> So I don't really have a statistic as such. For example, I'm not sure if this will make sense, but for one of the use cases where we create, let's say, a 700-page report, the entire workflow, where there are multiple agents coordinating with each other, takes about 10 to 12 minutes and the cost is around four and a half dollars or so. Again, I'm not sure how you can relate to it, and it depends from workflow to workflow, but we are tracking each cost, and depending on the ticket size of the use case that we are getting from the client, we are able to take a call on which model to use for which agent in which use case. Yeah. >> Airflow, we planned on using it. For some of our periodic crons, and for workflows that needed to be triggered periodically, we were using Airflow. But again, it can be replaced with any lightweight DAG service.

[01:46:30]>> Yeah.

[01:46:33]Sorry.

[01:46:34]>> Skills. Okay.

[01:46:39]>> Static.

[01:46:40]>> Yeah.

[01:46:56]>> Yeah.

[01:47:03]Yeah.

[01:47:06]>> Correct.

[01:47:07]So again, everything is instrumented. That is one. All of the logging events are published to, let's say, a landing zone, and then we have Logstash parsers or Beam pipelines where we ingest data into, let's say, ClickHouse. We build Metabase dashboards on top of it. We also use Grafana to monitor the performance and the usage. Combining all of this, we get to know how the system is performing, and with our dashboards we get to know how many failures are occurring on a daily basis. If there is an anomaly, the alerts are triggered, and so on.
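The talk does not say how an anomaly is defined. One simple possibility, shown here purely as an assumption, is to flag a day whose failure count sits well above the recent baseline. The function name, the three-sigma default, and the minimum spread of 1.0 are all hypothetical choices.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(history: Sequence[int], today: int, sigmas: float = 3.0) -> bool:
    """Flag today's failure count if it is far above the recent daily baseline."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    # Floor the spread so a perfectly flat history does not alert on tiny changes.
    spread = max(pstdev(history), 1.0)
    return today > baseline + sigmas * spread
```

In a setup like the one described, a check of this kind would typically run as a scheduled query over the stored events, with the alert routed through the dashboarding tool.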

[01:47:38]>> So you're saying now performances of the models you are numbering the models on top of the blocks which are >> No, no. So basically I did not understand the last part of the question.

[01:47:56]model performances are not predictable.

[01:47:58]>> Yeah.

[01:48:03]>> Correct.

[01:48:15]Correct. Correct. Yeah.

[01:48:18]>> So what understood?

[01:48:20]>> So basically what we do is, again, we take each AI agent individually. We look at it individually. We try different models for the execution of those agents.

[01:48:30]For example, for let's say report generation, I might not really need, let's say, Sonnet 4 or Opus. I can simply go ahead with Gemini 3.1 and be done with it.

[01:48:41]These benchmarks are performed at a per agent level given the use case and then we take a call on which model performs best for that use case and then we glue it together.
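A per-agent model choice along these lines might look like the sketch below. The selection rule (cheapest model that clears a quality bar, else the best available) is an assumption made for illustration, as are the `BenchmarkResult` fields and the placeholder model names; the talk only says the best-performing model for the use case is chosen.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BenchmarkResult:
    model: str        # placeholder model name
    quality: float    # 0..1 score from a hypothetical eval set
    cost_usd: float   # average cost per run


def pick_model(results: List[BenchmarkResult], min_quality: float) -> str:
    """Pick the cheapest model that clears the bar, else the highest-quality one."""
    good_enough = [r for r in results if r.quality >= min_quality]
    if good_enough:
        return min(good_enough, key=lambda r: r.cost_usd).model
    return max(results, key=lambda r: r.quality).model


def pick_per_agent(
    benchmarks: Dict[str, List[BenchmarkResult]], min_quality: float
) -> Dict[str, str]:
    """Apply the same rule independently to every agent."""
    return {agent: pick_model(results, min_quality) for agent, results in benchmarks.items()}
```

Because the choice is made per agent, a cheap model can serve report generation while a stronger one is kept for the agents that need it.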

[01:48:50]>> Yeah.

[01:49:06]Mhm.

[01:49:12]>> No. Agreed.

[01:49:14]>> Yeah. So I think, basically, this agentic framework is also under the security governance of the other products that we have in identify.

[01:49:23]So the same infosec policies apply there as well. So I did not talk about it because >> Yeah.

[01:49:42]>> Correct.

[01:49:43]>> But here you have the right people trying to play around with or game the system, right? So the request validator takes care of all of that. For example, if there are SQL injections happening, you flag that, or if there is anything else. All of the requests that come to this agentic framework come via the consumption plane, which is the API or the UI portal, and we have validations on top of those APIs and the portals. So >> How does it validate?

[01:50:24]>> Just to give you an example, let's say in a use case we take an address from the user, perform some property due diligence checks, and then create a report out of it.

[01:50:33]In such cases the address is taken from a portal, and there are particular validations on top of that portal. For example, you cannot enter a non-city value in the city field. If you enter something in a text field which is, let's say, a SQL injection, our backend service will perform validation on top of it. So the final request that the framework receives in the API will be sanitized and validated, and only then will the agents be invoked.
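A request validator of the kind described could be sketched with allowlists, as below. The city list, the address pattern, and the field names are hypothetical. Allowlist checks like these complement, rather than replace, parameterized queries in whatever service eventually touches a database.

```python
import re
from typing import Dict, List

# Hypothetical allowlist; a real system would load cities from reference data.
KNOWN_CITIES = {"mumbai", "bengaluru", "pune"}

# Allow only characters a postal address plausibly needs (no quotes or semicolons).
ADDRESS_PATTERN = re.compile(r"[A-Za-z0-9 ,./#-]{5,200}")


def validate_request(payload: Dict[str, str]) -> List[str]:
    """Return validation errors; an empty list means agents may be invoked."""
    errors: List[str] = []
    city = payload.get("city", "").strip().lower()
    if city not in KNOWN_CITIES:
        errors.append("city: not a recognised city")
    address = payload.get("address", "").strip()
    if not ADDRESS_PATTERN.fullmatch(address):
        errors.append("address: contains unsupported characters or has a bad length")
    return errors
```

The orchestrator would invoke agents only when the returned list is empty, so malformed input never reaches a prompt.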

[01:51:12]So >> Yes. So basically what the CLI does is execute the markdown file, basically the agent, along with the parameters that we want to send to it. For example, if I want to create a report using the report agent, the orchestrator will invoke the markdown file of that particular agent skill. That is one. It will also need the data points against which it needs to render the report. That is passed as a parameter to the CLI command, and the template is written accordingly.
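One way an orchestrator might fill such a CLI template is sketched below. The `agent-cli` program name, its flags, and the `skills` directory are hypothetical and not from the talk. Building an argument list, instead of a shell string, keeps user-supplied values as single arguments.

```python
import json
from pathlib import Path
from typing import Dict, List

SKILLS_DIR = Path("skills")  # hypothetical location of the agent markdown files


def build_command(template: List[str], skill: str, params: Dict[str, object]) -> List[str]:
    """Fill a CLI template and return an argv list (never a shell string)."""
    if not skill.isidentifier():
        # Rejects path separators and dots, so a skill name cannot escape SKILLS_DIR.
        raise ValueError(f"invalid skill name: {skill!r}")
    values = {
        "skill_file": str(SKILLS_DIR / f"{skill}.md"),
        "params_json": json.dumps(params, sort_keys=True),
    }
    return [part.format(**values) for part in template]


# Hypothetical template; the real CLI and its flags are not named in the talk.
REPORT_TEMPLATE = ["agent-cli", "--skill", "{skill_file}", "--params", "{params_json}"]
```

The resulting list can be passed to `subprocess.run(cmd, timeout=...)` without `shell=True`, which also gives the orchestrator a per-agent execution timeout.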

[01:51:54]>> I was just so am I right in thinking that the CLI templates also template. So then, you know, does that also provide a layer of defense against things like >> So that layer of defense is again provided by the orchestrator, but once, let's say, Claude or any model starts execution at runtime, then again we don't really have control of >> Correct. But the response that we send to the users, or what we want to do with the output, is again validated by the orchestrator. But at runtime, whatever the LLM model does is again something we don't really have Yeah.

[01:52:48]>> Mhm.

[01:53:14]Mhm.

[01:53:23]>> Obviously it So I think this was not the use case for us. Our use case was very simple: we wanted to create a deep research output given some input. In that case, the only permission that we wanted to give to our agents is the kind of tool sets that the agent can work with, and that is part of the guardrails controller. For your use case, I might not even need, let's say, a multi-agentic framework or a workflow. I might just need specialized agents that can do one task well. >> Yeah, so the layer is behind Correct.

[01:54:11]>> The orchestrator is the one.

[01:54:14]>> Exactly. Yeah.

[01:54:17]>> I have a question on the orchestrator.

[01:54:19]So how do you codify the set of steps that the orchestrator has to orchestrate?

[01:54:26]>> So basically, if I talk about the implementation of our orchestration service, it's a two-step process. One, we have created our own domain-specific language (DSL), which is JSON-based, where we specify each of the rules, the conditions of the execution, error handling, and so on. Then we have the code executor step, which leverages a custom DSL file for each workflow and runs the execution. >> And how does it exactly talk to the agent? >> Basically by CLI commands. It leverages the markdown files from agent skills. Then, based on the workflow that we have defined in the DSL, it invokes a particular CLI template along with the variables that the agent needs for its execution. >> And the CLI, how does it make?
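To make the two-step design concrete, here is a toy JSON workflow and executor. The schema (`steps`, `when`, `on_error`) and the `run_agent` callback are invented for illustration; the actual DSL is not shown in the talk.

```python
import json
from typing import Callable, List

# Hypothetical workflow definition; field names are assumptions.
WORKFLOW_JSON = """
{
  "name": "property_report",
  "steps": [
    {"agent": "data_fetch", "on_error": "abort"},
    {"agent": "risk_check", "when": "has_data", "on_error": "skip"},
    {"agent": "report", "on_error": "abort"}
  ]
}
"""


def execute(workflow: dict, run_agent: Callable[[str, dict], dict], context: dict) -> List[str]:
    """Run steps in declared order; the DSL, not the LLM, decides what runs next."""
    executed: List[str] = []
    for step in workflow["steps"]:
        flag = step.get("when")
        if flag is not None and not context.get(flag):
            continue  # condition not met, so the step is never invoked
        try:
            context.update(run_agent(step["agent"], context))
            executed.append(step["agent"])
        except Exception:
            if step.get("on_error", "abort") == "skip":
                continue  # tolerated failure: move on to the next step
            raise  # "abort": surface the error to the caller
    return executed


workflow = json.loads(WORKFLOW_JSON)
```

Here `run_agent` stands for whatever launches the agent, for example a CLI invocation, so sequencing and error policy live entirely in the workflow file.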

[01:55:13]>> So internally, each markdown file defines which tools to use, and if we want to execute code, that is part of the markdown file too. The execution then happens in the container's Python runtime.

[01:55:26]So basically Claude has access to the Python runtime in the container. It will run the code if needed. It will invoke an agent tool if needed. All of that is controlled by the markdown file.

[01:55:36]>> Okay. So you spawn containerized sandboxes for an agent run and then you just close it off.

[01:55:43]>> Correct. Correct.

[01:55:52]>> Yeah.

[01:55:54]something like >> For our orchestration service, yes. But in this case we had already built an in-house lightweight system which uses our own custom DSL templates and our code executor, so we never felt the need to migrate to Temporal. But Temporal is a perfect fit for such kinds of use cases.

[01:56:17]interaction.

[01:56:23]>> Sorry for which part?

[01:56:24]>> On the interaction part.

[01:56:26]>> Yeah.

[01:56:34]>> So the guardrails controller number one that is one part and then orchestrator decides the execution time out for each agent. So by that >> like there are like vector database for you can use embeddings and stuff.

[01:56:47]>> Not really for this use case. No we were building that for uh like we started doing that in the original system then we ended up not using that.

[01:56:57]Is there some kind of so that you mention ballooning of m >> There's no quota as such, but we have alerts whenever there is a spike in, let's say, the output tokens, and then we take a call. Rate >> Rate limiting in terms of, let's say, agent tools or the execution timeout, but no rate limiting on, let's say, the token usage. >> I think people have more questions, so maybe we can take this offline at our networking event after. So please >> in ID5 yeah >> itself and July will be in Bangalore. So once again, thank you all of you for your time, and special thanks to hosting this meeting for Come on.

[01:58:22]And please do give your feedback. It's really helpful.

[01:58:30]WhatsApp is in Bomb and the theme is

#hasgeek #rootconf #platform engineering #phonepe #AI Agents
