Building an enterprise Kafka streaming platform requires implementing an API-first management layer that addresses Kafka's inherent lack of native multi-tenancy by providing tenant isolation (logical, physical, or hybrid), configuration governance through policies, data security with ACLs and quotas, noisy neighbor prevention via resource quotas, and geo-replication across regions; this platform approach transforms Kafka from a cluster-centric setup into a scalable, self-service infrastructure that reduces operational costs while enabling dozens of teams to collaborate securely.
Deep Dive
Engineering Kafka at Enterprise Scale: Building an API‑First Streaming Platform - Kshitiz Anand
Today I'll be presenting on engineering multi-tenant Kafka at enterprise scale at Tesco. Before I begin, I wanted to see how many people here are actually using Kafka, and how many of you have experience deploying and running a platform like, let's say, Kafka or Kubernetes.
Awesome.
Awesome.
You're using MSK. Cool. Okay, so I think all of you will relate to what I'm going to present. This is our story at Tesco of how we engineered this platform.
I'll be sharing the journey in terms of the challenges we faced, and, if you have to architect or centralize a platform like this, how you would go about it and what the fundamentals are. That is what I'll mainly be discussing.
So before I start, a quick glimpse about Tesco.
At Tesco, we are the number one retailer in the UK, and we cater to almost one third of the UK food supply.
We are a team of around 335,000 colleagues in the UK, India, and distributed across the world.
Our revenue was around 6.6 billion the previous year.
One interesting fact is that there were around 70 million shopping trips per week at our Tesco stores and other outlets.
We care about our planet a lot, and we have a target to go carbon neutral, net zero, by 2050.
All these stats you see here are powered by a team of technologists, and we are part of Tesco Technology.
We are a group of around 6,000 technologists working across the UK, Poland, Hungary, and the Czech Republic.
We do the end-to-end platform engineering and design the software that powers each and every Tesco business.
Okay. So, as I was saying earlier, this is the agenda. We will first understand the reality of the Kafka platform at Tesco before we started centralizing it. Then, what is our architecture, and how did we create it? What is a multi-tenant platform? I think you will relate to this a lot, because it is not limited to Kafka; it can be applied to any other platform as well. And finally, I'll conclude with our learnings, impact, and future scope.
Okay. So, let me start with the pre-platform reality. Before we started building this platform, adoption of Kafka was not a problem at Tesco at all. Around 30 teams were already using Kafka at a very good scale before we even began solving the problem. There were around 90 Kafka clusters, about 100 TB of storage, and around 1 billion events per day being produced across these clusters. So adoption was not a problem at all. In fact, when we went live with our system, we met 200% of our target, because teams were so eager to get onboarded; it solves the problem. I'm sure you will relate to this too: if you centralize an infrastructure platform, you solve a very real problem at an industry scale.
Then what was the issue?
The problem was fragmented Kafka expertise. Although these teams were using and running Kafka clusters, the knowledge of how to run the platform was limited to a few teams. A few teams were experts in running it. They knew how to upgrade a Kafka cluster.
They knew how to secure a Kafka cluster.
They knew how to build observability around Kafka clusters. But at the same time, there were teams who struggled to run these maintenance jobs. In fact, at Tesco there were a couple of outages that were very critical and impacted the business.
And that is when the leadership at Tesco took a decision that we should centralize this platform.
The idea was that if you are centralizing, it should not be a ticket-driven system where you raise a ticket with an infrastructure team, who then provision, let's say, a Kafka topic for you.
The idea was to create a platform that is self-service and API-first. So if you have to integrate it with anything, let's say you want to run a Terraform plan, you can integrate with the API. It's a Swagger API; you can call it. And now, in this world of MCP, if you want to expose it as an MCP server, we can do that. That's the whole idea of building an API-first platform, and the leadership at Tesco were very motivated to solve the problem this way.
One of the other critical advantages of centralizing such a platform is that you get the benefit of economies of scale. The capex remains the same while the opex reduces. If you look at our team, we are around six to seven engineers running and maintaining these Kafka clusters for the whole of Tesco.
So you can see the opex does reduce if you build a platform and automate it.
Okay. So, having understood the problem and how we thought about the solution, let's go a little deeper into why implementing such multi-tenancy is hard. There are two critical reasons. One is the Kafka ecosystem complexity, which comes with Kafka itself, whether you're using open-source Kafka or a vendor Kafka. The other is the complex domain requirements. During this developer summit we have gone through many sessions on architecture, and we understood the characteristics of systems: systems have functional and non-functional requirements. When you have complexity in both and you want to architect a platform that is generic enough to solve all these problems, that is when the challenge comes. I'll quickly explain.
When I talk about the Kafka ecosystem: Kafka is not just a set of brokers and Kerberos, right? There is an ecosystem of products around Kafka. You have producers and consumers that are part of microservices. Then you have Kafka Streams apps, Kafka Connect, and Apache Flink as well, which can connect to your Kafka cluster. So there are different integration patterns you have to think about when you are architecting a platform, and when you build governance around it, that is one more piece you have to think about.
One more thing about Kafka is that Apache Kafka is not inherently multi-tenant. It's not like Kubernetes, where the namespace is a first-class citizen and you can architect your multi-tenancy around it. But what Kafka does give us is all the right fundamentals. You can devise your own naming pattern for topics, and you can define ACLs and quotas. Those are the fundamental building blocks of any multi-tenant system.
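As a rough illustration of how a naming pattern can carry tenancy, here is a minimal sketch. The `<tenant>.<domain>.<name>` convention and the helper names are assumptions for illustration only; the talk does not describe Tesco's actual naming scheme.

```python
import re

# Hypothetical convention: "<tenant>.<domain>.<name>", lowercase, hyphens allowed.
# This is an illustrative assumption, not the scheme used in the talk.
_PART = r"[a-z][a-z0-9-]*"
TOPIC_PATTERN = re.compile(rf"^(?P<tenant>{_PART})\.(?P<domain>{_PART})\.(?P<name>{_PART})$")


def topic_name(tenant: str, domain: str, name: str) -> str:
    """Build a tenant-scoped topic name, rejecting parts that break the convention."""
    candidate = f"{tenant}.{domain}.{name}"
    if not TOPIC_PATTERN.match(candidate):
        raise ValueError(f"invalid topic name: {candidate!r}")
    return candidate


def owner_of(topic: str) -> str:
    """Recover the owning tenant from a topic name that follows the convention."""
    match = TOPIC_PATTERN.match(topic)
    if match is None:
        raise ValueError(f"topic does not follow the convention: {topic!r}")
    return match.group("tenant")


def tenant_prefix(tenant: str) -> str:
    """Prefix that a prefix-based ACL rule for this tenant could be attached to."""
    return f"{tenant}."
```

With a convention like this, ownership can be derived from the name alone, and a single prefix-based rule per tenant can cover all of that tenant's topics.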
Coming to the domain requirements: Tesco's functional and non-functional requirements are very complex.
There are teams who produce events at a very high scale throughout the day.
But there are teams who produce a burst of traffic for, say, 10 minutes or even 30 minutes, and then produce no traffic for the rest of the day.
There are teams who run active-active: there are two regions, and they are active across both regions throughout the day.
But there are teams who run in active-passive or even active-standby mode. So how would you solve for all such teams? You also have to think about the money that goes into running that infrastructure, because we are part of the platform team and ultimately our aim is also to reduce cost.
There are teams who want low latency.
And there are teams who are fine with a little more latency, because low latency is not part of their business requirements.
So now, having understood the problem space from both the Kafka perspective and the Tesco perspective, let's quickly deep dive into the streaming platform architecture we have built. Tenant A and tenant B, which you see here, are our customers.
You can think about the different personas of our tenants. There are activities that are mainly admin-related: you configure a topic, you create users, you give access to different users, you create quotas, and so on.
And then there is the actual production and consumption of events, whether through a producer or consumer in a Java or Spring Boot application, streaming apps, or even Kafka Connect.
The way we have solved all these use cases is with two different components in our Kafka streaming platform. One is the control plane, which we built from scratch, and in it you see the management API. This is what the API-first platform is all about. For all the admin activities, we expose APIs. You have APIs for creating a topic, APIs for configuring the topic, APIs for configuring your ACLs, and so on. You can in fact define your own quotas using our API and request access to other teams' topics.
Then there is the data plane. What we decided centrally is that we would not come between the data plane and any producer or consumer. That continues to happen over Kafka's open-source TCP protocol.
Because if we sit there, it brings in a lot of complexity.
You would have to build your own library. It does solve a problem, but you have to think about the trade-off: how would you maintain and run those libraries? That's an additional overhead, and you have to ask what you are actually winning. If you keep it on the open protocol, then any other use case, even in the future, can connect directly using the protocol.
So, let me quickly show you what the flows look like. Any domain connects to the Kafka cluster directly, as I was explaining earlier, so production and consumption happen directly. But if you want to create a topic, configure it, or manage quotas, you will not be able to do that using Kafka's open-source libraries. You have to use our management API for all these activities, and we have created interfaces that implement them on the cluster.
Now, having understood our architecture and the problem we solved, let's quickly look at the different components of building a multi-tenant platform. I'm sure you will relate this not only to Kafka but to any other platform you are designing in which you host and manage multiple tenants.
The first thing is tenant isolation. Because you are running tenants on a single infrastructure, a single platform, how do you isolate your tenants: logically, physically, or in a hybrid way? That is the first problem to solve. Then comes configuration. In Kafka, that means topics. How do you ensure proper governance around all the topic configurations? Each team wants a different flavor of configuration. How are you going to supply that?
Then comes data security. How do you ensure that data is secured across tenants? One tenant's data should not be visible to other tenants. But at the same time, you should be open enough to grant read access or other such privileges so that tenants can expose their data to the outside world.
Then there is the noisy neighbor. Because you are running all those tenants on a single infrastructure, if one day one tenant goes crazy, how do you ensure it does not impact the other tenants? That's the whole thing about noisy neighbors.
And obviously, geo-replication. In today's cloud world, we can't rely on a single region. You have to replicate your data across multiple regions, not only for disaster recovery but for other needs as well.
Let's quickly deep dive. When we talk about tenant isolation, what are the different ways of isolating tenants? One is physical isolation, in which you give dedicated Kafka clusters to your tenant, based on the use case and the requirements. Here, each Kafka icon represents one broker. So let's say tenant A has been given one Kafka cluster consisting of three brokers, and similarly tenant B. This is what physical isolation is all about.
Here the tenant gets the maximum potential of the entire infrastructure,
because it's dedicated to them.
Then there is logical isolation, in which the tenants share a single Kafka cluster. That gives you the maximum benefit of economies of scale.
The third variation is hybrid isolation. Here you run and maintain only a single Kafka cluster, but you dedicate a set of brokers to a particular tenant. So tenant C and tenant D have dedicated brokers, while tenant A and tenant B share. Which one you use depends on your use case.
Right? At Tesco, we feel that logical isolation gives you the maximum benefit, and most use cases should fit logical isolation. Only if there is a compliance requirement, or the criticality of the business is such that you need isolation, should you move, first to hybrid isolation, and try to solve it there. And if that also does not solve your problem, let's say you have compliance issues, then you go to physical isolation. For example, Tesco consists of multiple organizations, and one organization is mainly for data analytics. We may want to physically isolate them because their networking pattern would also be different.
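The escalation order described above (logical by default, then hybrid, then physical) can be sketched as a small decision function. The two boolean inputs are a simplification assumed for illustration; a real assessment would weigh compliance, criticality, and networking in more detail.

```python
from enum import Enum


class Isolation(Enum):
    LOGICAL = "shared cluster, shared brokers"
    HYBRID = "shared cluster, dedicated brokers"
    PHYSICAL = "dedicated cluster"


def choose_isolation(needs_isolation: bool, hybrid_is_sufficient: bool = True) -> Isolation:
    """Pick the cheapest isolation level that satisfies the tenant.

    needs_isolation: a compliance or business-criticality requirement exists.
    hybrid_is_sufficient: dedicated brokers on a shared cluster would satisfy it.
    Both flags are hypothetical inputs used only for this sketch.
    """
    if not needs_isolation:
        return Isolation.LOGICAL      # default: most use cases fit here
    if hybrid_is_sufficient:
        return Isolation.HYBRID       # try dedicated brokers first
    return Isolation.PHYSICAL         # last resort: a dedicated cluster
```

For instance, the analytics organization with a different networking pattern would map to `choose_isolation(True, hybrid_is_sufficient=False)`.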
Now, let's quickly move to topic configuration.
As I was explaining earlier, multiple tenants can have different requirements for the configuration they want on their topics. One more thing we have built into our platform is hot storage and cold storage. We move events after a certain amount of time, let's say 6 hours or 12 hours, from hot storage, which is on disk, to a comparatively cheaper form of storage, which is blob storage, like S3, Azure Blob, or any other such blob storage. So here, two tenants are requesting topics. Tenant one wants a total retention of 7 days and a hot storage retention of 6 hours, with cluster type shared; they are fine with sharing with other tenants. But tenant two wants a dedicated one, with a hot storage retention of 1 day
and a total retention of 21 days.
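To make the two requests concrete, here is a small worked conversion of the retention values into milliseconds. It assumes, purely for illustration, that the hot and cold split maps onto Apache Kafka's tiered storage topic settings (`retention.ms`, `local.retention.ms`, `remote.storage.enable`); the talk does not say which mechanism the platform actually uses.

```python
from datetime import timedelta


def retention_configs(total: timedelta, hot: timedelta) -> dict[str, str]:
    """Translate total and hot-storage retention into topic config values.

    Assumption: hot storage corresponds to local (on-disk) retention and
    total retention covers local plus remote (blob) storage.
    """
    if hot > total:
        raise ValueError("hot storage retention cannot exceed total retention")

    def to_ms(duration: timedelta) -> str:
        return str(int(duration.total_seconds() * 1000))

    return {
        "remote.storage.enable": "true",
        "retention.ms": to_ms(total),
        "local.retention.ms": to_ms(hot),
    }


# Tenant one: 7 days total, 6 hours hot. Tenant two: 21 days total, 1 day hot.
tenant_one = retention_configs(timedelta(days=7), timedelta(hours=6))
tenant_two = retention_configs(timedelta(days=21), timedelta(days=1))
```

Here tenant one ends up with `retention.ms` of 604800000 and `local.retention.ms` of 21600000, while tenant two gets 1814400000 and 86400000.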
The way we solve that is with policies in our management API. We have a policy for each cluster. We have a policy for each tenant, set when they get onboarded to our system. Then we have a request-level policy, and ultimately rules are what get executed against each request. From those we decide whether the request is allowed, depending on whether the tenant was onboarded as, let's say, a platinum customer or a gold customer, or on whatever the other onboarding criteria were. Based on that, we either allow or disallow the request. If it is allowed, it gets provisioned in the Kafka cluster. So it's self-service, but there are guardrails around it. That is the important point here.
It may look like: okay, if I make a request, I will get a topic.
You will get one, but what you get depends on your tier.
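A minimal sketch of tier-based guardrails like the ones described above. The tier names come from the talk, but the limits, field names, and rule set are hypothetical values chosen so the two example requests can be evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    max_total_retention_days: int
    max_hot_retention_hours: int
    dedicated_allowed: bool


# Hypothetical limits per onboarding tier; the talk names tiers but not their limits.
TIER_POLICIES = {
    "gold": TierPolicy(max_total_retention_days=7, max_hot_retention_hours=6, dedicated_allowed=False),
    "platinum": TierPolicy(max_total_retention_days=30, max_hot_retention_hours=24, dedicated_allowed=True),
}


@dataclass(frozen=True)
class TopicRequest:
    tenant_tier: str
    total_retention_days: int
    hot_retention_hours: int
    cluster_type: str  # "shared" or "dedicated"


def evaluate(request: TopicRequest) -> list[str]:
    """Return the violated rules; an empty list means the request is allowed."""
    policy = TIER_POLICIES.get(request.tenant_tier)
    if policy is None:
        return [f"unknown tier: {request.tenant_tier}"]
    violations = []
    if request.total_retention_days > policy.max_total_retention_days:
        violations.append("total retention exceeds tier limit")
    if request.hot_retention_hours > policy.max_hot_retention_hours:
        violations.append("hot storage retention exceeds tier limit")
    if request.cluster_type == "dedicated" and not policy.dedicated_allowed:
        violations.append("dedicated cluster not available for this tier")
    return violations
```

Under these assumed limits, tenant one's request passes on the gold tier, while tenant two's request (21 days, 1 day hot, dedicated) passes only if that tenant was onboarded as platinum.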
The third critical aspect is data security. Here, we define the ownership of a topic, and whoever owns the topic has write access to it. If you see here, there are two tenants, tenant one and tenant two. Both have their own producer and consumer.
Whenever tenant one creates a topic, they get write and read access to all of their topics by default. They don't have to do anything about it. It's that self-serviceable: as soon as your topic is created, you can start producing to and consuming from it. We in fact give you the entire connection details: the bootstrap servers, the SASL config, how you connect to our cluster, how you authenticate and authorize. You don't have to think about it. We return it in the response payload as well.
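As a sketch of how a client might use those connection details, the function below turns a response payload into a client configuration using librdkafka-style property names. The payload field names (`bootstrap_servers`, `security_protocol`, `sasl_mechanism`) and the example values are assumptions; the talk does not show the actual response schema.

```python
def client_config(connection: dict, username: str, password: str) -> dict:
    """Map a (hypothetical) management API response onto Kafka client properties."""
    return {
        "bootstrap.servers": ",".join(connection["bootstrap_servers"]),
        "security.protocol": connection["security_protocol"],
        "sasl.mechanism": connection["sasl_mechanism"],
        "sasl.username": username,
        "sasl.password": password,
    }


# Example payload with made-up hosts and settings, for illustration only.
example_connection = {
    "bootstrap_servers": ["broker-1.example.internal:9092", "broker-2.example.internal:9092"],
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "SCRAM-SHA-512",
}
config = client_config(example_connection, "tenant-one-app", "not-a-real-secret")
```

The resulting dictionary is the kind of configuration a standard Kafka client would accept, which fits the decision to leave the data plane on the native protocol.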
But if, let's say, tenant two wants to produce or write to tenant one's topic, that is not allowed by design.
We never allow tenant two to produce to tenant one's topic. In fact, they would not even be able to discover the topic, because they don't have access. In the future we may need to support topic discoverability because of other needs, but right now discovery is not allowed.
But if you want to consume from another tenant, you can explicitly request read access. We have APIs in which you give the other tenant's topic name, and a workflow runs in which the owner gets an email and has to confirm that this is allowed. Then the requester gets read access on the topic.
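The ownership and read-access rules above can be modeled in a few lines. This is an in-memory sketch with hypothetical class and method names; it stands in for the approval workflow and does not reflect how the platform actually stores or enforces ACLs.

```python
class TopicAccess:
    """Toy model: owners can read and write, others can only read after approval."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}           # topic -> owning tenant
        self._readers: dict[str, set[str]] = {}    # topic -> tenants granted read access
        self._pending: set[tuple[str, str]] = set()  # (topic, tenant) awaiting approval

    def create_topic(self, tenant: str, topic: str) -> None:
        self._owner[topic] = tenant
        self._readers[topic] = set()

    def can_write(self, tenant: str, topic: str) -> bool:
        # Only the owner may ever produce to a topic.
        return self._owner.get(topic) == tenant

    def can_read(self, tenant: str, topic: str) -> bool:
        return self.can_write(tenant, topic) or tenant in self._readers.get(topic, set())

    def request_read(self, tenant: str, topic: str) -> None:
        if topic not in self._owner:
            raise KeyError(f"unknown topic: {topic}")
        self._pending.add((topic, tenant))

    def approve_read(self, approver: str, tenant: str, topic: str) -> None:
        if self._owner.get(topic) != approver:
            raise PermissionError("only the topic owner can approve read access")
        if (topic, tenant) not in self._pending:
            raise LookupError("no pending read request for this tenant and topic")
        self._pending.remove((topic, tenant))
        self._readers[topic].add(tenant)
```

Note that there is no code path that grants write access to a non-owner, which mirrors the "not allowed by design" rule.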
So now, noisy neighbors, as I was explaining earlier. How do you ensure that the traffic patterns and load of one tenant do not impact the other tenants? This is one of the most critical pieces to solve here.
What we do is benchmark all our clusters. We know the bandwidth of each broker in terms of producer rate, consumer rate, and the number of partitions it can store.
We have benchmarked all the brokers, and we have benchmarked the entire Kafka cluster as well. Let's say we have a Kafka cluster; we did the load testing and benchmarked it, and we say this cluster's current capacity is 1 GB per second in terms of producer rate and 3 GB per second in terms of consumer rate. It's usually three times; with fan-out, you think of consumption as about three times production. So that's our cluster bandwidth. Now, whenever a new tenant comes and requests access, they also say: this is my producer rate, this is my consumer rate.
And how do they get to know that? This is one of the difficult problems to solve. How would they know they will be producing at this rate for, let's say, six months? This is a very difficult problem for any application. Even if you are running, let's say, a simple calculator application and putting load on a server, it is very difficult to predict. So we allow tenants to run various kinds of load tests in non-prod environments so they can find out. Or, if they are already using Kafka from before, they will have some kind of observability built in. They will be pushing these metrics to their New Relic or Grafana dashboards, and we get it from there. Based on that, we extrapolate and give them some buffer.
At the same time, it's not like we will use our entire 1 GB of bandwidth. We always keep a buffer; we operate at 40 to 50%. So let's say tenant one's requirement was 100 MB per second and we have given it to them, but all of a sudden their business use case changes and they need, let's say, 150. It should not be that they cannot get it. They should be able to get it. And we have various means at our disposal. If there is an immediate need, we already have a buffer of 40 to 50% that we can give from.
Or we can add another broker, or we can expand the disk, because the produce rate and consume rate usually impact your disk.
One of the experiences I can share from running Kafka clusters is that you have to think from the memory perspective and the disk perspective. CPU utilization is usually on the nominal side, as long as you are not using some kind of strong encryption pattern or anything like that.
So that's what we do. We say: okay, we'll add one more broker, or we'll add more disk.
That's the decision we have to make. And we have very mature CI/CD pipelines through which we deploy and make any changes, so it's just another change request on our side.
Here, if you see, both tenant one and tenant two are onboarded, with quotas of 100 MB per second and 200 MB per second. That leaves 700 MB per second on our cluster, and if we keep a buffer of, say, 400 MB per second, 300 MB per second is still left for us to give.
And that's how you solve the noisy neighbor problem from this perspective.
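The quota arithmetic above can be written down as a simple admission check. This sketch treats 1 GB per second as 1000 MB per second and uses a fixed 400 MB per second buffer, as in the example; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterCapacity:
    produce_capacity_mb: int   # benchmarked produce bandwidth, MB/s
    reserved_buffer_mb: int    # headroom kept unallocated for bursts, MB/s
    quotas_mb: dict = field(default_factory=dict)  # tenant -> produce quota, MB/s

    def allocated(self) -> int:
        return sum(self.quotas_mb.values())

    def remaining(self) -> int:
        """Capacity not yet promised to any tenant."""
        return self.produce_capacity_mb - self.allocated()

    def grantable(self) -> int:
        """What can still be handed out without eating into the buffer."""
        return self.remaining() - self.reserved_buffer_mb

    def onboard(self, tenant: str, produce_mb: int) -> bool:
        """Admit the tenant only if its quota fits outside the reserved buffer."""
        if produce_mb > self.grantable():
            return False
        self.quotas_mb[tenant] = produce_mb
        return True


cluster = ClusterCapacity(produce_capacity_mb=1000, reserved_buffer_mb=400)
cluster.onboard("tenant-one", 100)
cluster.onboard("tenant-two", 200)
# remaining() is now 700 and grantable() is 300, matching the figures in the talk.
```

A request that does not fit would be the point at which the platform team decides between dipping into the buffer, adding a broker, or expanding disk.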
Obviously, there are other kinds of noise that come from tenants. This is only from the perspective of data produced, but there can also be noise in terms of the number of connections they create.
You see that kind of noise a lot when you run a database like Redis or anything like that, because there the number of connections explodes.
So those are other quotas you can set as well. We are also experimenting a lot in this area by thinking about building a gateway, and I'll talk about it under future scope. We want to build a gateway at the TCP layer that understands the Kafka protocol, because Kafka has its own specific TCP protocol.
The last building block is geo-replication. As I was telling you earlier, in the current world you can't run your clusters or platform all in one region. You have to have another region, not only for disaster recovery but for other needs as well,
such as simply copying the data. What we usually prefer is an active-active setup. We run region one and region two, with an actual byte-by-byte copy of all the data. If you see here, the different kinds of data and metadata that Kafka clusters hold are your ACLs, obviously the events your clients are producing, your configs, and your offsets.
And then there is the set of Kafka users and quotas, which exists because we have implemented multi-tenancy.
Any replication technology you use today will copy the items above, like ACLs and the others, but it will not copy the Kafka users and quotas for you. That is something we have built.
So whenever, let's say, there is a disaster in one region and we want to build another, we can just spin up another region and everything gets copied to it, including all the ACLs, the users, and all those details.
So we always remain active-active across regions. We also have flavors of active-passive, because it helps reduce cost, and we give that to certain tiers that don't need an active-active setup.
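The talk does not describe how the user and quota copy is implemented. One plausible shape, shown here strictly as an assumption, is a reconciliation step that diffs per-user quota state between the source and target regions and applies the difference. The quota keys `producer_byte_rate` and `consumer_byte_rate` are Kafka's standard quota names; everything else is hypothetical.

```python
def plan_sync(source: dict, target: dict) -> dict:
    """Diff per-user quota state between two regions (user -> quota settings)."""
    return {
        "create": {u: q for u, q in source.items() if u not in target},
        "update": {u: q for u, q in source.items() if u in target and target[u] != q},
        "delete": sorted(u for u in target if u not in source),
    }


def apply_sync(target: dict, plan: dict) -> dict:
    """Return the target region's state after applying the plan."""
    result = {u: q for u, q in target.items() if u not in plan["delete"]}
    result.update(plan["create"])
    result.update(plan["update"])
    return result


region_one = {
    "tenant-one": {"producer_byte_rate": 100_000_000, "consumer_byte_rate": 300_000_000},
    "tenant-two": {"producer_byte_rate": 200_000_000, "consumer_byte_rate": 600_000_000},
}
region_two = {
    "tenant-one": {"producer_byte_rate": 50_000_000, "consumer_byte_rate": 300_000_000},
    "stale-user": {"producer_byte_rate": 1, "consumer_byte_rate": 1},
}
plan = plan_sync(region_one, region_two)
```

Applying the plan with `apply_sync(region_two, plan)` brings the second region's users and quotas in line with the first. Treating deletions as part of the sync is a design choice assumed here, not something stated in the talk.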
Okay.
So, sharing the impact we have had, and I'm sure you will relate to it if you have to build such a platform at enterprise scale: within a short span of 1 year, we onboarded around 25 teams. We have 50 active clients running on our clusters.
We manage around 700 topics via our APIs.
Averaged across the day, the production rate is around 350 MB per second and consumption is around 750 MB per second.
Now, quickly, the learnings. As I was sharing throughout the session, the first is that Kafka lacks a native multi-tenancy model, and it depends on you how you model it. It's an interesting problem to solve, and I still feel it is not something that can be generic. You have to think from your use case, your company, or your client perspective, and then solve it.
The second is that isolation improves reliability and security. As I was explaining with the noisy neighbor and the different forms of isolation, if you do it, your system will be reliable and secure, because one tenant will not cause issues for other tenants.
The third is that the platform is greater than the Kafka cluster. Even though a Kafka cluster is highly reliable and you can use it in vanilla form, many of the enterprise problems, such as governance and reducing cost, depend on how you architect the platform.
And the last interesting one is ease of greenfield onboarding. For teams who have not used Kafka before and want to adopt it, it becomes much simpler. You were mentioning MSK, right? The ease of use is very good there. Similarly, here you just onboard your tenant, call an API, and your topic gets created. That solves a lot of problems for greenfield teams who just want to find out whether they want to use Kafka or not.
Now, what are the future things we are working on? This is very interesting; personally, I'm very interested in all of these.
One is the multi-cloud platform. We want to run our Kafka clusters across clouds. We have our own private cloud, and we use multiple public clouds as well. So we want a setup that runs across clouds, so that if you have to replicate data to any other cloud, we are in a position to do it.
Second is a self-healing platform using AI. This is also quite interesting: you use GenAI to analyze the logs and the different observability signals you are seeing. We use New Relic and the OpenTelemetry format for our observability data. So how can you read and parse it and give it to GenAI so that the platform gets auto-healed? I'm sure a lot of you will relate to this. What we do today is, if there is an error anywhere, we get the logs, send them to ChatGPT or Claude, and ask: what should I do?
What is the root cause? So what if all of that happened automatically and then, obviously with guardrails, got applied to the platform?
Third is integrating GenAI to increase developer productivity.
How would we expose the platform as an MCP server, as I was explaining earlier, so that even topic creation and configuration happen based on your use case?
And the last is the gateway for Kafka TCP. I touched on this earlier. How can we create a Kafka protocol-aware gateway,
one that understands the Kafka protocol on port 9092? It is not a layer 7 load balancer that we use; it's at layer 4.
So how do you do it? When you go into this problem, it's a very interesting one, and not many gateways are solving it well.
Even if you look at outside products, they're not solving it well, because of how Kafka is built. The entire logic of building the connection and understanding the message sits in the broker logic, if you see the source code. So it's about taking that subset of execution and putting it at a different layer.
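To give a feel for what "protocol-aware" means at this layer, here is a minimal sketch that parses the start of a Kafka request frame: the 4-byte size prefix followed by the API key, API version, correlation ID, and client ID from the request header. It is only an illustration of the first step such a gateway would need, not a description of the gateway being planned, and it covers only the few API keys listed in the lookup table.

```python
import struct

# Size prefix (int32), api_key (int16), api_version (int16),
# correlation_id (int32), client_id length (int16); all big-endian.
_FIXED = struct.Struct(">ihhih")

# A few well-known API keys from the Kafka protocol.
API_NAMES = {0: "Produce", 1: "Fetch", 3: "Metadata", 18: "ApiVersions"}


def parse_request_header(frame: bytes) -> dict:
    """Parse the size prefix and leading header fields of one request frame."""
    if len(frame) < _FIXED.size:
        raise ValueError("frame too short for a Kafka request header")
    size, api_key, api_version, correlation_id, client_id_len = _FIXED.unpack_from(frame, 0)
    client_id = None  # a length of -1 encodes a null client ID
    if client_id_len >= 0:
        start, end = _FIXED.size, _FIXED.size + client_id_len
        if len(frame) < end:
            raise ValueError("frame truncated inside client_id")
        client_id = frame[start:end].decode("utf-8")
    return {
        "size": size,
        "api": API_NAMES.get(api_key, f"api_key={api_key}"),
        "api_version": api_version,
        "correlation_id": correlation_id,
        "client_id": client_id,
    }


def build_frame(api_key: int, api_version: int, correlation_id: int, client_id: str) -> bytes:
    """Build a header-only frame for trying out the parser (no request body)."""
    encoded = client_id.encode("utf-8")
    body = struct.pack(">hhih", api_key, api_version, correlation_id, len(encoded)) + encoded
    return struct.pack(">i", len(body)) + body


header = parse_request_header(build_frame(0, 9, 42, "tenant-a"))
```

Once a gateway can read these fields, it could in principle count or limit requests per client, which is the kind of control a plain layer 4 load balancer cannot apply.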
So, these are the future scope items we'll be working on.
And finally, before I conclude, a quick intro to our team and the different architectural problems we solve. Kafka is only one such solution, but as I was explaining earlier, it's a platform, so you can see the pattern.
How we have solved it for Kafka, can we solve it in a similar way for, let's say, Redis? Can we solve it in a similar way for, let's say, Vault? Or for load balancers on our private cloud? This is what we see here as the resource layer. The resource layer could be any infrastructure component we want to build multi-tenancy around.
Then there is what we expose to our systems and app engineers. A system could be any system: your MCP server or, let's say, your Terraform script. App engineers are those directly using your UI or your API.
It goes via the API gateway, the management API I was explaining earlier.
And then we have the control plane domain layer and observability. There is one for each of these domains: streaming is for Kafka, cache is for Redis, database could be Couchbase, Postgres, whatever it is.
And then you should have observability around all of these.
So that's the fundamental building block of all the platform architecture we are building at Tesco.
Yeah. And finally, if you're interested and you like what we are doing and how we are building the platform, please scan this QR code and apply for any interesting opportunities you find on our careers website. That's it from my end.
Thank you.