Paimon, an open-source table format from Alibaba, offers an alternative to Cassandra for indexing Iceberg tables by using an LSM tree structure with Parquet files on object storage, enabling efficient nested loop index joins through Flink while maintaining compute-storage separation and eliminating the need for external database systems.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Icestream on Object Store! | Personal Projects With Ex-Google SWE, Ep 26Added:
Hello everybody and welcome back to the channel. This week on JNHL, I'm doing a deep dive into what I'll call the Chinese data stack, becoming an undercover secret agent to investigate all of the open table formats beyond Iceberg. And in this particular week, stepping into the realm of Apache Paimon. Let's go ahead and get into it.
All right, so if you were here last week, basically I built out Ice Stream, which is pretty cool because I've been working on it for a while and trying to think about the best ways to actually go ahead and do this thing.
So, the thing that I basically settled upon was one to do it asynchronously because I find that that makes it a lot easier to not have to worry about a lot of the race conditions that can come up with Iceberg tables and maintaining a secondary index on them. So, when you do something asynchronously, it means that I have the privilege of being able to look at the commit history and make the right choice rather than potentially worrying about two different commits going on concurrently and building the right index as a result of that. So, in my opinion, uh doing streaming in Iceberg asynchronously has made my life much easier as a developer. I think most other developers probably don't care too much that it's asynchronous uh and just prefer simplicity and better architectural design.
>> [snorts] >> Now, I made one pretty crucial mistake, which is that I ditched one of kind of the guiding principles behind the architecture of something like Iceberg in the first place.
And that is that people fundamentally don't like databases these days.
I personally have not been around the block so long that I have dealt with a ton of complications of having to manage databases, but at the end of the day, I recognize that it is another external system that you have to look after, and it is another thing that you have to scale out manually yourself. And so, to that end, I recognize and respect the fact that one of the reasons pop you know, technologies like Iceberg is so popular and even all of these OLTP database ports to object store like Neon or so popular is because you can pretty much reliably trust the fact that your object store like S3 or GCS or Azure cloud storage is going to work. Like it's not going to fail you. They have really good durability guarantees. You're probably not going to screw it up and that's really nice. And so now you can have all the stateless compute that just builds around that. And if that goes down, that's okay because your data's not lost. You can spin up some other [ __ ] and keep using that.
So just as a quick refresher on the design, basically the way that Iceberg async works is that it is continuously polling our Iceberg catalog to see any changes made to relevant tables. It's going to Apache Spark and saying, "Hey, it's some new data files just came in.
Let's go ahead and index those." So it's going to put that data into Cassandra over time. And then if those get interleaved with some equality delete files, it'll say, "Ooh, this looks like an equality delete file that I might actually be able to go ahead and merge into an existing data file." Or or rather I can go ahead and merge it into existing data that's in the index so that I can convert it to positional deletes. And the index is going to help us because of some math that we did uh last episode, which I'm just going to quickly go over one more time. Basically the idea here is that again, you know, you are having relatively few deletes relative to the amount of data in your partition. So taking advantage of an index where you can do a logarithmic scan on the number of rows of data for every single delete is a lot cheaper than loading all of this Cassandra data linearly back into Spark and then executing the join there. That is the one thing you want to avoid cuz there's going to be a [ __ ] ton of data in Cassandra and having to send it over the network to Spark so that Spark can execute some join between the Iceberg data and pretty much all of the stuff that you're storing in Cassandra is inefficient.
So from a performance perspective, this is really nice, right? It actually does get the job done. It meets the criteria of, "Hey, we need an index and we need one that's fairly simple to use.
However, ultimately, I think we still got some pushback. So, I got some pushback from a couple of people. I'm going to come over here and first off, I'm just going to go from the one biggest hater comment, which is basically saying, you know, if you are struggling on getting your data to Iceberg, you probably don't understand open table formats.
Now, I want to disagree with that because I quite you know, I understand open table formats pretty well at this point. Um but unfortunately, a lot of people have looked to use open table formats as their sink for all pieces of data and they don't really want to compromise on the liveliness of that data. And so, that's where you see all of these streaming systems coming in trying to improve that. I think if this wasn't a legitimate problem, A, you wouldn't have seen any acquisition of Moon link to Databricks, which was trying to solve this problem of streaming into Iceberg. And B, I wouldn't have gone to the Iceberg Summit and heard from three to four different parties that they're all basically struggling from this problem. One of them being like literally a director at Google. So, I think it's a problem even at pretty sophisticated organizations and I'm not necessarily going to take this guy at face value. I think his organization probably just has different data properties and while I respect his feedback, I don't really completely agree with it.
What he says is that you need a distributed database with a proper in-memory write-ahead log like ClickHouse, Doris, Pinot, or Druid, which is going to make this problem a lot easier. Now, I do actually want to highlight that he makes a really really good point, he or she. Um using a write-ahead log like you would have in ClickHouse, Doris, Pinot, or Druid is going to help with deduplicating data a lot. However, it's not going to catch every single example of data deduplicate data duplication, right? If you have a write-ahead log and it's in memory and you're using a binary search tree to help deduplicate data, after some period of time, that data is getting flushed. And for very high-volume pipelines, that data is getting flushed flushed very frequently.
So, So, not necessarily going to always be able to deduplicate in your in-memory write-ahead buffer. Uh sometimes you're going to have to admit a quality deletes. It's somewhat inevitable. I agree that there's some amount of temporal locality as rows come in that uh you know, you probably are more likely to catch duplicates of that row coming in. Uh but it's not guaranteed and I think if it were guaranteed for everyone uh then, you know, they would all just be using ClickHouse or Pinot and Druid. So, I responded to that uh basically saying that you know, it's just not an ideal streaming format, whatever. And then the second comment, which I personally found way more interesting uh because I think this is something that I can actually iterate on instead of being kind of just told to [ __ ] off, is the fact that um this guy in particular is really anti-Cassandra. Now, he basically said, "Personally, um you know, have we considered Pinot or plain RocksDB as an on-disk indexing alternative?" So, I basically asked him, "Hey, would it be cool if I use Postgres?" And then he proposed two different solutions that would effectively go um you know, on object store so that you only have to use object store and everything is still files on object store.
Now, to that I basically responded, "Well, no, I don't want to use RocksDB because even though that technically kind of works with what I'm doing, uh it's hard to distribute and uh it's just going to be a lot of code on top of that. But Pinot, I could use." And so, that's kind of what brings us to this video.
So, like I mentioned, there's a couple of downsides of using Cassandra. For starters, when I provision a Cassandra cluster, I'm kind of implicitly using it for its CPU, right? Like I've got this Cassandra cluster right here and it's storing a bunch of data rows in there and it's going to be doing a lot of nested look up joints, meaning that Spark is going to be sending it a bunch of data and Cassandra in turn is going to be basically reading all of that data and looking things up in its index.
But the problem that I have now is that Spark is going to only kind of sparsely be using its CPU resources. It's going to have to burst a lot because when it's reading in data files, that's going to be fairly compute intensive reading those in from Iceberg.
But when it's reading in delete files, that's basically nothing and then you're just wasting a bunch of CPU and all of that CPU is instead being used in Cassandra.
So the main factor that I'm kind of sympathetic to is the fact that you don't want to have two different systems that you have to monitor and provision and deal with. You probably want one system that's centralizing all of your compute. And then you want to another place that's just going to hold all of your files on object store. Now Iceberg is already completely located on object store. So what can we do to basically ensure that our index is also located on file store?
Now to that end, basically the main process that kind of does this right here is going to actually be Paimon.
Right? So what we do in Paimon and this is kind of where the whole like Chinese data stack things comes in is that Paimon is effectively kind of like Iceberg except it has this concept of a primary key table. And when you build a primary key table in Iceberg, it's basically like taking all of the files of a RocksDB database and putting them on object store, which is really cool. Right? So actually Paimon writers when they're first flushing data, they'll actually aggregate it in memory so that they can flush bigger buffers in a memtable.
And then they'll [snorts] flush all of these small little files called level zero files to disk. They'll have all of these level one files that are a result of compactions and then finally maybe they'll get compacted to a level two file.
Now basically it happens to be the case that because Paimon kind of started as a project out of Flink itself that the support for doing a nested loop index join on Paimon is actually strongest using the Flink engine.
Now although myself and probably many of you associate Flink as being a stream processing framework, Flink is also kind of trivially usable as a batch processing framework as well because every single stream is really just a boundless batch. So anything that you can do with a stream in Flink is someone literally drag racing outside of my apartment right now? Chicago is wild, man.
Everything that you can do in Flink as a stream is also doable as a batch operation, including a lookup join on a Paimon table. So, this is something that they added. It looks like probably in 2024. I'm going to zoom in a little bit here.
But, basically the idea is you don't want to have to pull in all of your data into your Flink task manager. You want to use the natural LSM tree SS table structure of Paimon files in order to do efficient lookups from one incoming row to a row in the Paimon table. And that's exactly what we're looking for here.
This is basically, you know, continuing to destructure our Cassandra database so that we can just have files on object store, and then one central cluster that's able to manage all of our compute. And that's what Flink actually does here. It pulls in all the data from Iceberg.
It writes it all as Paimon files, and then now all of a sudden we basically have all of our data that we need on object store and a single engine for doing all of our compute.
So, basically what Paimon does is internally it is this LSM tree SS table structure with using probably parquet files as your SS file your SS table files, but just internally sorted on whatever you choose as the primary key.
And then another cool thing they do is within each partition of a Paimon table, they actually have this concept of bucketing as well. Now, if you look at this commit right here, you can see that this actually only works with fixed bucket tables. So, what that means is in Paimon, you can basically choose to you know, within every single partition have a dynamic number of buckets so that they can be sized more effectively, or you can use a fixed number. The fixed number of buckets is extremely similar to what I was doing in Cassandra. The reason being that I know partitions in an Iceberg table can get super large, and what I'd actually prefer is to kind of break them down a little bit more so that we can join to them very efficiently from our distributed query engine, right? It means there's more parallelism that we can use in Flink or Spark or anything like that to pull in from each of those buckets within a partition and just join on that.
So, it's pretty awesome. The one reason that I am using Flink here, in theory I could totally use Spark. I think that Spark's support for Paimon and doing these nested loop index joins is just not as good, unfortunately.
Like I just don't think this is a use case that is supported just yet in Spark. Hopefully, it will be soon because frankly I probably rather support Spark, but you know, I don't know that anyone really has an issue with Flink if they're using Paimon in the first place because the two of them are kind of so interleaved. But yeah, basically that is my update on this week. The code is actually already up on GitHub. I'm going to start advertising this thing more and then I'll continue to iterate as we go.
But ultimately, this is probably a better design than the Cassandra version, not necessarily because Cassandra is, you know, an unproven framework or I feel like it's going to fail or I feel like it's a bad design you know, like necessarily a flawed system, but it's more so that you know, this allows us to continue to separate compute and storage, which is what people are really looking for in these types of architectures, right? Like to someone who doesn't know anything about Paimon and is more familiar with Iceberg, all they have to know is they have more files on object store and they don't really have to care about how those are actually stored within some sort of database. There's not an extra system to maintain here. I'm not even using an external Paimon catalog. I'm using the object store itself as the catalog, which is something you can do in Paimon, and then I'm having Flink be responsible for performing all of the table maintenance on some sort of asynchronous schedule.
So, with that all in mind, you're basically introducing zero additional dependencies to your existing data stack. And now all of a sudden, you can have these really efficient deletes in Iceberg.
Well guys, I hope you enjoyed this video. Like I mentioned, I would be going quite in depth on the Chinese data stack. Paimon is out of Alibaba.
Frankly, when I look at with documentation, like half of it is not grammatically correct.
And so it makes it a little bit harder for you to actually understand what's going on here. But I'm starting to think I have a grasp of this thing. All the tests are looking really good when I run locally. The integration tests run properly as well. So I'm really excited about this and I'm looking forward to kind of sharing it with more people as we go.
Have a great [snorts] weekend and I'll see you all in the next one.
Related Videos
Agentforce NOW AMA: Build with React and Salesforce Multi-Framework
SalesforceDevs
490 views•2026-05-28
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
aiDotEngineer
450 views•2026-05-28
WEB TECHNOLOGIES UNIT-2 | Degree 4th sem BCOM Computers web technologies unit-2 full explanation💯✅
LearnwithSahera
1K views•2026-05-29
More tests are always better? How to use AI to identify tests that bring little value
Alliance4Qualification
335 views•2026-05-29
Search Algorithms Explained in 60 Seconds! 🤖💨
samarthtuliofficial
218 views•2026-06-01
People of Game of Thrones using JavaScript DOM
AltCampus
296 views•2026-05-30
Introduction to Problem Solving Part - 1 | Lecture 1 | Intermediate DSA
ascensionix
107 views•2026-05-29
🚀 BCS613C Compiler Design | Module 1 to 5 Schema Evaluation 🔥 | VTU 6th Sem 💯 #VTU #bcs613c #exam
Pranavaa-y4y
104 views•2026-06-02











