Sirius is a GPU-native SQL engine developed by the University of Wisconsin-Madison in partnership with NVIDIA. It accelerates analytical databases by using GPU parallelism for data processing. The system addresses key GPU database challenges, including limited GPU memory (approximately 10x smaller than CPU memory) and PCIe interconnect bottlenecks, by relying on modern hardware features such as HBM memory (up to 288 GB) and the NVLink C2C interconnect (1.8 TB/s bandwidth). Sirius implements a composable architecture (MICE: Modular, Interoperable, Composable, Extensible) that takes logical query plans from front-end databases like DuckDB and accelerates execution using NVIDIA CUDA-X libraries. The system treats data partitioning and spilling as first-class design concerns, managing data across GPU memory, CPU memory, and disk. Sirius has reached the top spot on ClickBench, and on TPC-H queries it demonstrates a 5x speedup over CPU execution and 9x better performance per dollar, with the primary bottleneck being GPU memory capacity rather than compute or bandwidth.
Deep Dive
Sirius: A GPU-Native SQL Engine (Xiangyao Yu)
It's the end of the semester, last talk.
We have Professor Xiangyao Yu from the University of Wisconsin here to talk about Sirius, the GPU-native database system he's been building for a while. Xiangyao is super well known in databases. He did his PhD officially at MIT, but I like to say that Xiangyao also did a second PhD with me at Carnegie Mellon in databases, because he and I published a lot. Then he did a postdoc with Mike Stonebraker, and now he's esteemed faculty at Wisconsin, hired by Jignesh Patel, who I then stole from Wisconsin. But that's another story. Xiangyao, thank you so much for being here. If you have any questions while he's giving this talk, please feel free to unmute yourself and ask at any time, so that he's not talking to himself for an hour.
And so, Xiangyao, good to see you again.
The floor is yours. Go for it.
Thank you so much, Andy.
And I'm so excited to be here to talk about the system we have been building.
Sirius is a GPU native SQL engine.
Okay. So, before I dive deep into the technical details, I want to give you a very quick overview of the system.
So, Sirius is a GPU-accelerated SQL engine.
Today it is mainly accelerating DuckDB.
And we officially announced at GTC last month that Sirius will become a core extension of DuckDB.
But Sirius can also accelerate several other database systems that we're going to talk about later.
As of today, Sirius has already hit number one on ClickBench. So, according to ClickBench, this is the fastest database in the world, on both the hot run and the combined run.
The figure is only showing the hot run. You can see Sirius actually takes the top two spots.
For those who don't know, ClickBench is an open benchmark mainly for analytical databases. It's hosted by ClickHouse, and it's pretty well known.
It has maybe 60-plus systems there, especially the fastest ones.
Sirius is developed today through a partnership between UW-Madison and Josh Patterson's team at NVIDIA.
There are probably 10 engineers at NVIDIA also contributing to the system.
We also work closely with DuckDB and Vast Data, who help tremendously with testing and contributing back to the system. Meanwhile, we're also working with several other companies who are actively building POCs together with us,
either accelerating their existing engines or exploring new use cases.
Okay. So, that's an extremely quick overview.
And here is the outline of the talk today.
I will first discuss why GPU is a great hardware platform for data analytics and why today is a good time to make this happen.
I will then talk about the architecture of Sirius and also some of the design principles behind the system.
I'm going to show you a demo of how Sirius works.
Then I will dive deeper into the technical details, like the key features of Sirius and some of the design details.
Finally, I'm going to show you the evaluation and our future plan.
Okay.
So, first, I want to tell you why we think the GPU is a great platform for data analytics. It's traditionally used mainly for AI, but it's also great for analytical databases. First, the GPU is no longer a pure accelerator. It's evolving into a general-purpose computing platform.
And more and more applications are benefiting from this piece of hardware.
If you think about it, data analytics is actually a natural fit for the GPU architecture. A GPU is great for massive parallelism, where you do a very simple but identical operation across thousands or tens of thousands of data records. And that's exactly what a database is about.
We have millions to billions of rows in a table and we want to apply the same operation to all of those rows. So, that's actually a perfect match for the GPU hardware architecture.
However, as you probably know, people have been studying GPU databases for more than a decade now.
But there have been some challenges in both hardware and software that stymie the wider adoption of GPU databases.
For example, the first challenge is that although the GPU is very fast, its native memory is usually very small. It's probably 10 times smaller than CPU memory.
That means if you have a large data set, you cannot fit the data into GPU memory.
The second challenge is that although the GPU delivers great performance when data fits in its memory, once the GPU starts to talk to the external world, like CPU memory or storage, the PCIe interconnect becomes a bottleneck between the GPU and other devices.
At least in traditional GPU systems. This scenario is improving today, but it used to be a major bottleneck of GPU databases.
And finally, writing a database from scratch is a lot of engineering effort, and writing GPU programs is also pretty challenging. So, if you want to build a GPU-native database, there's just a lot of engineering complexity you have to go through.
Okay. But today, we are seeing that GPU data analytics is actually at an inflection point.
The hardware and software trends are making GPU databases more and more viable at scale. For the first challenge, GPU memory capacity is small.
But you can see that HBM, which is GPU local memory, is getting larger and larger. It's growing exponentially in size.
The second hardware trend is that the interconnect between GPU and CPU is also improving substantially. Traditionally, PCIe is the bottleneck.
It's improving in speed, but not super quickly.
But there is this new interconnect technology at NVIDIA called NVLink.
And there is a specific version, NVLink C2C, that connects the GPU with the CPU, and the bandwidth of NVLink C2C is much higher than PCIe. I'm going to show you some numbers in the next few slides.
And finally, the engineering complexity is actually a software challenge. We are seeing two trends in software that make it much easier to build a GPU database.
The first trend is that nowadays more and more databases are adopting the composability principle. In order to build a new database, you don't have to rewrite every layer in the database.
For the different components, you can just draw from, for example, open-source libraries: the front end with SQL parsing, the query optimizer, maybe the execution engine,
and finally the storage layer. Each layer is composable. To build a GPU engine, you just need to replace the execution engine, and you don't have to rebuild the remaining components of your system.
The second software trend is that GPU libraries are getting more and more mature. Even when you build these SQL operators, you don't have to implement every operator from scratch. You can just leverage CUDA libraries, especially cuDF.
They implement a lot of operators already. So, you can just use these libraries to build your system.
So, just to show you some numbers to give you a sense of how fast GPUs are and why this is a great fit for databases. The GPU memory capacity has been increasing exponentially. The latest GPU already has 288 GB of memory,
which is pretty decent. A server with this amount of memory is not considered too small.
It's still smaller than high-end CPU servers, but this is a substantial improvement over the years, and it's going to get even better in the future.
And if you look at the bandwidth of this memory, it's insanely fast. The latest GPU model, Rubin, has 22 TB per second of memory bandwidth.
So you can read the entire 288 GB of memory almost 100 times in 1 second.
So, this is super fast and a big jump from the previous generation, which was just 2 years ago.
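As a quick back-of-the-envelope check on the figures quoted above, the number of full memory scans per second is simply bandwidth divided by capacity. The sketch below uses the 288 GB and 22 TB/s numbers from the talk and assumes decimal units (1 TB = 1000 GB); the function name is made up for illustration. It comes out at roughly 76 scans per second, the same order of magnitude as the "almost 100 times" mentioned by the speaker.

```python
# Figures quoted in the talk; decimal units (1 TB = 1000 GB) are an assumption.
HBM_CAPACITY_GB = 288          # GPU memory capacity
HBM_BANDWIDTH_TB_PER_S = 22    # GPU memory bandwidth


def full_scans_per_second(capacity_gb: float, bandwidth_tb_per_s: float) -> float:
    """How many times per second the whole memory could be read at peak bandwidth."""
    return bandwidth_tb_per_s * 1000 / capacity_gb


scans = full_scans_per_second(HBM_CAPACITY_GB, HBM_BANDWIDTH_TB_PER_S)
print(f"{scans:.1f} full scans per second")
```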
Okay. So, the second hardware trend is the faster interconnect. This is the bottleneck I was talking about earlier: traditionally the interconnect between CPU and GPU is PCIe, and PCIe is not fast enough. You can see the PCIe bandwidth at the bottom of the figure.
It's improving. It's doubling almost every two to three years.
But the fastest one is still maybe 200 to 300 GB per second, and that is not yet widely available, though it's probably going to be available very soon.
But if you look at the NVLink C2C bandwidth, that's an order of magnitude higher. The latest generation already reaches 1.8 TB per second. So, a GPU can access the host CPU memory at close to 2 TB per second.
That is even faster than how the CPU can access its own memory.
With this technology, you can basically say the bottleneck between CPU and GPU disappears.
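To make the interconnect comparison concrete, here is a small sketch that estimates the ideal time to move 288 GB (one full GPU memory worth of data) over each link. The bandwidth values are the ones quoted in the talk; the helper name, the choice of 288 GB as the payload, and the decision to ignore latency and protocol overhead are assumptions for illustration.

```python
# Bandwidths in GB/s as quoted in the talk; decimal units are an assumption.
LINK_BANDWIDTH_GB_PER_S = {
    "PCIe, low end of quoted range": 200,
    "PCIe, high end of quoted range": 300,
    "NVLink C2C": 1800,
}


def transfer_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move data_gb over a link, ignoring latency and overhead."""
    return data_gb / bandwidth_gb_per_s


for name, bandwidth in LINK_BANDWIDTH_GB_PER_S.items():
    print(f"{name}: {transfer_seconds(288, bandwidth):.2f} s to move 288 GB")
```

Under these assumptions the same payload takes on the order of a second over the fastest PCIe quoted and a fraction of that over NVLink C2C.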
And the same also happens for the inter-GPU interconnect. If you use NVLink to connect GPUs, the latest generation gives you a bandwidth of 3.6 TB per second. This is also much faster than how a traditional CPU can access its local memory, or than the performance of most widely available CPU networks. This means that if you have multiple GPUs, they can talk to each other at this super fast speed, and that makes it much easier to build a GPU database at scale. It's not just about a single node. You can also build a cluster that delivers great performance.
Okay. I also want to say a few words about the cost, because this is one big pushback that we hear from a lot of people.
GPUs are always very expensive,
arguably more expensive than CPUs.
And sometimes they're also in short supply.
So, even if you want to pay, you cannot even get the hardware, because a lot of big players want to get the hardware as well.
I want to make two points about this. The first one is that it's more expensive, but if the acceleration is significant enough, then on GPU you can actually get better performance per dollar.
It's more expensive, but it's also faster. So, in the end, it may actually be better in terms of performance per dollar.
And this is already true for AI, right?
That's why people use GPUs for AI.
It's actually cheaper than running things on CPU.
And it's generally true when the workload fits the hardware.
And as I said earlier, data analytics is a workload that fits very well onto GPU hardware.
I'm going to show you some numbers later when I show the evaluation. We do have some experiments on the performance-per-dollar comparison between Sirius and DuckDB.
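The performance-per-dollar argument can be written down in one line: if the GPU system is `speedup` times faster, it completes `speedup` times more work per hour, so its performance per dollar relative to the CPU system is the speedup multiplied by the CPU-to-GPU price ratio. The sketch below is only an illustration of that arithmetic; the function name and the hourly prices in the example are hypothetical and are not figures from the talk.

```python
def perf_per_dollar_ratio(speedup: float,
                          gpu_cost_per_hour: float,
                          cpu_cost_per_hour: float) -> float:
    """Performance per dollar of the GPU system relative to the CPU system.

    A value above 1.0 means the GPU system does more work per dollar.
    """
    if gpu_cost_per_hour <= 0 or cpu_cost_per_hour <= 0:
        raise ValueError("hourly costs must be positive")
    return speedup * cpu_cost_per_hour / gpu_cost_per_hour


# Hypothetical prices: the GPU machine costs twice as much per hour.
ratio = perf_per_dollar_ratio(speedup=5.0,
                              gpu_cost_per_hour=40.0,
                              cpu_cost_per_hour=20.0)
print(f"{ratio:.1f}x performance per dollar")
```

The break-even point is where the speedup equals the price ratio: a GPU machine that costs twice as much needs more than a 2x speedup to win on performance per dollar.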
Okay. The second point I want to make is that although for AI training people want to use the latest generation of GPU hardware, for data analytics we don't have to use the latest generation. It's actually good enough if we use some older generations of GPU.
And those GPUs are much more widely available at a reduced cost, because the AI people will pursue the better GPU. After a new generation is released, they go for the new generation, and a lot of the older-generation GPUs will be underutilized. Data analytics is actually a perfect application that can leverage those older GPUs.
Okay. Next I'm going to show the Sirius architecture and design principles, but if there are any questions about the hardware trends, please feel free to ask.
Okay. On the left, you can see the high-level architecture of Sirius.
You can see we are taking a very composable design, and that's actually our first design principle. We call it MICE.
It's modular, interoperable, composable, and extensible.
On the left, at the very top, are the systems we are trying to support. We can support DuckDB today, and we also have some support for Doris.
We are going to support StarRocks, and this whole system design also allows us to support additional systems in the future.
Sirius does not modify the front end of your database. The database will take a SQL query, parse it, and run its local query optimization to produce an optimized plan.
It's a logical plan, and Sirius will take the logical plan from these front-end systems. For DuckDB, it will just directly take the logical plan and accelerate it. For Doris, StarRocks, and other systems, today we translate the plan to Substrait, which is an open-source query plan format,
and then we feed that into Sirius.
The rest of the execution will all be taken care of by Sirius.
It fetches the data from storage for local execution, and it leverages NVIDIA libraries for operator implementation, memory management, networking, and buffer management.
So, I see there's a question.
Matt, just unmute yourself. Go for it.
Do you have any way to cost these differently? Different join implementations, when you run them on a GPU, could have different costs than what these other systems' front ends might have accounted for. So, do you have a way to do that?
That's a great question. I think the question is: because GPU operators have a different cost model, should we take that into account when we build a query optimizer?
You can definitely build a GPU-native query optimizer. We are not doing that yet because that's a lot of effort. We just take the current optimized plan, which is probably tuned for CPU, and we accelerate that, and that already gives us a substantial speedup.
If you make it GPU native, I'm sure you can see an even bigger speedup,
of course, at the cost of more engineering effort.
At least in DuckDB, I think it's totally doable. DuckDB has this extension system. You can extend the system in many different ways.
For example, you can have your own query optimizer, or you don't even have to rewrite the entire query optimizer. DuckDB will optimize some steps, and you take the optimized plan and do some further optimizations. I think that allows you to build, say, a hardware-native or hardware-aware query optimizer.
We are not doing it yet, but this should be doable.
Sure. Thanks.
Yeah. So, this design, this architecture, allows us to do drop-in acceleration with minimal modification to the host system. As you will see in the demo, when you use DuckDB versus DuckDB accelerated by Sirius, the user interface is almost identical.
Users don't even need to know that this is actually GPU accelerated.
It's built on top of NVIDIA CUDA-X libraries, which saves a lot of engineering effort,
because these libraries are highly optimized for GPU.
Furthermore, we also designed Sirius so that it's very extensible. For example, if you're not happy with a libcudf operator, say you want a better join, it's actually pretty easy to implement that new join, hook it into Sirius, and test performance. In fact, we are already doing that for certain operators. When we need some special feature, or we need performance for a specific workload pattern where the existing libraries are not good enough, we are already replacing or adding new operators.
This makes it very easy, especially for researchers. If you want to explore different operator optimizations, you can just try them on Sirius as a platform.
So, Subesh asks: it seems like Sirius is the only execution engine and it will do everything. Is that the case, or are you letting DuckDB, or whatever system, offload sub-plans to it when those upper-level systems decide that some portion of the query plan could benefit from using GPUs? I guess, is it all or nothing?
That's a great question.
Today it's all or nothing, and that's a conscious decision we made.
Let me go through this first and then I'll come back to the question.
Sirius today is a GPU-native execution engine. That means we will try to run the entire query on GPU. We will not try to break it into pieces and have part of it run on CPU and part of it run on GPU. We just try to run everything on GPU.
And if it doesn't work, maybe because of something we don't support, or maybe for whatever reason we have a bug and it crashes, then we'll fall back to DuckDB.
So it's all or nothing. If it runs on GPU it will run there; otherwise it will run completely on CPU, back in DuckDB mode.
We made this decision because in the early days, while I was writing papers on GPU databases, we actually explored this idea of a hybrid execution engine, where we decide which hardware to use for a particular operator in your query plan. And we actually published papers on that.
But it turns out that substantially increases the design complexity, because you constantly have to worry about where to run what. And because any combination is possible, your communication policy must be bug free across all the possible combinations of these operators.
It turns out to be pretty complicated.
In the end we realized a lot of the operators are great for GPU acceleration, even regex. It may sound counterintuitive because regex is supposed to be very slow on GPU, but if you do code generation, a regex can be pretty fast. We saw this in the ClickBench numbers.
So even those things can really be accelerated on GPU. I don't know about user-defined functions, so that's maybe a part which is a little less easy, but if you can do code gen, I think even there the GPU can be pretty fast. After that we decided, okay, by moving the entire query to GPU you substantially simplify the design.
You can get better performance.
And you also get very good compatibility, because it doesn't matter which database you're accelerating: you just run the entire query on GPU. In contrast, take a design that we call retrofitting a CPU solution. You start with a CPU engine and you ask which part of the query plan you can accelerate on GPU. In that approach, which a lot of systems actually take, the design is more complex.
Sometimes you cannot get all the performance because you have to be backward compatible with the CPU engine, and the CPU engine was not designed for GPU, so there may be bottlenecks that are difficult to address.
And finally there's a compatibility issue: if you're accelerating an existing CPU engine, it's harder to then accelerate a different CPU engine.
But in our case the interface is very clean. Just give me the logical plan.
In some sense it doesn't really matter which CPU engine you're accelerating, because the query plans should look very similar, and we just run that, all or nothing.
Okay, so I can move on.
This slide shows you the life cycle of a query. It's very simple. In DuckDB, you send a query to DuckDB, the front-end optimizer produces a logical plan, and that gets executed by DuckDB.
In Sirius it's the same optimizer and the same logical plan, but that gets executed by Sirius.
And whenever the query cannot run in Sirius, we just fall back to DuckDB.
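The all-or-nothing life cycle described here can be sketched in a few lines. This is a minimal illustration, not Sirius or DuckDB code: `execute_with_fallback`, `gpu_execute`, `cpu_execute`, and the stub engines are hypothetical names, and the plan is just a string standing in for a logical plan.

```python
from typing import Any, Callable, Tuple


def execute_with_fallback(plan: Any,
                          gpu_execute: Callable[[Any], Any],
                          cpu_execute: Callable[[Any], Any]) -> Tuple[Any, str]:
    """Run the whole plan on the GPU engine; on any failure, rerun it on CPU.

    The plan is never split between engines: it is all or nothing.
    """
    try:
        return gpu_execute(plan), "gpu"
    except Exception:
        # Unsupported feature or a bug in the GPU path: fall back completely.
        return cpu_execute(plan), "cpu"


# Hypothetical stub engines for illustration only.
def stub_gpu_engine(plan: str) -> str:
    if "unsupported" in plan:
        raise NotImplementedError("operator not available on GPU")
    return f"gpu-result({plan})"


def stub_cpu_engine(plan: str) -> str:
    return f"cpu-result({plan})"


print(execute_with_fallback("scan->filter->agg", stub_gpu_engine, stub_cpu_engine))
print(execute_with_fallback("scan->unsupported_udf", stub_gpu_engine, stub_cpu_engine))
```

Because both engines consume the same logical plan, the caller does not need to know which one produced the result.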
Okay.
So now let me show you a demo. Here I'm only running one query, TPC-H query 9 with scale factor 1000. So the total amount of data is 1 terabyte.
I'm going to show you a 1-terabyte run for all the TPC-H queries later, but for the demo it's one query.
The data is stored in Parquet format in this experiment.
And we're comparing two instances. The CPU instance is a pretty big one in AWS.
This instance has 384 cores, almost a terabyte of memory, and costs about $21 per hour. It's a pretty good one.
For the GPU instance we're running on a DGX Station, which has a single GB300 GPU. It's also a pretty good one.
Okay, so here is the demo.
Before I start to run it, note that this is a recorded video.
On the left-hand side is DuckDB.
On the right-hand side is Sirius. In this demo, in order to run Sirius you need to wrap your query with this function. You need to say call GPU execution and then you put your query there.
So the left and right do not look exactly the same.
But this is already fixed in the latest code base. This demo was recorded a month ago.
In the latest code you don't need to do this. You just have a configuration flag, and you just say, I forgot the exact term, something like GPU acceleration on.
After that you can just write the query just like what's on the left,
and it will just work. That's much more convenient.
Okay, so let's start to run this query.
Okay, we start on the left, then we start on the right.
Okay, so after maybe 2.5 seconds the query has already finished in Sirius.
And it's still running on the left.
Okay, now it's also done.
It took 16 seconds in total.
And you can see the results are identical. It's actually the same format because, as I said, Sirius is only in charge of execution. After execution the results are returned to DuckDB, so DuckDB will format them the same way and present them to the user.
So for this particular query it's about six or seven times faster.
I just want to understand the setup. Is it that the Parquet files are sitting on some local SSD, and for DuckDB on the CPU, DuckDB is going to read them in through the file system, crunch on them, and spit out the result? And in this case, for Sirius, the GPU is going to suck the data in from the SSD over NVLink, then crunch on it and spit out the result?
Good question. This is actually a hot run. Okay, okay.
You can think of it this way: we run it twice and we only report the second run on both instances, so the data is already cached in memory.
Got it. Okay. And for a cold run, what do the numbers look like?
When we recorded this demo last month, the cold run was not very fast, because back then whenever we read data from SSD we actually let DuckDB do it.
Okay.
DuckDB read the data from SSD and then we converted the intermediate format for Sirius, so it was a little bit slower, actually maybe a few times slower.
But we just fixed that in a PR last week, I think. Everything is moving pretty fast. The way we fixed it is that now the GPU can directly read DuckDB data files. We just ask DuckDB for the metadata, like the catalog: where is the data?
These are the files you should read, and the GPU will directly go read that data. So we bypass the DuckDB internals, and that gives us an almost 10 times speedup for reading cold data. So now Sirius directly reading cold data is also faster than DuckDB. That will get merged pretty soon, I think.
Yeah.
Nice. Okay. Thanks for asking that, Andy, because I was going to say that should be almost 90 gigs of lineitem data coming in, and I was confused how that was happening. So it makes sense that it was a hot run.
Yeah.
Exactly. So a quick follow-up question on that. The entire database probably doesn't fit in the GPU memory, right? So is there probably a lot of predicate pushdown happening in the Parquet layer? How much GPU memory is there, and how much of the database fits in that memory?
Great question. First, I don't know the exact number. We cache the data because we run the query twice, and the first run does the caching. But this query does not touch every piece of data in the database anyway.
It is the case that for query 9 the data does not fit in GPU memory. So we cache some of the data in GPU memory and some of the data in CPU pinned memory.
As I said, the GPU talking to CPU memory is also pretty fast. So when we execute, we pull data from both GPU memory and CPU pinned memory.
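One simple way to picture caching across the two memory tiers is a greedy placement: keep filling GPU memory until the budget runs out, and put everything else in pinned host memory. The sketch below is only an illustration under that assumption; the talk does not describe the actual placement policy, and the function name, column names, and sizes are hypothetical.

```python
from typing import Dict


def place_columns(column_sizes_gb: Dict[str, float],
                  gpu_budget_gb: float) -> Dict[str, str]:
    """Greedily assign each cached column to GPU memory or pinned host memory."""
    placement = {}
    remaining = gpu_budget_gb
    for name, size in column_sizes_gb.items():
        if size <= remaining:
            placement[name] = "gpu"
            remaining -= size
        else:
            # Does not fit in what is left of the GPU budget.
            placement[name] = "pinned_host"
    return placement


# Hypothetical column sizes, for illustration only.
columns = {"l_orderkey": 48.0, "l_partkey": 48.0, "l_extendedprice": 48.0,
           "ps_supplycost": 6.0, "l_comment": 200.0}
placement = place_columns(columns, gpu_budget_gb=160.0)
for name, tier in placement.items():
    print(f"{name}: {tier}")
```

At execution time an engine organized this way would read from both tiers, which is why the speed of the GPU-to-CPU link matters.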
Got it. Okay.
Cool.
Okay, so that's the demo.
Let me get a little bit deeper into the technical details.
Sirius has many key features.
I'll just go through these quickly.
As I said, when you use Sirius there's no need to change the host system.
In DuckDB's case, users hopefully will not see any difference when Sirius is on or off.
When the GPU cannot run a query, we fall back to CPU execution.
We also have data spilling support. That's actually very important for a GPU database because the memory is very small.
So spilling is actually the norm. In traditional CPU databases, some databases would probably treat this as an afterthought.
But in a GPU database this is actually a first-class design principle.
There is an NVIDIA library called cuCascade, which is developed and maintained by the same team that contributes to Sirius.
We use cuCascade to manage the memory. You can think of cuCascade as a GPU-native buffer manager.
It's not based on pages, though. I will talk about some details, but it is kind of like a buffer manager for GPU.
We can read Parquet files with a GPU-native Parquet reader.
Sirius is extensible with new database operators. It's very easy to try out new implementations on Sirius.
We are working on supporting multiple GPUs. That has two parts. One is a single node with multiple GPUs,
like a single CPU connected to maybe four or eight GPUs.
Another thing on our roadmap is to support a distributed system.
That means multiple nodes, where each node has its own single GPU or multiple GPUs, and we can also accelerate that.
Okay, so technically, one question, maybe more of a research question, is: when we build a GPU database, how is that different from building a CPU database?
Over the decades we have learned very well how a CPU database should be built. We have a page-based buffer manager and an execution engine, and we know how it works.
So what's the key difference, or the key new challenge, other than the programming language? I can think of the following two key observations that make building a GPU-native SQL engine very different. The first one is that page-based buffer management is not a great fit for GPU.
It was a perfect fit for CPU.
A lot of CPU databases, maybe all of them, were fundamentally built on top of paging.
The page is the unit of accessing disk, for example, and it's the unit of managing data.
But on GPU that's not a great fit. On GPU, when you process a lot of data, you usually just allocate a large contiguous memory region and process it, and that region can be gigabytes or even 10 gigabytes.
A giant hash table is not broken down into pages. And actually the libraries that we're using, like cuDF, don't have paging built in. Every time you do something, it's just one chunk of memory and they operate directly on that.
So that makes the buffer management a little different.
The second observation is that data partitioning and spilling must be first-class design concerns now.
Compared to a CPU database, the GPU memory is much smaller. It's probably 10 times smaller.
And as I said, GPU memory is so fast. Even though the GPU accessing CPU memory is also fast, when data fits in GPU memory it's like 10 times faster, like 22 terabytes per second. So you really want to leverage that high bandwidth. You want to be very careful about what data stays in GPU memory and what data should be spilled to CPU or disk. You have to be very smart in making that decision.
So that memory pressure is more significant in the GPU context.
Data partitioning and spilling cannot be an afterthought.
And it cannot be at the operator level, like: for hash join, I will implement spilling and partitioning.
For GPU it has to be built into the execution model. It's a first-class citizen. For us it's almost the first thing you do.
For any operator you want to run, you have to do data partitioning first,
because you want the data to fit in memory, and you don't want to consume a lot of memory because of multi-tenancy. A single query is not supposed to take over the entire GPU. So you have to partition data and you have to spill pretty frequently.
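The "partition first" rule can be illustrated with two small helpers: one that derives the number of partitions from a per-task memory budget, and one that hash-partitions rows on a key so that each partition can be processed independently. This is a plain-Python sketch under assumed names (`num_partitions`, `hash_partition`, the budget values); it is not how Sirius or libcudf implement partitioning.

```python
import math
from typing import Dict, List


def num_partitions(input_bytes: int, per_task_budget_bytes: int) -> int:
    """Smallest partition count such that each partition fits the budget on average."""
    return max(1, math.ceil(input_bytes / per_task_budget_bytes))


def hash_partition(rows: List[Dict], key: str, n: int) -> List[List[Dict]]:
    """Route each row to a partition by hashing its key, so equal keys co-locate."""
    parts: List[List[Dict]] = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


GB = 10**9
n = num_partitions(input_bytes=10 * GB, per_task_budget_bytes=2 * GB)
rows = [{"orderkey": k, "price": 1.0} for k in range(20)]
parts = hash_partition(rows, "orderkey", n)
print(n, [len(p) for p in parts])
```

Keeping the budget per task well below the total GPU memory is what leaves room for other queries in a multi-tenant setting.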
>> [snorts] >> Yeah, so with that I I can um go through uh the design architecture in more details.
Um Uh so, the front end is DuckDB.
But I use DuckDB as an example.
Uh so, it does the parsing optimizing optimization and it will produce a logical plan.
So, the first thing Serpens will do is take the logical plan and do some conversion. Right? For example, adding partitioning uh and adding some spilling uh not not spilling but partitioning to the the query plan and there's some other modifications.
And then we have a task creator. Basically every node in the query plan will be translated into multiple tasks. Each task will process one partition of the data, and a partition is maybe one or two gigabytes.
You just spawn a lot of tasks and they run in parallel.
Each task can be picked up by a thread.
Well, that's a CPU thread. But the task will be executed by the executors.
So one task translates to one executor, and we use libcudf to implement these executors. You don't have to do everything from scratch.
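As a rough sketch of the task model described here: one plan node becomes one task per partition, and CPU threads pick the tasks up. Plain Python lists stand in for GPU tables and a Python function stands in for a libcudf-backed executor; `Task`, `create_tasks`, and `run_tasks` are hypothetical names, not the Sirius API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List

Batch = List[int]  # stand-in for a GPU-resident table partition


@dataclass
class Task:
    operator: Callable[[Batch], Batch]  # stand-in for an executor
    partition: Batch


def create_tasks(operator: Callable[[Batch], Batch], partitions: List[Batch]) -> List[Task]:
    """Translate one plan node into one task per input partition."""
    return [Task(operator, part) for part in partitions]


def run_tasks(tasks: List[Task], num_threads: int = 4) -> List[Batch]:
    """CPU threads pick up tasks; results come back in task order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda t: t.operator(t.partition), tasks))


def keep_even(batch: Batch) -> Batch:
    return [v for v in batch if v % 2 == 0]


partitions = [[1, 2, 3], [4, 5, 6], [7, 8]]
results = run_tasks(create_tasks(keep_even, partitions))
```

In the real system the thread would only launch GPU work; here the thread does the filtering itself, which is a simplification.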
And the intermediate results, the input and output of the executors, those are the buffer management part of the system. They are stored in the data repository, which is managed by cuCascade.
The data repository also takes care of downgrade. What we call downgrade is just spilling data from GPU memory to CPU memory and to disk.
Yeah, so this is the super high-level architecture.
It's probably also similar to how a CPU database works.
In terms of the plan conversion, we take a DuckDB plan.
DuckDB will translate the logical plan into pipelines.
Sirius will do something similar, but it will add a lot of these partitioning operators, and we always have a pair of partition and concatenation. Partition will split the data, and concatenation will optionally merge the output partitions. Maybe after partitioning you do the filtering, and the output of each partition becomes really, really small.
Then we do the concatenation to merge them back into data chunks of a reasonable size. We call them data batches. So we just have a lot of these partition and concatenation operators.
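The partition and concatenation pair can be sketched as two small functions: split batches that are too big, merge batches that are too small, and pass right-sized batches through untouched. This is an illustrative sketch on Python lists; `min_rows` and `max_rows` are assumed thresholds, and the minimum is treated as a soft target (a merged batch can exceed `max_rows` by up to `min_rows - 1` rows).

```python
from typing import List

Batch = List[int]


def partition(batch: Batch, max_rows: int) -> List[Batch]:
    """Split an oversized batch into pieces of at most max_rows rows."""
    return [batch[i:i + max_rows] for i in range(0, len(batch), max_rows)]


def concatenate(batches: List[Batch], min_rows: int) -> List[Batch]:
    """Merge undersized batches until each output has at least min_rows rows."""
    out: List[Batch] = []
    pending: Batch = []
    for batch in batches:
        if not pending and len(batch) >= min_rows:
            out.append(batch)  # already a reasonable size: skip the merge
            continue
        pending.extend(batch)
        if len(pending) >= min_rows:
            out.append(pending)
            pending = []
    if pending:
        out.append(pending)  # leftover tail, may be smaller than min_rows
    return out


def rebatch(batches: List[Batch], min_rows: int, max_rows: int) -> List[Batch]:
    pieces = [piece for b in batches for piece in partition(b, max_rows)]
    return concatenate(pieces, min_rows)


rebatched = rebatch([[1, 2, 3, 4, 5, 6, 7], [8], [9], [10]], min_rows=3, max_rows=4)
```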
Okay.
>> In your overview diagram you mentioned you had support for Substrait.
>> Uh, yes.
>> Can you talk a little bit about the pros and cons of what you've dealt with? I'm assuming the workflow of converting a Substrait plan is the same as for DuckDB?
>> Yes, conceptually it's the same. For DuckDB, we originally tried to use Substrait as well. That would mean Sirius has just one interface.
There were some performance issues, and at some point DuckDB was not fully supporting Substrait.
So for performance reasons and for ease of integration we just use DuckDB's own logical plan.
If you take Substrait, like when we integrate with Doris, it's very similar. The logical plan just looks different, and then you do the conversion to the Sirius internal format and the rest is the same.
>> So the integration issues, the engineering challenges, I totally buy that. But you mentioned performance issues. That's surprising. So you're saying the Substrait plan was less optimal than the DuckDB logical plan?
>> I don't recall this precisely, but I think at some point not all DuckDB plans could be translated into Substrait.
I might be wrong on this.
And one more thing we did notice, and never figured out, is that whenever you use Substrait in DuckDB there is always this performance overhead, like a few milliseconds.
>> Got it. Okay. Very small.
>> And it's consistent. It's small, but for us it's big. A few milliseconds actually hurts a lot.
>> All right. Thanks.
>> Right. So this is the plan conversion.
Next, spilling with cuCascade. As I said, the output of a pipeline is always stored in, or managed by, the data repository.
For example, the output of a scan is in the data repository, which is already stored in GPU memory, and then it immediately gets fed into the next operator. So this can be pretty lightweight.
The data repository just contains all of these data batches. A data batch is really the output of one task, which is basically one partition of the output table. Each task processes one partition of the input data.
These data batches can have different formats, because, as I said, a data batch may be spilled from GPU memory to CPU memory and later to disk, and it can use a different format in each place. If it's already in GPU memory, then usually we just use the cuDF table format, because that's what our operator executors recognize, so you don't need any further translation.
But if the data is spilled to CPU memory, then we want to do some conversion. We want to convert it into fixed-size allocations. You can think of them as large pages.
That makes it easier for the CPU side to manage memory, because everything is a fixed size.
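To make the "fixed-size allocations" point concrete, here is a minimal sketch of a host-side page pool. It assumes a spilled batch can be treated as a flat byte buffer; the `PagePool` class and its tiny page size are hypothetical and only show why fixed-size pages simplify CPU memory management (no fragmentation, a simple free list).

```python
from typing import List, Tuple

Page = Tuple[bytearray, int]  # (fixed-size page, number of bytes actually used)


class PagePool:
    """Hypothetical host-memory pool made of equally sized pages."""

    def __init__(self, page_size: int, num_pages: int) -> None:
        self.page_size = page_size
        self.free: List[bytearray] = [bytearray(page_size) for _ in range(num_pages)]

    def spill(self, payload: bytes) -> List[Page]:
        """Copy a variable-sized buffer into as many fixed-size pages as needed."""
        needed = -(-len(payload) // self.page_size)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("not enough free pages; a real system would go to disk")
        pages: List[Page] = []
        for i in range(needed):
            page = self.free.pop()
            chunk = payload[i * self.page_size:(i + 1) * self.page_size]
            page[:len(chunk)] = chunk
            pages.append((page, len(chunk)))
        return pages

    def restore(self, pages: List[Page]) -> bytes:
        """Reassemble the buffer and return its pages to the free list."""
        data = b"".join(bytes(page[:used]) for page, used in pages)
        self.free.extend(page for page, _ in pages)
        return data


pool = PagePool(page_size=4, num_pages=3)
spilled_pages = pool.spill(b"abcdefghij")
free_while_spilled = len(pool.free)
restored = pool.restore(spilled_pages)
```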
And the design of the data repository also allows us to use custom data representations in the future. Maybe you want to do some compression. You can do compression even for the cached data on the GPU, because that will allow you to put more data in GPU memory, or you can do compression when you spill the data to CPU or SSD. So it's all customizable, and we can explore that in the future.
And the downgrader will periodically look at all the data batches in the data repository and use different heuristics to decide which data batches should be downgraded and which should stay in GPU memory.
>> So you don't control any of this. This is all CUDA. Correct?
>> Good question. All the control decisions actually happen on the CPU.
But all the data movement and the processing happen on the GPU. So that's all CUDA.
But some of the decisions, like in the downgrader, which data batch should be spilled, that logic actually runs on the CPU.
>> And then you provide that logic, or CUDA's doing this for you?
>> We provide that logic.
>> Got it. Okay.
>> Yeah, that's another extensible component.
Right now we have very simple heuristics to decide which data batch to spill, but you can design a pretty sophisticated policy to decide what to downgrade. And that's all controlled by Sirius. We are not leveraging CUDA to do that today.
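The talk does not say which heuristic the downgrader uses, so the following is only one plausible example of a "very simple heuristic": spill the least recently used batches until the GPU-resident total fits a budget. `DataBatch`, `downgrade`, and the tier names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataBatch:
    batch_id: int
    size_bytes: int
    last_used: int      # logical timestamp of the last access
    tier: str = "gpu"   # "gpu", "cpu" or "disk"


def downgrade(batches: List[DataBatch], gpu_budget: int) -> List[int]:
    """Move least-recently-used GPU batches to CPU memory until the budget is met.

    This control logic runs on the CPU; only the data movement it triggers
    would involve the GPU.
    """
    resident = sorted((b for b in batches if b.tier == "gpu"), key=lambda b: b.last_used)
    used = sum(b.size_bytes for b in resident)
    spilled: List[int] = []
    for batch in resident:
        if used <= gpu_budget:
            break
        batch.tier = "cpu"
        used -= batch.size_bytes
        spilled.append(batch.batch_id)
    return spilled


repo = [DataBatch(1, 40, last_used=5), DataBatch(2, 40, last_used=1), DataBatch(3, 40, last_used=3)]
spilled_ids = downgrade(repo, gpu_budget=50)
```

A more sophisticated policy could, for example, look at which batches the next scheduled tasks will read, but that goes beyond what the talk describes.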
Okay. Okay. Thanks.
Yeah.
Okay.
So Sirius can also work in a distributed environment. Earlier I showed only the single node, but we also built a prototype with Doris, which is a modern, composable, distributed database.
It's very similar to how it connects with DuckDB. The query comes to the Doris coordinator. On the left is how Doris works: the query plan is generated, then fragmented, and then sent to all the backend servers.
Each server receives its fragment and does local execution.
It also does data exchange.
So that's how Doris works. When we change it to the Sirius engine, the front end is the same. It's still the Doris query parser and query optimizer, and that generates a query plan.
And we just send the query plan to Sirius. It's also distributed, so each node gets its own query plan.
And Sirius takes care of two things. One is that it handles the data exchange between different nodes.
We don't want to use the traditional data exchange because that may not be fast enough. So to maximize performance, Sirius handles data exchange using the Nvidia library called NCCL.
And the Sirius backend also translates the query plan to Substrait.
So it's still a Doris query plan, but we translate it to Substrait and then just feed the Substrait plan into the Sirius engine.
So in this architecture, ideally, and I think that's also what's happening today, you don't need to make any code change to Doris.
Zero changes to Doris. I just change the configuration to tell Doris to send the logical plan to Sirius instead of its native engine.
You don't need to make major code changes inside Doris.
And you also don't need to make major code changes to Sirius, I mean single-node Sirius.
The only things we need to add are the Substrait translation and the data exchange.
So that makes the design very modular.
Okay, so let me show you some evaluation.
This is Sirius running on TPC-H at 1 terabyte.
The hardware is a DGX Station with GB300.
The data is stored in Parquet format, but we're only reporting the hot run.
That means running the second time, when the data is cached in GPU memory, and if GPU memory is not big enough we cache it in CPU memory.
And we are running DuckDB and Sirius on the same platform. So DuckDB is also running on the GB300.
But the GB300 actually has two components: there is the Grace CPU and there's the Blackwell GPU. So we're running DuckDB on the GB300's CPU.
That's actually a very good CPU. The memory bandwidth is higher than a typical CPU server.
On average we see a five times speedup.
And if you look at the sum on the right side, Sirius can finish the entire run in 21.6 seconds.
If you're familiar with TPC-H numbers, the world record right now is actually more than this. So this is already faster than the world record.
Of course, this is not an officially audited run. Nothing is official here, but at least this shows how fast it can be. It's really, really promising.
It's faster than the world record, and the hardware here is also cheaper than the hardware that delivers the world record results.
Okay, so in this experiment we are sweeping the scale factor, the data size, all the way from 10 gigabytes to 1 terabyte, on the same hardware, also Parquet format, also hot run, and we're showing the speedup over DuckDB at these different data sizes.
And by the way, I forgot to mention, on large data sets spilling happens all the time.
The data just doesn't fit, so we keep spilling data back to CPU memory. That's all included in this figure. In general we see the speedup actually increases as you have larger and larger data, which is pretty interesting. It not only gives you a speedup, it gives you more speedup as you have more and more data.
At scale factor 1000 there's still something we can optimize, but it's pretty close to the peak.
I see there are some questions.
>> Right. So Chris asked: are the queries run sequentially?
And then the next question is: how does Sirius handle data skew?
>> Yes, the experiment runs sequentially, so one query at a time. We are working on multi-tenancy, though. The architecture allows multi-tenancy. It just needs time to get it stable.
The second question is...
>> Maybe for the sequential queries, say the ones that take a long time, like Q21, what is the utilization of the GPU? Are you blasting the query out on all possible warps and threads so that it's running 100% utilized, or is a large chunk of it still idle because it's just so fast?
>> I don't have that data for individual queries, but I guess the GPU utilization should be pretty high.
Partly because the data is very big, and also if you don't fully utilize the GPU you are not going to get this performance.
>> But I mean, it's pretty high. At some point you have to coalesce things, right? The concatenation piece, you know, there are certain pipeline breakers that have to run single-threaded. But I think the next question is: how does Sirius handle data skew?
>> So we do use similar techniques to DuckDB, like min-max filtering.
I don't know whether that's what you mean by skew. Min-max filtering will actually prune a lot of data, so we don't need to process everything.
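Min-max filtering is a general technique: keep the minimum and maximum of a column per data batch, and skip any batch whose range cannot overlap the predicate. A minimal sketch follows, with a hypothetical `BatchStats` record and a range predicate; it does not reflect how Sirius or DuckDB store their statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchStats:
    batch_id: int
    min_value: int
    max_value: int


def prune_batches(stats: List[BatchStats], lo: int, hi: int) -> List[int]:
    """Return ids of batches that may contain values in [lo, hi].

    A batch is skipped only when its [min, max] range cannot overlap the
    predicate range, so pruning never drops qualifying rows.
    """
    return [s.batch_id for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [BatchStats(0, 1, 10), BatchStats(1, 11, 20), BatchStats(2, 21, 30)]
to_scan = prune_batches(stats, lo=12, hi=18)  # only batch 1 needs to be read
```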
And there may also be skew in terms of duplicate values.
You join two tables but there are actually a lot of duplicates, so when you do the join you suddenly get a quadratic number of rows.
We also have a solution for that, though I don't know whether that's the question. You partition the data into smaller pieces and then you concatenate the outputs together. It becomes a huge output, but it can work.
Okay.
Yeah, so the next experiment is cost normalized, because people complain about the cost. GPUs can be very expensive. Well, not all GPUs are super expensive. In this experiment it's not 1 terabyte anymore, it's 300 gigabytes.
You can decide for yourself whether it's a fair comparison.
The CPU instance is from AWS. It's about $2 per hour.
The GPU instance is from Lambda Labs.
Okay, it's a different cloud.
It's a GH200, which is also $2 per hour.
Same data format, warm run.
And here we show that on normalized performance, performance per dollar, Sirius is actually nine times better.
Of course, these two pieces of hardware are on two different clouds. GH200 is not the latest generation, but it's pretty cheap.
But this is the number we got on these two pieces of hardware.
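For readers who want the arithmetic behind "performance per dollar": if performance is taken as the inverse of the cost of one run (runtime times hourly price), then the relative figure is the ratio of the two run costs. The sketch below uses that assumed definition. The runtimes of 90 and 10 seconds are made-up placeholders, not measurements from the talk; only the roughly $2 per hour prices come from it.

```python
def relative_perf_per_dollar(
    cpu_runtime_s: float,
    cpu_price_per_hour: float,
    gpu_runtime_s: float,
    gpu_price_per_hour: float,
) -> float:
    """How many times more work per dollar the GPU instance delivers.

    The cost of one run is runtime * price, so the ratio of CPU cost to GPU
    cost is the relative performance per dollar (the hour conversion cancels).
    """
    return (cpu_runtime_s * cpu_price_per_hour) / (gpu_runtime_s * gpu_price_per_hour)


# Placeholder runtimes: with equal prices, the ratio is just the speedup.
ratio = relative_perf_per_dollar(90.0, 2.0, 10.0, 2.0)
```

With equal hourly prices, a nine times better performance per dollar corresponds to a nine times shorter runtime.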
>> All right, so Wes asked: is the memory overhead from intermediate results a concern when running on GPUs? Are you modifying the operators or execution plans to decrease that overhead?
>> The memory overhead is definitely one issue, and the way we handle it is we just spill the data a lot. We spill an intermediate result back to CPU if we think it's not going to be used immediately.
So that's the way we handle that.
Spilling is a first-class citizen in Sirius.
>> The intermediate results are uncompressed, right?
>> That's configurable. Today we don't compress them, just because we're not there yet, but you can compress them.
You can say, whenever I spill, I will first do lightweight compression and then spill. On the GPU that is a trade-off: lightweight compression still takes some computation, but the data becomes smaller to spill.
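The trade-off can be written down as a simple cost model: compressing first pays off when compression time plus the transfer of the smaller payload is less than transferring the raw payload. The sketch below assumes compression and transfer happen one after the other (no overlap), and every number in the example is a hypothetical placeholder rather than a measured figure.

```python
def should_compress(
    size_bytes: float,
    compression_ratio: float,         # e.g. 2.0 means the output is half the size
    compress_bytes_per_s: float,      # assumed GPU compression throughput
    link_bytes_per_s: float,          # bandwidth to the spill target
) -> bool:
    """Return True if compress-then-spill is faster than spilling raw data."""
    raw_time = size_bytes / link_bytes_per_s
    compressed_time = (
        size_bytes / compress_bytes_per_s
        + (size_bytes / compression_ratio) / link_bytes_per_s
    )
    return compressed_time < raw_time


# Placeholder numbers: 8 GB batch, 2x ratio, 100 GB/s compression throughput.
fast_link = should_compress(8e9, 2.0, 100e9, 400e9)  # fast GPU-to-CPU link
slow_link = should_compress(8e9, 2.0, 100e9, 4e9)    # slower path, e.g. disk
```

Under these assumed numbers compression does not help on the fast link but does on the slow one, which matches the intuition that it matters more the further down the memory hierarchy the data goes.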
Yeah.
>> But I mean, maybe another way to say it is, you're compute bound on these queries, right? The bandwidth is so high that you can shove things in and out.
You're reading from memory, so you can get things spilled out of memory very quickly, but if you have to write to disk, that's the bottleneck.
Right?
>> If you really go to disk, yeah, that can be slow depending on your hardware.
Getting to CPU memory is not too bad.
CPU memory is pretty big, actually. So spilling the data to CPU memory is not too bad.
And actually, on your previous point, we realized the major bottleneck is probably not compute and not memory bandwidth. The major bottleneck of the whole system is memory capacity.
That's a little surprising. GPU memory capacity is the number one bottleneck. So you actually design the system around that. You ask what data should stay in GPU memory and what data should spill, and that really affects performance in a very big way.
>> But related to that, because you don't have enough memory you have to spill to somewhere else, and that's essentially the bottleneck: how fast can you get things out of memory to reclaim the space.
>> Yeah, kind of, yeah.
>> Okay. All right, keep going.
>> Okay. Yeah, so these are some experiments in the distributed setting.
We compare Sirius-accelerated Doris with native Doris and ClickHouse.
It's a four-node cluster. This is only three TPC-H queries at scale factor 100, and depending on the query we get a speedup between 2.4 and 12.5. We're actively working on accelerating Doris, so I think maybe we'll have a more stable version within a few weeks.
And as I said earlier, we ran ClickBench, and Sirius is now actually number one in the world on ClickBench.
On the hot run, which you can see on the top, and also on the combined run. The combined run considers the hot run, the cold run, and the loading.
Because the hot run is so fast, and we also optimized the cold run a little bit, it becomes number one in both. I checked the ClickBench website a few minutes before the talk, so it should still be number one there.
And there is an Nvidia developer blog post talking about this whole thing, so if you are curious about the technical details, feel free to read it. One more thing to note is that we achieve this performance using cheaper hardware.
We're running on GH200.
The GH200 is really cheap. The other entry, in second place, is also Sirius, running on H100. That one is actually on AWS, and it is similar cost, not expensive, but similar cost. But GH200 is like three times cheaper.
And the main difference is that GH200 has this NVLink C2C interconnect, but the H100 uses PCIe.
Finally, some future plans. We're actively developing the system. It is moving pretty fast, and it will become a core extension of DuckDB. We are working on multi-node distributed support and on disk spilling. Right now spilling is mainly to CPU memory and we need to add disk.
We're also adding remote data sources, better data formats and compression, etc. So a lot of things are happening.
The major partners of this project are Nvidia, UW-Madison, DuckDB, and Vast Data, and we welcome contributions from both academia and industry.
Finally, my last slide summarizes everything. We have a list of papers. If you want to check them out, maybe look at the first two. The first paper talks about Sirius in general. The second paper is about Theseus, from the Voltron Data people. They published this paper, and a lot of those engineers are actually working on Sirius now at Nvidia.
We have a website and a GitHub page, and here is the QR code to join our Slack channel.
You can also email us.
I'm happy to take more questions. Thank you very much.
>> All right, Xiangyao, I will clap on behalf of the audience. That was phenomenal. We have time for a few questions. If you want to ask your question first, go for it.
>> Yeah, Xiangyao, really nice talk. I know the Voltron Data guys worked on a bunch of these things before.
>> Oh, yeah.
>> And it's so awesome to see this come through. Really amazing.
My question is, you talked a little bit about data spilling and how important that is.
If you're in the middle of a join tree and you have to decide on the partitioning for the next join in the stage, because it is very data dependent, you may end up with a partition that is really, really large. This is, I think, the skew question that Chris was maybe trying to ask.
>> Mhm.
>> How do you deal with that? That was a huge nightmare when the Voltron guys were trying to attack that. Does the CPU spilling take over and start to deal with that? Or is it a new set of algorithms when one partition overflows by a lot? Because then you have to deal with it at the next point, either by repartitioning or by doing something else.
So, can you say a little bit about how you deal with that? Because cuCascade won't handle it, right? So you'll have to do something outside the database engine.
>> Great question. The honest answer is I don't think we have a 100% complete solution for that yet. This is ongoing.
But I would say one possible solution is, as I said, we have this partitioning operator everywhere.
Partitioning and concatenation are inserted everywhere. They may not be doing any work. We may decide just to skip them.
But they are there as placeholders. So for what you described, if the intermediate result suddenly gets much bigger, we have a partition operator right there.
And we just partition that into smaller pieces.
So partitioning and concatenation almost always come in pairs.
We just ask: is it too big? We partition. Is it too small? We concatenate.
But if it's the right size, we skip both. We just do nothing.
>> Got it. Great. Thank you. Awesome work.
>> Thanks.
Uh question from the audience.
>> All right, so I guess my two last questions. One, I think you talked a little bit about this when you described the custom logic that you implement on the CPU and then plug into the CUDA libraries. But have there been any other challenges or deficiencies you found in the CUDA libraries where you wish you could have full control over something in Sirius, but you have to offload it to CUDA because it's some proprietary Nvidia thing?
>> Oh, that's a very good question.
By the way, the Nvidia libraries are all open source.
It's probably not easy to convince these people, or to propose a change that they accept. But it's actually possible to make changes.
I cannot think of an architectural problem there.
I think it's very nicely designed, and at a high level I don't think there is an issue. You might say the memory management is an issue because they don't allow data spilling. But cuCascade kind of solved that problem, right? So it fits together really nicely.
I think the problems we encounter are mostly with individual operators.
We don't have a problem with the join, but, for example, we saw that the regex operator is very slow.
So we have to do code gen, which is a different tool, a different library. We use code gen to make that fast.
And some other operators may be interesting, for example aggregation with distinct.
That's very slow.
>> Okay. Why?
>> Oh, well, that's for some detailed technical reasons.
You can potentially accelerate it. And strings, sometimes strings can be slow.
It's not fundamental. It can be improved.
But I don't see a problem at the architectural level.
>> But it would be improved in the CUDA library. In the same way, if Amazon makes S3 faster, all these systems that are built on top of S3 get that for free, free in quotes.
The same way, if you give feedback to the CUDA people and say, "Hey look, this sucks, make this better," then you potentially don't have to change any part of the core system of Sirius. You just get that for free because you upgrade to the next version of CUDA.
>> Exactly. We are actually doing that. We are working closely with the cuDF team.
>> Mhm.
>> The way we do it is, when we have a new idea, we usually test it on Sirius first because it's very convenient.
We try the new operator. We replace the previous operator. We show the speedup, and then we use that to talk to them.
Maybe they're interested in integrating this into their code base as well. I think this is already happening. You already see this pipeline. So in that sense, Sirius is like a test bed for cuDF before they more officially accept the optimization.
>> Got it. All right, one quick question. Is the distinct slow because the GPU warps need to sync on some global state?
>> You can partition distinct, right?
It's not fundamental. I'm sure some smart people can go ahead and say, "Oh yeah, this shouldn't be slow. I would implement the distinct so that it is fast." I think it's totally possible. It's just not there yet.
>> Got it. Okay.
>> Or maybe it's already there today, but at least back then, a few months ago, that was one bottleneck.
>> All right, my last question, and this ties into the overall theme of the seminar series.
Obviously, this is written from scratch on CUDA and then you linked it in with DuckDB. At any point did you consider integrating this with Postgres? And if yes, why did you pursue DuckDB instead of Postgres?
>> We actually had a long debate at the beginning of this project.
>> Okay.
>> Which one to choose?
In the end we chose DuckDB. [laughter] One big reason is that it's easier to work with the code base.
It's just shorter, right?
>> Yes.
>> Because it was a research project back then, that's actually a big reason.
And it's more modular.
They designed it to be more modular. It's easier to add an extension.
Actually, you know what? Sirius is a DuckDB extension.
>> Mhm.
>> That's how it works. It's just a pure extension.
It was harder to get started with Postgres. For example, it's a row store and it's a more complex code base. It's harder to make it work. Maybe it could work. We didn't try.
>> I mean, there's PG-Strom out of Japan, but I've never met anybody using that. So that makes sense.