Polars Cloud elegantly bridges the gap between local prototyping and massive scale by maintaining a unified API across distributed environments. Its cache-aware streaming engine marks a significant shift toward more hardware-efficient data processing that could finally challenge the dominance of legacy distributed frameworks.
Deep Dive
A First Technical Preview of Polars Cloud - with Thijs Nieuwdorp
Okay. People are probably speaking English on the other side of the fence here, so we're just going to speak English as well. Hello everyone. Today with me here is Thijs.
He's a developer relations engineer over at the Polars company. There's one little thing I wanted to say as an introduction, partly because I know Ritchie a little bit and partly because I'm a fan of Polars. About six years ago, and I'm going to show you my big claim to fame, I made this one PR after doing a very popular talk.
I have this one PR that adds pipe to the Polars API. I asked Ritchie if a unit test was needed, but the implementation was just a two-liner, and it makes the pipe function work all across Polars. This was the one thing that was missing from the Polars API back then, and I was super excited. PR number 82: I was one of the earlier commits on that one.
And ever since then, Ritchie has been giving me some of the best dev swag ever. Let me stop sharing, because then you can see this on a big screen. I don't know if you still do this, but Polars has some of the best dev swag ever because you have custom-made keycaps, which is totally something that we've copied ever since I saw this. marimo also has custom keycaps. Come to our conferences, etc.
>> I just wanted to start with that. I've been using Polars for years now, and it's really cool to see that we're very close to seeing the first Polars product; that's been ramping up now, is my impression. So we thought it might be fun to do a quick demo and overview, and you're also doing some things with marimo. All that together made us think: let's do a fun little live stream.
>> Yeah, exactly. I still can't wait for the first keyboards that are entirely made up of logos of different companies, where the first letter of the company maps to the key. So the P is Polars, the M is marimo. Curious if we can get that off the ground. Oh well, we were speaking about hash collisions before; I think you're going to get collisions there as well. It would totally be cool though. But I think at this point you're going to share the screen going forward, right? A few people are tuning in live as well. Feel free to ask questions if questions pop up; I will interrupt Thijs to make sure those are answered live as well. But Thijs, take it away.
>> Okie doke. So, Polars: I think it's the first time that we're on the marimo stream with this, but I presume that at this point a lot of the people watching are familiar with Polars. I think we had our first launch about six years back,
>> in 2020, but then it was a small research project by Ritchie himself, and it didn't look at all like what it looks like today. Over the last couple of years it's been gaining a lot of momentum.
>> More downloads than Spark now, right? Wasn't that milestone last week or something?
>> Yeah. I think we hit two million downloads per day on PyPI, which is a questionable metric in terms of how many downloads there actually are, because a lot of it is automated and there are other mirrors that pull from PyPI. But still, it gives us a bit of a proxy.
>> Yeah, it's at least a metric that, in combination with GitHub stars, gives us a bit of a vibe of whether people can find us, and clearly they have. I think we've passed the hundreds of millions of downloads already. With two million plus downloads per day, it's not too bad.
>> I remember six years ago I was convincing people: "Hey, you should give this Polars thing a spin because I think it's really neat." It wasn't just the speed; I remember the API was just so nice compared to pandas. It's much more consistent. Fast forward six years and I don't have to convince anyone at this point.
>> I think people know by now.
>> Yeah. It might be that there are some tricks that people don't know about, the water cooler moment of "I learned a trick",
>> but at this point Polars is pretty ubiquitous in the ecosystem.
>> Yeah, and it's also something you see when you browse social media. I think it was the pandas 3.0 release: there were a lot of people commenting about Polars underneath, which is kind of nice for us because we know that people like it and recommend it, but at the same time it's a bit harsh towards the pandas devs, who have worked very hard on the release, and then a lot of the comments are about a different library.
>> Yeah. The way I always try to be polite about these things is this: as the Polars project, you're probably well aware that pandas happened before you, and if pandas hadn't happened, Polars would probably not be where it is right now.
It's the same thing with
>> from my perspective, I love what marimo does for the ecosystem, but I also love what Jupyter did for the ecosystem, because marimo builds on top of it.
>> Yeah. Right. So like the standing on the shoulders of giants as Google Scholar once had as their motto.
>> Yeah. And there can be moments when, let's say, the whole Rust thing is a reason why people might not be able to use it. So there are still some pandas users out there, and if people like pandas I don't want to be the guy that takes it away. Same thing with Jupyter: Jupyter doesn't just do Python, it does R and a bunch of other languages as well. But it is interesting to see that it does feel like we're in this new age. Before it was Jupyter, pandas and Matplotlib, and now it's a little bit more marimo, Polars and Altair. You do notice that there's a transition in the common data stack, I guess.
>> Yeah, exactly. We're obviously still riding that wave of gaining more and more ground; especially the single-node side is by now very popular. One of the things that we'll talk about today is the Polars Cloud side of things. We've been working hard on a distributed engine, and pretty soon we'll do a release with a self-serve option so you can start playing with that by yourself.
We're also aiming to create a generous free tier so people can start tinkering with this by themselves.
>> And when you say distributed, you mean across multiple machines? Because you always did multiple cores, I think, but this is the multi-machine situation.
>> Exactly. I think one of the things that was problematic for Python was the GIL, the global interpreter lock, where you're effectively locked to one core unless you start getting creative. So a lot of engineering went into using Rust for multi-threading and all kinds of fancy techniques to make use of modern hardware, which oftentimes has way more cores. There was a colleague who had 24 cores in his laptop; imagine using one of them, that's just about 4% of the power. It's kind of sad. So that's where we started on a single node: use modern hardware. And now we're scaling it up to multiple machines, although you can still get very, very far with cloud machines. I saw that on AWS you have machines that scale up to one terabyte of RAM, and I've seen people use 1.5 terabytes of RAM. The amount of data you can churn through on machines like that is insane.
>> As we say in Dutch: ramarin.
>> Yeah, joke.
>> No pun intended.
>> I guess that one actually translates to English quite well too: just ram it in. Anyway, tangent.
>> We're going the wrong way.
>> Yeah, sorry. But you've got these insanely beefy machines. What's the number of CPU cores you can get on a one-terabyte machine? Is it like 128 or something?
>> I'm curious. Let's see the EC2 instance pricing. Let's go >> because you should just sort it by 896.
Then you have this one, with some funky instance name: u7in-24tb.224xlarge. It costs $270 per hour on demand.
Yeah. And it has uh 24,000 gigabytes.
Beefy.
>> I have no clue what kind of machine that is, but
>> it's a big one. But even with Polars, one thing I also appreciate is that you can still do the spill-to-disk thing if need be. So even if it's out of core for that one machine, there are still tricks you can do.
>> Yeah. We've got the streaming engine, and we put a lot of development capacity into that over the last year.
It's a different execution model from what used to be in-memory. Wait, I actually have some slides.
>> Can can you share your screen then?
>> Yeah, I will. So I met you last Wednesday at the Budapest Data and AI conference.
>> Days ago. Yes. Briefly. Yes.
>> Exactly.
>> Let's see.
>> There should be a share button.
>> I'm just still finding the right slides.
>> Okay.
Share. And now I hope that I'm allowed to
>> Screen or presentation?
I would go for screen; that's usually the easiest. But you can also pick an app. I think it's the Chrome selection thingy. That's it, right?
>> Uh looks like it. Yep.
>> Okie do.
Yeah. So, if you've got a very basic query right here, let me put it full screen. Oh wait, that might be a bit too full screen.
>> You can also zoom in manually, but I can see the DAG. So you've got this:
You've got two inputs from Parquet files, a join that slaps them together, then some aggregations on top, and then the output. But that's not super relevant. The idea is that the default engine that we currently still have running uses an in-memory execution model. It walks through every step one by one; tree walking is what it's called. So this step is performed, everything is loaded into RAM, and you go to the next step. In this case, more is added to this. Wait, I'm not sure if you can hear my dishwasher in the background, but
>> Oh, okay. I thought someone was making coffee, but that's okay.
>> So this all gets loaded into RAM, and then you get the join operation, which can read from RAM and process everything. The cool part is that you have all data available, right? If you want to start doing some operations, you have that data available in RAM; you can scan through it and do all kinds of fancy stuff with it.
>> It's cached, as it were. But this does no predicate pushdown whatsoever?
>> This is including the predicate pushdown. So this is already with the lazy API. You can see, for example, that the filter is pushed into this scan node here.
>> Right, exactly. So you still do some clever things, in the sense that you try to read as little as possible. But it's not streaming quite yet, because you do cache it in memory.
>> Yeah. In that sense we have the eager API, which is the first level: if you have your query, it runs step by step in the order of your code. So you start with two files, then you join them. It instantly joins everything in memory, and only after that, when you start filtering on the color green for fruits, for example, is part of your data set removed. And if you use the lazy API, you create
>> Filter up more.
>> Yeah. So you create your query as an entire graph first, and you can apply optimization rules to that. You can see, for example: hey, with this filter, if we're removing all the green fruits, we also don't have to join those in the first place, right? So you push this filter down all the way to where you start scanning a file from disk. I don't even want to have the green fruits in my RAM in the first place. So that's the first optimization step we have.
>> You only want the green fruits in your RAM.
>> Oh, yes, sorry. Exactly. You only want to keep the green fruits; those are the ones you want to load into RAM and the ones you will join. The rest you can already discard while you're reading through the data, for example.
>> One thing I just want to quickly mention, because I'm really old school with my Polars API: I remember you would call pl.col as a function and put a string of the column name in there. What I see you do over here is that you can use a dot as well. So is that a recent trick that got added?
>> This is already something that's been in there for quite some time.
>> Okay.
>> But these slides were made by Osul, and this is his preferred method of column selection.
>> Right. Okay.
>> I think the downside is that if you have a column that's named
>> "two color" or like
>> "fruit color", then you can't write it with the dot, because Python doesn't allow this, and you have to fall back to doing it
>> like so.
>> Yeah. And I think >> and you can just slap a string in there.
>> Yeah, exactly. I think you can't do numbers either; there are a few of those caveats.
>> Exactly. Oh, you can't start with a number, I think.
>> Yeah. So my rusty old habit is fine, is what I'm hearing. That's good.
>> Yeah, effectively. But this is one of the options.
>> Sure. Okay. Just wanted to double check.
That makes sense.
>> And uh >> and a sale percentage at the bottom.
Yeah. Yeah.
>> Yeah, exactly. There's also a notation for this that could be the same as if you do
>> alias. Yeah. Sometimes I do like to use alias, actually. Depends a bit.
>> Something like this. Yeah.
>> It's the same thing. The limiting factor is that after this you can't add a loose pl.col expression with an alias anymore, because you used a keyword argument first, and you have to keep putting the keyword arguments at the end. You can't throw loose
>> things in anymore. Yeah. So that's a Python thing. Most of the restrictions that we run up against in the API are just because Python doesn't allow us to overload
>> the keyword arguments versus normal arguments phenomenon. I can't blame you for that. Makes sense.
>> But this is still like the normal eager API.
>> Yeah. So this is the step-by-step thing. With the lazy API you still get the same result. If you're working with data frames that are in memory, the DataFrame is the representation for the eager API. If you want to switch to the lazy API, where you first build the whole query as a graph and then apply optimizations to it, which can be something like 10 times as fast as eager, you can swap between the two by calling lazy on a DataFrame. From there you create a query graph whose input node points to memory and says: this is where the data is stored, and after that we're going to do all these operations. From then on you're working with a LazyFrame, which is the lazy API's representation that can be optimized. With a collect you say: okay, now I want to run everything, optimized with the full information of what the query is going to look like. That way you can often run it in a more advanced or speedy way.
>> Yeah. Except in this particular case you're assuming the data frame is already in memory. A Polars scan instead of a read would be the way to read it in, I suppose.
>> Yeah. If you want to make full use of the lazy API, you scan from the location where you want to load the data, and then you can push these filters into the scan, as we talked about first. Here, sales is already an object in memory, so we can't filter it while reading from disk, for example; the filter comes after.
>> So this then collects it back into memory, much like PySpark also has a collect: we run it now, and all the results are put into memory.
>> So that's the difference between eager and lazy.
>> And it feels like we're going to talk about the difference between this and distributed at some point.
>> We will.
>> It feels like we're walking towards something here.
>> Definitely.
So this is what we talked about with scan_parquet: when you're using the lazy API, these filters are pushed all the way down into the file scans,
>> and this way the data that has to be joined, and that we then run the select on, is already smaller. So we prevent unnecessary I/O and unnecessary computation.
So there are all these different optimizations that we can do this way.
Also, if we see in the query graph that parts of it are duplicated, say we reuse one data frame in two parts of the query, like a self-join, then we just cache it, and both points that consume it use the same computation, so we don't do the work twice. All kinds of stuff like that. Even expressions that you reuse are cached.
So there's a lot of smart stuff going on. Going back to the engine that executes it: so far we've been talking about the API, which is the queries that you type, where you tell Polars how you want them executed, either eager or lazy. After that the engine starts doing its thing, right?
>> Machine goes brr.
>> Yeah, exactly, and then data rolls out. But how that happens on the machine is determined by the engine. The graph is compiled into what we internally call the intermediate representation, which sits between the code and what the engine does. This gets optimized, then gets handed to the engine, which makes a physical plan and starts executing it.
>> And this execution model is more under the hood. Currently we use the in-memory execution, which executes step by step and stores stuff in memory. The downside is that it can be very memory intensive. You can imagine
>> especially when there's a join in there, that becomes the hot bit.
>> Exactly. This already requires that the input data fits in memory, right? I'm going to read this, then I'm going to read this into RAM. If it doesn't fit, the query crashes, you're out of memory, and that's it; you just have to get a bigger machine. The nice part is that because you have everything in memory, you can do all kinds of aggregations and go crazy with what you apply to that data, because you just have it available. But the downside is that you need to have it available. So you need enough RAM, and especially with AI eating all of our RAM these days, that's not something we want. So over the last year we've been building a different execution model, which is called streaming. The idea of the streaming engine is that it still uses this lazy interface, so in terms of your code nothing changes. For now it isn't the default yet, although in the upcoming release we will stabilize it, which we could have done a long while ago; we just found out it was still unstable, so we have to stabilize it first. You can already start using it by setting it to this, which I recommend, because so far we've seen it speed up queries by three to seven times.
>> And if I remember correctly, if you had a GPU, there would also be an engine you could specify there, right? There are multiple settings: the streaming one is going to be the default soon, but you could also apply GPU or other things in there.
>> Yeah, exactly. In-memory, streaming, and GPU are, I think, currently the options that we allow. With the GPU engine you can set all kinds of specific configs related to how NVIDIA will process it using CUDA. You can set different kinds of memory pools, and you have a lot of freedom in the config there,
>> and go to town.
>> Yeah, you can go crazy. I dove into this for a blog post a while back and there are a lot of options; comparing all these different models to each other is quite a thing. But for now this is the way we recommend doing it. Additionally, at the top you can do something like pl.Config, and I think it was set_engine_affinity, and just set it to streaming, and then any collect call
>> you can just leave like this,
>> and then all these collect calls will try to run on the streaming engine.
>> And that runs it in a different way. The nice thing about the streaming engine is that it chops all the input data up into essentially micro-batches. They're called morsels. A morsel is just a slice of the data frame that is taken as input, and we try to process it as far as we can. Because it's just a small slice, we can fit it in the CPU cache, so you don't have to swap it out to RAM and back, which is fast, but it's faster to keep it in CPU cache.
So we try to apply all these operations as far as we can on this slice while we have it in CPU cache, which gives a lot of cache effect.
>> I would imagine you still have the predicate pushdown, as in: if you can filter earlier, you're still going to try to do that in your execution, right?
>> That stays the same. It's more that once it's time to actually start rolling with it, you're going to use the CPU cache as much as possible going forward.
>> Yeah. So these optimizations are performed on this intermediate representation.
>> That's something we saw before: this is already the optimized version, because we can see that we have a filter in the scan.
>> So there's some sort of a plan that we serialize before we give it to Rust?
>> And that still tells the engine what we are going to do, and then the engine turns that into a physical plan, which is how we are going to do it. The how is different for the in-memory engine compared to the streaming engine.
>> Makes sense.
>> So, roughly: chop it up into small bites and process as much as you can while you have it in cache, and only if you have to swap it out do you put it in RAM. Usually there are some blocking operations. You can imagine, for example, that with a sort operation all the data needs to go through that operation, because you can't sort things you haven't seen, right? You don't know where a row fits until you've gone through all the data. So there are these blocking nodes,
>> and up until that point everything is, for example, selections, where you just take a couple of columns and drop the rest. Sorting unfortunately is not embarrassingly parallel. Fact of life.
Unfortunately.
>> We had it here as well, for example with the head operation: if you apply selections and filters before it, you don't know what the top 10 of your data frame is going to look like, because maybe the first 10 rows are filtered out and it's rows beneath that. So everything here you can do in parallel, up until a point where that's not possible. We try to do as much as we can in parallel.
And the nice part here is that you can just thread it over cores, make maximum use of a multi-core system and of this cache effect, and in the end also make use of multi-threaded writes, for example when writing to the cloud. You get a lot of advantage from that streaming model, and in the upcoming months we want to make it the default engine. So far we haven't, because there might be some edge cases out there where the streaming engine doesn't quite do what we want or might give some errors. If it throws an error, it's always a bug.
So please let us know. Otherwise, there are some operations that we don't support yet on the streaming engine. Just for that specific node we fall back to in-memory execution, and afterwards we start streaming again. So we still try to use it optimally.
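The morsel idea can be illustrated without Polars at all. The following is a deliberately simplified, single-threaded sketch in plain Python, not how the streaming engine is implemented: the function names, the morsel size, and the data are all invented. It shows a filter and a projection running per morsel, a partial aggregation keeping only small running state, and a sort acting as the blocking step that has to wait for all morsels.

```python
from collections import defaultdict
from itertools import islice


def morsels(rows, size):
    """Yield consecutive slices ("morsels") of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def streaming_sum_by_color(rows, morsel_size=2):
    totals = defaultdict(int)
    for morsel in morsels(rows, morsel_size):
        # Filter and projection need no knowledge of other rows,
        # so they can run on each morsel independently.
        kept = [(r["color"], r["amount"]) for r in morsel if r["amount"] > 5]
        # Partial aggregation: only the running totals stay in memory.
        for color, amount in kept:
            totals[color] += amount
    # Sorting is blocking: it can only run after every morsel has been seen.
    return sorted(totals.items())


rows = [
    {"color": "green", "amount": 10},
    {"color": "red", "amount": 3},
    {"color": "green", "amount": 7},
    {"color": "red", "amount": 20},
    {"color": "blue", "amount": 6},
]

result = streaming_sum_by_color(rows)
print(result)  # [('blue', 6), ('green', 17), ('red', 20)]
```

The result does not depend on the morsel size; a real engine additionally distributes morsels over threads and sizes them to fit CPU caches, which this sketch leaves out.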
We're effectively waiting until we have a bit more developer capacity to catch some of the bugs before we set it as the default. But we've already stabilized it and recommend that everyone start using it. I think I still have a fancy plot to show the difference in terms of performance. This is one of the benchmarks we might also use today. It's PDS-H, which is derived from TPC-H, but because of licensing we can't call it the same thing. We compared the in-memory engine with the streaming one, and you can see that generally it's a lot faster; there are some queries where you can see up to about seven times performance improvement.
>> And of course you can imagine very small differences. Yeah.
>> Actually, one time, 22, it's
>> This is a presentation from a while ago, so I hope we've fixed this by now.
>> But still, orange line smaller.
>> So that's also why we want to use this as the go-to engine: generally it's just a faster model.
>> Makes sense.
>> But this is still all happening on one machine, correct?
>> Yeah.
>> Exactly. This is all single node. On top of it running faster, we also see that it can generally use less RAM as well, depending on your query and data set. You can make so many wildly different cases that it's always very hard to throw out blanket statements as Polars. But for example, if I have a 50 GB data set, the peak RAM usage on some queries might be 10 to 20 GB, because it can process the data and already dump results to disk while it's still reading from the file, since the whole query is being executed while you're reading. So that already brings us a little bit into larger-than-RAM territory.
Something we're currently working on is out of core, or spill to disk, whatever your favorite name for it is. The idea is that if, even in this query, you start to run out of RAM to store stuff in, you swap to disk as well. There might be points where, while running a certain operation, we need to pull a data set into RAM. Right now we would crash at some point if it runs out of RAM. We're working on making sure that we can also process queries there, at a performance penalty, because swapping stuff between a hard drive and RAM costs performance, but at least you'll finish your query, which is nice.
>> Mhm.
>> So that's the single-node side of things. We're working very hard on making the combination of vertical scaling and out of core get you very far in terms of data set size. We're aiming at getting you to terabyte-size data sets on just a single node, which is not something that used to be possible.
Uh and on the other side of things, uh we're working on the distributed engine.
So this is what we pride ourselves in. Oh, I think Robin left a note up on the slide there.
>> That's what you get when you see it in the edit menu instead of presentation mode.
>> Yeah. What we used to have is this gap where pandas on a single machine kind of works up to, depending on your machine, 10 to 50 gigabytes, and then you needed to swap to something different. Maybe you use Dask to cover this gap, or you move to Spark and have this overprovisioned cluster. Spark is lovely for distributed large-scale queries, but the use case that I saw covered most of the time is that you create a slice of data so that you can work on it locally with pandas and do your thing. That has always been the use case I've seen the most for Spark: the Spark sampling engine, where you sample so it's small enough that you can do it on your laptop again.
>> Exactly. Yeah,
>> and that's a bummer, because you have the cluster maintenance, you have a lot of machines, or you just have spin-up times of a couple of minutes before you can actually get everything running. That's annoying. On top of that, if your data grows for a query: you work on the sample on your machine, you do your analysis with pandas, and then you have to rewrite your pandas back to Spark. We don't like that. That's why what we've got is this: the query, much like the code you saw before but oftentimes a lot more complex than just a join and a filter, can all be written in the Polars API, and at the end you just tag how you want to run it. For a single node you can collect with streaming, collect in memory, or collect on GPU, and for the cloud you add remote with the context and an execute.
So this is the Polars Cloud API that we're working on. It's not stable yet.
We're looking to see if we can integrate it with the Polars API, or what's most ergonomic; it's something we're experimenting with. But currently it looks like this. In it you can define the context that your query will be running in: a compute context. For example, you link it to the workspace of your Polars Cloud account. You can create a couple of workspaces, where you have all the queries that you run and the links to the compute that is running or is already offline. You can create templates of what kind of compute you want. Here I define a certain instance type and a certain cluster size, and it will provision 10 of these machines and start running Polars on them. But I can also register a cluster up front: I name the cluster Harry or whatever, give it certain specs, and then just refer to Harry in this compute context, and that cluster will spin up and process the queries. Then you execute it. This is more aimed at the enterprise level of data: if you're working with many, many terabytes or you hit the petabyte scale, then even a chunky single machine with out of core is just going to be slow and you need more. That's what Polars Cloud aims at: that you can still grow to this distributed scale without having to change API, simply by changing the engine that runs underneath.
>> There's one question that came from the audience while this was going. So: did Polars have to come up with its own query tuning and optimization algorithms for lazy evaluation, or were they able to borrow some from the SQL database world? Could see how the language can get in the way. And I think this question was asked when we were talking about the keyword arguments and arguments.
>> I would imagine, me knowing Richie, and I think you can agree, he's really into database stuff.
>> Yes, absolutely.
>> Like, he's on top of the papers and all of that in this domain. So I would be highly surprised.
>> Yeah. Obviously we have to figure out how to put this into the Polars engine and the Rust system and everything that's relevant there. But a lot of this is just database research that has been done already and just wasn't applied. We take that and build it and tackle the challenges we run into along the way.
>> And that gets you very far, it turns out.
So there are definitely strong links with the academic side of things. I think also some people that have done PhDs are now working for us, and some people that worked for us are now doing PhDs.
So again, >> this academic connection is strong.
>> Yeah. And it's also one of these fields where you have to really drill down to make a dent in human knowledge. But if you do make a dent, then that can mean that a query suddenly does run. So it's definitely a niche, but a niche where you could make a big impact if you just happen to have that specific query problem.
>> Yeah.
>> Cool.
>> So yeah, and I think during the talk I was also talking a little bit about the stage graphs and fancy stuff like that, but that's something that I think is better shown in Polars Cloud.
>> So this is all really good background, but I think in the title we sort of imply that we're going to do a live demo of the thing as well. I can imagine there are like 20 people watching live. We should give them a live one.
>> I did a ton for a slide deck.
>> Yeah, exact notebook. Nice. Nice.
>> Yes. So, I think the main way I use Polars Cloud is I type in uv run --isolated --with polars and polars-cloud, and then you do marimo edit. You get this interface, and then you have effectively what you need to run Polars without having to create folders and all kinds of stuff that you want to put in here.
>> You don't need a virtual environment or requirements.txt file or any of that.
>> I just want to do an analysis, throw a query towards Polars, and that's it, effectively.
Uh let's see which is a fun one to take.
I think the ADC one.
>> Okay, we're let's go through Oh, this is okay. Dutch references as well.
Excellent.
>> Yes.
>> Yes. Get to be Dutch. Um, okay. But so you set the engine affinity on top.
So first cell, right? You're basically saying, okay, the whole thing is streaming. So you're going to see collect happen in a bunch of places, but you're not going to see collect with engine streaming, because that's the global we set.
>> So I think here I wanted to do some comparisons between pandas and Polars.
And, oh yeah, I didn't install this, but >> that's what marimo's for. Maybe a package is missing and it will take care of it.
Yeah, >> it's nice when it just advertises itself like this, right?
>> Yeah.
>> Let's see.
Then everything imports magically. I think the only thing I ran into is that the first time you run Polars, it imports the 32-bit index space. But if you then start working with Polars Cloud, it requires the 64-bit one, and Polars is already imported. That's the only thing.
>> Oh, so the order of import matters.
>> Yeah, exactly. And then you just restart the kernel for the first time ever in marimo, and then everything works quite nicely.
>> Let's see. I think NumPy I only used to create this super real data set that I write into orders.
>> Okay, fair enough. Yeah. Yeah.
>> And I had to make it beefy enough to notice the difference because >> I would have also accepted super real orders.parquet as a file name. But let's see >> ndarray.
>> Did something break?
Of course it broke.
Series constructor, unsupported ndarray for values. I thought you accepted NumPy, right?
>> I thought I did. So, just do the thing.
>> Yeah. Okay. Well, >> see what happens if everything magically works now. >> The shortcut, by the way: I always do Command-K. That brings up the command palette. Then I just type restart and then I hit >> nice fancy thing.
>> And otherwise, I just need to start a marimo pair so I can keep talking with you while the AI fixes everything. Ah, there it goes.
curious though. Uh it might have been because of the installation thing.
Anyway, we'll go on.
>> But you need to install PyArrow. Yeah, that's an annoying one sometimes. >> So, this is just the palette, was it Command-K, and then you can >> Yeah, restart. So, okay. PyArrow is really annoying, because the moment you import it and it's not there, it's sort of like there's a flag set.
>> Yeah. Um I ran into that too, but >> it's quickly fixed.
Still generating the data and writing it to a file. I already wrote it to a file, so it wasn't even necessary, but still.
>> Yep. Okay. But now we read it using pandas.
>> Oh, a pipe.
>> Yeah. Yeah. Well, I didn't implement the pandas one.
>> Oh, fair enough.
>> Can't take credit for that.
>> Copy the homework to Polars.
>> Yeah.
>> Change it around a little bit so they don't see it.
>> Um, >> okay. So, 3.8 seconds is what I sort of see there.
>> Yeah, exactly. And uh just some small aggregations, some uh filtering. Uh >> yeah, >> and sort at the end.
>> Yeah.
>> And so the first thing we start with is eager. The difference you can see between eager and lazy is that I start with a read_parquet. Read just gets everything into memory and starts working in eager mode from there.
>> And if you want to go lazy, use scan_parquet. That's the only difference in terms of code between these APIs, and of course at the end you have to do the collect. Because otherwise, I think we've seen some benchmarks where they compare the read speeds of Polars with pandas, and they do a read in pandas and a scan in Polars, and Polars is like an instant read. >> There was a slight controversy with an education provider that shall not be named, but I remember the incident. Yes. Yeah. Um, which I had nothing to do with. >> Not at all.
>> No, there you go. Eager is already pretty darn quick.
>> Yeah.
>> Luckily. It would have been very, very rough if it was like 50 seconds now for some reason. But this is mostly because we do stuff in Rust: the whole critical path is written by us in Rust. So from the moment it starts touching disk to the moment it prints stuff out, that's all in the Rust engine. The Python shell is very thin. This just gets translated very quickly to expression objects, which is relatively instantaneous.
>> I mean, I remember the first time that I ran Polars, I discovered that doing the whole analysis was faster in Polars than even reading the data in pandas.
>> Yeah. Um >> let's see, because for example this just creates an expression object. So >> doesn't run anything >> it's not a thing that runs; effectively you create this tiny tree of operations. I think with meta you can even show a graph of this, and then you get >> the representation of the expression, which is just: we take the column, we sum it, there you go, bon appétit. >> And so this is all quite quick. You throw this into the Rust engine and Rust starts crunching numbers.
But we can do better. So for example, we had this filter operation, and this already kicks out a lot of the data of this Parquet file that we don't have to read. We only pick a couple of columns, and we can already apply this filter while we're reading the data. So we don't have to put it into RAM and filter it later.
>> Let's see. Because you can see >> a more elaborate graph, but there you go.
>> Yeah. Yeah. Exactly. So this also has a bit more of the technical information.
So one of the things you can see here is the pi and the sigma.
Pi refers to the columns. We only pick five out of the eight available columns in orders.parquet.
So these are the columns that we're working with here. I think we have the revenue, region, category, date and status. Those are the only ones we touch. We don't touch the others, so why read them? And the sigma is the filters that we apply. So we can see this date has to be larger or smaller than this. >> It happens all the way at the start.
Yeah.
>> Everything after is basically smaller.
>> Exactly. Status has to be completed.
>> So this is already applied while we're scanning, to get us into the streaming pipeline. That's kind of like the gatekeeper, the bouncer that checks the IDs. It's like, no, you're not coming in.
>> One quick question. What type is that object that comes out?
>> Uh do this.
>> Yeah. So if you just do type around the final line there.
>> Let's see.
>> Okay.
>> HTML.
>> Okay. It's a marimo HTML object. Okay.
Now, so the thing I was thinking: there's an opportunity to make it slightly more interactive and fancy, but you're already emitting it as HTML. Um, okay. No, makes sense. >> I think this might actually be the in-memory engine we're looking at, and if you want to go to the streaming engine then, oh, it's like, yeah. >> It's like a line changes. >> So what we're looking at here is this intermediate representation with optimizations on. I think you can turn the optimizations off and then we can see, oh wait, what am I doing? Boom. Wait, what? Optimized is false.
>> Yeah, it's the Yeah, >> that's it.
>> So here we can see we're not applying filters in this Parquet scan. Afterwards you get these filters. So this is effectively the idea of eager, right? You first scan all of the eight columns. We don't apply a filter here.
Then we start filtering, filtering, filtering, and then we run everything. But if you optimize it, then it looks different. And >> before you move on, could you move the filter after the aggregation?
Do we still, like, I wonder: is the filter still moved up before the group by?
>> So if we uh oh wait might not have the >> oh we might have a column missing >> because we do the group by >> oh with the aggregation.
>> Okay never mind.
>> But generally, if you put the filter further up the tree, with optimization it still gets pushed down all the way to the scan node.
>> Yeah exactly. Yeah.
>> Yeah. So I think here it's a bit hard because we immediately do the group by. >> If we were to filter on revenue, for example, I think that gets pushed through all the way there. >> Makes sense. >> And so by default you see the optimized version. >> And I think, was it plan stage? So this is that intermediate representation, right? It still just shows our query, but with these optimization rules applied. If you want to go hard on the streaming engine, you can see what the streaming engine will do if it gets this plan and starts executing it. So here you get a bit more, like you can see a multiscan. This is like the >> One thing you can do: there's the bar where the output and the input exchange, right next to the red delete bucket. Yeah. So okay, a little bit more to the right. Go down. Not that one. That one, yeah. So that will expand the entire output, so you don't have to scroll the entire time.
>> Nice.
>> And there's a legend which you can now see which is legendary.
>> Yeah. So I think in this case the colors kind of drop off, but you can see the yellow is a potentially memory-intensive node, and in this case that's the group by, because data has to be partitioned: you need to have the entire group's data, and to be sure that you have all of the group's data, all the data has to have come through this pipeline. So this is one of these nodes that can be a blocking operation.
Same for the sort. And in rare cases you'll see a red one, which is the in-memory fallback, and then just that node is performed on the in-memory code path. >> And when does that happen? If there's something not implemented, I suppose? Or when would that happen?
>> Yeah. So I had to ask some of the devs, hey, do you have a node like this for me in examples, because I could never have found one myself, >> because it's more the edge cases. I think it's when you start doing some more aggregating expressions with an over, so that you're performing it only on a window of the data, and start flipping between different columns or >> sorted partitions. So if you do sorting within a partition but the partition itself is huge.
>> Yeah, something like that. If you start doing multiple things in an expression at the same time, involving multiple columns or groups, then it might not be implemented yet, because it's just more of an edge case.
But generally, with the more standard queries that you do, you won't see it anymore.
>> And oftentimes if you find queries like this, please let us know; then we can see if we can also lower it to streaming, is what we call it.
>> The UI will also update with an email field at the moment.
>> Yeah, exactly. Just click the button, it auto-opens an issue.
>> Yeah, exactly.
So one of the things we can see here is that the streaming engine often has these paths where it first creates a temporary column, which it can then use here.
>> So for example here we do a group by on some temporary columns and then here we >> kind of do a select. Yeah.
>> Yeah. So we have the freedom to overwrite this, and then we do a select on this frame in between, and then we can discard information from here before here. So we use only what we need.
So this gives you a bit more information on what happens under the hood.
Um >> I guess we now got to run it.
>> Yeah. Good time. So what we're doing here is the result lazy. We can do it with a normal collect, because earlier in the code we set it to the streaming engine, and with perf_counter we get the fancy performance counter, which we can benchmark with.
Benchmark is a very, very heavy word for what we're doing here.
We just time one function call, which is hardly benchmarking, but still we get a better idea of how fast it runs.
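The single-call timing described here can be sketched with the standard library alone. The `timed` helper is a hypothetical name for this example, and the workload is a stand-in; in the notebook the callable would be something like `lambda: lazy_query.collect()`.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without any data files.
total, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.4f} s")
```

One timed call is noisy (caches, disk state, other processes), which is why it gives an impression of speed rather than a benchmark.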
>> And here I'm able to cut it down to, I think it's like half. Yeah.
>> Yeah. Yeah.
>> Even less. Cut it down to 0.7 seconds.
>> And generally what I find myself doing, when I'm working in notebooks like these, is that oftentimes you're there to explore data or to create visualizations in the meanwhile.
That's when I tend to stick to eager, if your data set allows it. If you have super large data sets, then you can also sample a bit and start working with that. But generally, for exploration, the code path of the eager API is very practical, because you can just kind of, like, one of the things I tend to do.
>> Oh you can do caching right like that's the thing you can kind of say like oh I've got this intermediate result I just have a couple of quick charts for it.
>> So you can see, hey, I filtered; what am I working with? The next step is going to be a group by, but what do I have right now? That's something that's just easier to do. >> Then you can just comment code in and out; it's a bit easier to work with. >> A trick that I sometimes use, so you know that pipe function that we talked about earlier? >> And this is also something you can only do in eager mode. What you could do is say, I have this function called echo, and it's just going to echo five lines randomly, and then I pipe that somewhere in a giant pipeline. That helps me debug: I expect it to look like this here, I expect it to look like that there. So putting it in at two steps is a trick I sometimes like to use, and that also doesn't really lend itself nicely to non-eager workloads.
>> Yeah. Nice. Yeah. So that that's that's just kind of piping a print effectively, right?
>> Yeah. Exactly. But then I can have prints in multiple places and I sort of understand where in the pipeline something unexpected might be happening.
>> Yeah. Exactly. So that's generally it: if I'm constructing a pipeline or if I'm doing visualizations or EDA, that's where the eager API is very practical. And then usually if you want to run a query every Monday at 4 in the morning, then you just do it in a lazy way, because you just want to get it done as fast as you can.
>> And I think also some of the utilities, right? You've got the plotting utility and there are a few of those things. I think they all demand eager, if I'm not mistaken. So you can always call collect, but I think >> I think more and more we're seeing that you can just hand a lazy frame to some APIs as well.
I think especially because we have Narwhals, which is kind of this >> yeah, but does your, oh, we can actually check. So if you do the lazy query and then plot, does that actually exist? It might be that the API updated. I vaguely remember that wasn't a thing that worked. >> Oh, this, yeah, so for your internal one, right? >> Yeah, yeah. I think for this you do need to collect it, right? >> Yes, exactly. >> Does it? No. >> And then .plot; if we see it exists, then we know enough.
>> Can we borrow it or is it just uh needs more?
>> Oh, you don't care. Okay. Okay. Point point made. Point made.
>> Yeah. So if you want to use this namespace: the plotting namespace is only available on the eager API.
>> Yeah.
>> But there might be some external libraries that do accept a lazy frame as well. And then they will collect >> do the, yeah, exactly, do the collection internally.
>> Um, okay. I kind of want to move down, because the next header was something about cloud and that's what we're walking towards.
>> But something is telling me that this demo may not be as quick as what we just did.
>> Yeah, exactly. So, we've got this super tiny file. This is just a local Parquet file, right? So this is >> lazy is also nice here, but this is mostly to show the jumps in performance you can make by going from eager to lazy.
>> But of course, here I've got slightly different examples. So, the one >> start the 12 node cluster. Yeah, you can go hard, baby. In the T playground.
>> Yeah.
>> And so one of the things we work a lot with is this TPC-H benchmark.
In this case the SF refers to the scale factor. This benchmark has a scale factor, which >> indicates what the size of the entire data set would be if it were in uncompressed CSV.
>> So in this case that would be a terabyte of raw data. In practice we store stuff in Parquet, because Parquet is nice and >> yeah, you can do columnar compression tricks and >> yeah, there's a lot of stuff that Parquet does better than CSV, because it's just a more advanced format. CSV is just text, right? You can compress that too, but the nice thing about Parquet is that it has a lot of columnar compression, so on the same data type you get compression. It's also got schema information. With CSV we always have to scan the first couple of rows, I think by default 100 or a thousand, and then we check what the data type can be. >> Assumptions. >> Yeah, we're just like, this kind of looks like a float because it has numbers and a dot, and then if later on it turns out to be a string, we just crash on it, or you need to provide a schema. With Parquet you don't get that. With Parquet we just see: this column, this data type. >> Everything's in there. >> And on top of that, if the data file is large enough you get row groups. So you not only have columns, you also have row groups, and those also have statistics. So for example, if I were to sort on a datetime column and I only want to grab a week, then from the min and max of this column you will see whether your date range is in there or not.
>> Yeah, you can pre-calculate some of those things on partitions within the Parquet file, right?
>> You have to set a flag, if I remember correctly, if you want to store details like that. But boy, does it speed up some queries.
>> Yeah, absolutely. And that's one of the parts. We don't read certain columns if we don't need them, but we can also skip row groups if we see in the metadata, in the statistics, that they don't contain the range of data we're looking for. And there are plenty of tricks like this. I think there's even room to put effectively a dictionary of different flags, so you can even mark in the file whether stuff is sorted or not.
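To make the row-group skipping idea concrete, here is a toy model in plain Python. It does not read real Parquet metadata; `RowGroupStats` and `row_groups_to_read` are hypothetical names, and the min/max values are invented to mimic per-row-group statistics on a date column.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowGroupStats:
    index: int
    min_date: date  # smallest value of the date column in this row group
    max_date: date  # largest value of the date column in this row group


def row_groups_to_read(stats, start, end):
    """Keep only row groups whose [min, max] range overlaps [start, end]."""
    return [s.index for s in stats if s.max_date >= start and s.min_date <= end]


stats = [
    RowGroupStats(0, date(2024, 1, 1), date(2024, 1, 31)),
    RowGroupStats(1, date(2024, 2, 1), date(2024, 2, 29)),
    RowGroupStats(2, date(2024, 3, 1), date(2024, 3, 31)),
]

# One week in February: only row group 1 can contain matching rows.
one_week = row_groups_to_read(stats, date(2024, 2, 10), date(2024, 2, 16))

# A range that straddles January and February needs two row groups.
straddling = row_groups_to_read(stats, date(2024, 1, 25), date(2024, 2, 5))

print(one_week, straddling)
```

The statistics only allow ruling row groups out. Rows inside a kept row group still have to be filtered, and the skipping works best when the file is sorted on the filtered column so the ranges do not overlap.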
>> I don't think there's a specification for that yet, but I'm not quite sure, because Parquet is also moving. So >> yeah, but having a pre-sorted flag that guarantees it would be amazing. >> In case you know that your data is sorted, you can always do that set_sorted call, and then the engine will just assume it's sorted. You pass this information to the engine so it can make optimizations on it. >> But if you set that thing and it's wrong then you're gonna >> yeah, if you tell Polars wrong information, we're going to do wrong things. >> Yeah, but there's no checking mechanism either, right? >> No. You can do it by doing a sort. But then, in case it's already sorted, it's not necessarily a no-op. If we don't see in the Parquet file that it's sorted, we will try and sort it even if it's already sorted.
>> So that may be suboptimal. So those are some of the options you have.
And, yeah, so this benchmark has got sales data. It's all synthetically generated. And here we've got a query where we have some line items grouped by a certain key. Also not necessarily the most natural text, but >> sure. But also, just to really point out, this is actually a pretty big file. We're not doing the thing you just simulated. We're doing this on a big >> Yeah. So the ones we have, these are the data sets that are publicly available. In it we've got a couple of Parquet files. One of the largest ones has got all the items. You've got orders, suppliers, nations. And this query tries to find the suppliers that kept orders waiting. Because you've got multiple suppliers per order, and of course the one that keeps the order waiting the longest is someone you want to find, which you can imagine can be a bit more of a complex query. I think we've got a self join in here. So we start with this, we do a group by, and then we join it on itself again. We join suppliers, we join nations, we join orders. So it's quite join-heavy.
>> After that we apply a filter. Yeah, which gets pushed down, luckily. Another group by, and then the sort, because we want the top hundred that are >> the roughest here. And because this is on S3, I don't want to process a couple hundred gigabytes over my internet connection, because on top of the fact that I'm streaming right now, my internet connection at home is not that fast. So we're going to want Polars Cloud to do this for us. Let's see. The first thing, of course: we authenticate.
>> Everyone look away.
>> Don't don't look at my password. No.
Yeah. Yeah. Exactly.
>> Yeah.
>> You can connect.
>> Do you always have to do this, or, I'm assuming you also have environment variables at some point >> for batch jobs and stuff, right? I think right now, I'm not sure if we have keys for this. We do have service accounts that provide keys, but those are generally for the automated role. If you do it yourself, I think generally we work with this pc.authenticate a lot. But there are options for service accounts with environment >> so what we're demoing here is definitely the use case of: person is in a notebook situation. But if you then want to, quote unquote, put it on Airflow or something like that, then we do have a different system for that.
>> I think there might actually still be an Airflow service account somewhere in here. Oh, not in this environment. For an Airflow post a while back I made a service account, but I think we cleaned the environment in the meanwhile. So, this is my happy playground. As you can see, there are not a lot of users other than me. I've been playing around with some queries recently, but generally it's not the busiest environment. I've got a tab with all the compute. We can have it stopped automatically: you can set the limit for when you want it stopped. I think I set it to 10 minutes; I might bump it to an hour for this video so we don't get kicked out of the environment. We keep the cluster running because the cluster has a lot of the information on how the queries ran, and that's what we want to look around in. So let's go back to marimo.
So we define a compute context. You can effectively define what you want. I just recently found that this is a nice combo to work with, in terms of having enough machines to optimize network bandwidth, because the smaller machines get relatively more bandwidth and they have some bursting availability compared to the one chunky machine type.
>> So >> yeah, you pay more for the CPU than for the bandwidth, but you do need a little bit of bandwidth because these machines have to talk to one another while you're doing >> Exactly. But you also don't want to use, like, 500 nano machines, because then they're only talking to each other because nothing fits in memory. So there's a bit of an optimal balance in there, and that's somewhere around here, roughly.
>> Um >> as you say in Holland, natte vinger, but it works.
>> Natte vinger.
>> Yeah. Wet-finger work.
>> And so I defined this as a function, so we can start calling it.
So this returns a Polars lazy frame, so you can also do collect. It's all the same.
Let's see.
I think >> the main thing that we're doing now though is we define a function and we're making the context. We're not actually running the job in that cell, are we?
>> Nope. No, that's at the bottom. But I think the only thing it's trying to fetch in the background is, I think marimo tries to see if it can collect the schemas of these S3 files. But locally I'm not logged into AWS, so I think it just throws some errors about Narwhals not being able to fetch some things.
>> Yeah, that's a known one. We're aware of that one. Yeah.
>> Yeah. And I think if I were to do an aws sso login, then it can fetch these files and you can get some metadata on what you're working with.
>> Uh yeah. So this is I think purely an example. I don't think we're going to use the ho frame.
>> And then we've got this one, and it requires arguments, because it's a function. It's a Python function, right?
So this takes a couple of these lazy frames that we scanned here. We throw those in there, and then, because we're running this remotely, we just do remote with the context we want to run this on, and then we execute it. >> Right. But execute is not the same as collecting, is it?
>> So execute is effectively the remote collect. >> Right. So it's collected, but on the cluster, not on your machine just yet.
So >> and then on the cluster it writes the result to a temporary S3 bucket. You can also do a sink_parquet to an S3 location. If you do execute, it's an intermediate bucket that gets cleaned after a couple of hours; it has a lifecycle on it.
>> But you can also do sink_parquet to your favorite location in S3.
>> No, but I remember the first time I was playing around I was kind of, oh, where's the collect, where's the collect? But then you also have to think: okay, this is running on the cluster, not on your local machine. So that's a bit of a different world.
>> Yeah, exactly, when you're playing. And that's something we'll have to explore, what works nicely with the API, because you could also do something like collect with engine distributed, which has been recommended by some people. We'll just see where we end up by trial and error.
>> I mean, the first time I tried it, it was like, oh, it's not the way that Polars normally works. But the more I started thinking about it, the more I also started thinking: well, you also don't want to collect to your machine by accident, right?
>> Yeah exactly measure.
>> Right.
>> So as you can see this is our cluster.
So we've got a set of nodes. They're already starting to work. You can see the network utilization pop up, and the first couple of phases are already done. So this is the query as the logical plan. That's kind of like the graph we saw earlier, except fancy. >> It's very fancy. We should get some of that UI in marimo, but anyway. >> We even have the minimap.
>> Yeah.
>> Very nice.
>> Oh very cool.
>> And so the logical plan again is: what are we going to do? That's pretty much my query translated to a graph with optimizations on it. I think you can click a node and get some info. So we see the projection: we only get these two columns from this file,
>> even though all of these are available.
And this file apparently doesn't have table statistics that we can use. All kinds of stuff you can get from there.
>> The stage graph. Because we're working on multiple workers, sometimes we have to shuffle data around. For example, if you do a group by but you have 10 machines that may have different chunks of the data on them, then for the group by you need to know for sure that you have all the data for that group key, right? You can't do an aggregation on all of a group's data if you don't have it all. So there are some operations that create a shuffle boundary, as it's called, and this way you hop between stages. Between stages, data can be shuffled between machines.
It's already done.
>> Uh we could just rerun them fast.
>> Yeah.
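The shuffle boundary described above can be illustrated with a toy model in plain Python. This is not how Polars Cloud is implemented; `shuffle_by_key` is a hypothetical helper, workers are just lists, and `key % n_workers` stands in for a real hash partitioner.

```python
def shuffle_by_key(worker_rows, n_workers):
    """Re-distribute (key, value) rows so each key ends up on exactly one worker."""
    shuffled = [[] for _ in range(n_workers)]
    for rows in worker_rows:
        for key, value in rows:
            shuffled[key % n_workers].append((key, value))
    return shuffled


def sum_by_key(rows):
    """Local group-by sum; only correct per key if the worker holds all its rows."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


# Stage 1: each worker holds an arbitrary chunk, so keys are spread around.
before = [
    [(1, 10), (2, 5), (3, 1)],
    [(1, 7), (3, 2)],
    [(2, 4), (4, 8)],
]

# Shuffle boundary: move rows so that every key lives on a single worker.
after = shuffle_by_key(before, n_workers=3)

# Stage 2: each worker can now aggregate its keys independently.
per_worker = [sum_by_key(rows) for rows in after]
print(per_worker)
```

Every row may cross the network at the boundary, which is why the number and size of shuffles matters so much for a distributed query.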
Um, so one of the things we saw here is, I think, on the order key we get the length of the supplier key, and this gets put into this temporary column, because this is before the data is shuffled, right? >> To reduce how much data we're shuffling, we already do a pre-aggregation. We just get the length of the supplier key and put it in a temporary column, and then after the data is shuffled we sum those together. So we effectively have the len, but we can send less data over the line because we already pre-aggregated some of it. There are a lot of these smart tricks you can do to prevent having to shuffle data, because network transfers can be quite costly. That's what we want to reduce: the amount of bytes that we have to shuffle across these machines. On top of that, we also try to keep data on the worker where it already is, so we don't have to shuffle that out. So there's a lot of room for things we can do there too. Um, yeah. >> But the thing has run, and this is also one of those things where, if you feel something weird is happening, this is a thing you could watch live. You log into the Polars side of things and you've got an overview of what's happening live.
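The pre-aggregation trick (count locally, shuffle the counts, then sum them) can be sketched in plain Python. This is a toy model with invented data and hypothetical helper names, not the engine's actual code.

```python
from collections import Counter


def partial_counts(rows):
    """Before the shuffle, per worker: count rows per order key."""
    return Counter(order_key for order_key, _supplier_key in rows)


def merge_counts(partials):
    """After the shuffle: sum the partial counts per order key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# (order_key, supplier_key) rows as they happen to sit on two workers.
workers = [
    [(1, "s1"), (1, "s2"), (2, "s1")],
    [(1, "s3"), (2, "s2"), (2, "s3"), (3, "s1")],
]

partials = [partial_counts(rows) for rows in workers]
final = merge_counts(partials)

rows_without_preagg = sum(len(rows) for rows in workers)   # raw rows to move
rows_with_preagg = sum(len(p) for p in partials)           # one row per key per worker

print(dict(final), rows_without_preagg, rows_with_preagg)
```

A count can be split this way because a sum of partial counts equals the full count. The saving grows with the number of rows per key on each worker.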
>> Correct. Yeah. So, I can show it again a bit later, but for now, let's just walk through this.
But you can see the data flow through here. If it's white, it's scheduled and waiting. If it's blue, it's running. And if it's green, it's completed.
So this whole phase is kind of like a subquery, right? And this too has a physical plan, like we talked about before: how it's going to be executed on this machine. So the query as a whole is being chopped up into these stages. Every stage has its own physical plan, and all the workers get this physical plan plus the partitions of data they have to process. And the cool thing is that we don't have a special engine running here. We just give this to the streaming engine that also runs single node. It's the exact same engine, just tweaked so we can pass this distributed information to it too.
And this helps us a lot, because if we make any improvements to the streaming engine, there's nothing we have to port: it's already in there.
>> This reminds me a lot of... do you remember Apache Flink? It's a little less in vogue now, but it was a pretty big thing a couple of years ago.
>> By name, yeah.
>> So Spark did micro-batching. Flink actually did streaming. And the cool phrase they always had was: if you've solved it in streaming, you have also solved it in batch.
>> Yeah.
>> Uh which is effectively what I'm seeing here.
>> Yeah. So this is one of these physical plans, but with a kind of live view of how far along the execution is.
>> There's all kinds of stuff you can dive into. You have a lot of detailed information here, and this can be a bit overwhelming, right? So, the things I usually look at: we currently have two displays of how the query was executed. Everything's green, so it's already done, but you can see that we've got these indicators of how much time was spent in a node, and the blue is IO time. So we see that this scan took most of the IO time; it's probably the biggest file being read across the network. And we can also see which nodes in this graph took up the most compute time. Here, this group by took 60% of the time, which for a group by kind of makes sense.
>> Yeah.
>> And you can get all kinds of information. You can see the morsel size, which by default is 100,000 rows. You can tweak that too.
It's also something that we try to tune to get the optimal performance, because you don't want morsels too large: then they don't fit properly in the CPU cache. But if they're very small, then you're moving a lot of stuff around even though there's barely anything in them.
>> You don't really use the cache anymore because you're moving things around the whole time. Yeah.
>> All kinds of information, like how many rows are going in, how many rows are going out, and how much total time you spend in this node. So this gives you a lot of metrics to see into the engine, which is something you don't really have if you use the single-node version.
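To make the morsel-size trade-off above concrete, here is a small plain-Python sketch that cuts a row count into fixed-size morsels. The 100,000-row default comes from the conversation; `iter_morsels` is a hypothetical helper and says nothing about how the streaming engine actually schedules work.

```python
def iter_morsels(num_rows, morsel_size=100_000):
    """Yield (start, stop) row ranges covering num_rows in fixed-size morsels."""
    if morsel_size <= 0:
        raise ValueError("morsel_size must be positive")
    for start in range(0, num_rows, morsel_size):
        yield start, min(start + morsel_size, num_rows)


# Default size: two full morsels plus a final partial one.
morsels = list(iter_morsels(250_000))

# A very small morsel size covers the same rows with many more pieces,
# so the per-morsel bookkeeping starts to dominate the useful work.
tiny = list(iter_morsels(250_000, morsel_size=10))
```

The opposite extreme, one morsel holding all rows, avoids the bookkeeping but no longer fits in the CPU cache, which is the tension the speakers describe.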
>> I'm going to ask a weird question, and this question is probably going to end up being a feature request.
Um, so let's say that I'm Blizzard and I have World of Warcraft. I want to find bots, and one thing I want to do is sessionize some data. Knowing the number of rows that go into the session step and out of it, I mean, that's cool, but maybe I also want to know the number of players, and that's a domain knowledge kind of thing.
>> Is there any way for me to tag that I want aggregations of some specific column to make sense of what's happening in here? Because I can imagine sometimes rows are fine to tell you what's happening in your query, but sometimes you want to add just a little more context and say something like, oh, the number of players might be important.
>> This is a feature request. anyone was going to go there.
>> Mhm. Yeah. No, I think, as you see up here, this is more the group-by-specific information. We're still working on making more information on these operations available. Internally, there's a lot of information available in the engine to display. We just need to dispatch that properly to the interface and have the front-enders create elements for it. So I wouldn't be surprised if you can get more information out of these nodes in that way and just pop it up in here.
>> Okay. But that's a UI thing that might actually be added later.
>> It's just, I'm assuming, different priorities; every week's different.
>> Yeah. So there's plenty of things to do.
Unfortunately we can't do everything. We would love to but that's why we have to pick and choose.
>> Sure.
>> So I think one of the other things that's quite important to us now is having a self-serve onboarding flow, so people can actually start using the product.
>> That's a bit higher up there than a more tweaked information display.
A person can make an account and start paying, versus Vincent wants to do sessionization for a World of Warcraft demo. Yeah.
>> Exactly.
>> Makes sense. Makes sense.
>> So that's effectively what we have on the query planning side.
>> Um, I think... yeah. So it always ends in a sink. In this case, I think it's going to this super well-defined temporary location.
>> Yeah. Yeah. I mean, it's clearly temporary.
>> Yeah. Exactly. And I think it's also something that we currently... wait, is it... it's currently returned in this very provisional widget, where you can see, like, this is the output location. Okay, interesting.
>> Okay, no, but okay.
>> With a small preview. This is something that we'll tweak over time as well, but it at least gives you a little bit of information back on the client. Yeah, I can also imagine something you might add there is a link to this interface over here. And I remember showing marimo to Richie the first time and he was like,
>> "Wait, there's widgets."
>> Exactly.
>> So, there's definitely some room for improvement on that side as well.
It's just most of the the development attention goes to the the first.
>> Yeah, that makes sense.
>> We've got some information on how the nodes are doing. Right now they're just idling, so it's not very interesting, but this can help a lot in terms of seeing: am I memory bottlenecked, am I CPU bottlenecked? You don't want to see memory sit at 10%, because then you can just easily scale down the...
>> Yeah, if your CPU's at 100 but your memory is at 10, then something is off.
>> Yeah, so generally we always want to be compute bound, because that means nothing is restricting how much we can process. But in practice you'll also see some network-bound issues. And now, with streaming on multiple machines, you generally see that it's better to pick the compute-optimized instances than the memory-optimized or general-purpose ones, because otherwise you just get too much RAM. Mhm.
>> And it helps to take smaller machines, so you can use this bursting for the network to kind of reduce the bottleneck here. And then we just have the compute, the vCPUs, to work with. And the thing with the fancy alpha logo is our very own...
>> Basic notepad, marimo-type thing. So this is mostly... we saw some people that just want to fire off some quick queries and get done with it and don't want to have to...
>> Don't need charts or anything like that.
Exactly. So it allows you to kick off a cluster quickly, fire off a query in here, and then get some data on S3 that you want to work with later, for example. It's also something that our devs are working with, so it was practical to have this available. If you want to get a bit more serious, then of course...
>> So one thing. Okay, you're talking to someone who works over at marimo. Um, we added a feature; I think last time we spoke we might not have added this feature yet. Could you go back to marimo?
Yeah.
>> Um, so if you go all the way back to the top, I think normally there are icons on the left-hand side. Are they there?
>> Oh, I'm zoomed out, I think.
>> Oh, that's... oh, there you go. Um, yeah.
So, if you go to the file browser all the way at the top. Yep. And then remote storage, above files.
>> No. So, there's all the way.
>> Yes. So what you can do, as the name implies, is actually add remote storage. And the whole idea here is that we might have a nice little link for you to actually... yeah, we work... you might have heard of CoreWeave. And I can definitely imagine that features like this might also help out a whole lot. You can imagine that at some point a user will have some sort of key that tells us: okay, this user can access this bucket, and they have access to these data sets, etc. And that might also be a nice way to link to some of the Polars tools.
>> Nice. Yeah.
>> You do get into this whole space of, okay, how do we deal with access tokens and secrets and keys, and every organization has six different ways of doing it.
>> Um, so this is like version one now. But I think the main place you read from is usually S3. I can imagine there are big Postgres instances out there, but that's probably not the input source for what you folks are doing on the cloud side, I think.
>> Yeah. Yeah. And so we mostly work with S3 because that's where the...
>> The big files are.
>> Yeah. So the Polars Cloud that we started out with was only on AWS, and through the marketplace you can install it as a CloudFormation stack into your cloud environment.
And the whole idea was: you have it in your environment, and your data stays there. There's only one Polars element that sits there and practically waits until it's called, and then it creates and provisions these machines, runs a query, and tears them down again. So you only pay when you have a query to run. And we've now also started noticing that there's a lot of attention for more of an on-premise type of setup, where we're able to roll out on a Kubernetes cluster. So if you run these machines yourselves, even on the AWS-managed clusters, you can also roll it out. So instead of us managing the hardware in your environment, you just have the hardware and roll Polars Cloud out as software on there. That's what our focus is on right now.
>> But you do need to have Kubernetes for that at this point.
>> Okay.
>> Yeah. We're also thinking about bare metal, but we're not there yet. It's on the roadmap.
>> Here comes Vincent with his feature request: but I've got a home lab.
>> Kick off the Kubernetes cluster and we can go.
>> Yeah, exactly. I need to put Kubernetes on my home lab.
>> Yeah, >> that makes sense. I mean, makes sense.
Like there's a bunch of organizations who run on their own hardware. I know a bunch of them here in the Netherlands.
And if they do, like nine times out of 10, it's Kubernetes. So, it makes complete sense.
>> Yeah, exactly. So, that's what we saw as well, and that's where we're focusing right now. We've been working with a couple of design partners that are already using this. We've built a lot of things with them and found a lot of interesting use cases. I think one of the use cases we saw was a party that stored binary information, stored images, in their data frame.
>> Yeah, you told me about that one.
>> They had massive data frames because of these images, and for the rest the joins were very manageable. It's just that all this image information, this binary information, was attached to the data frame, and when joining it was also sent across the network. So we had to optimize for...
>> It was proper binary, right? Not Base64-encoded images or anything. No, proper binary, in the row itself, and these images are big.
>> Well, large enough to get into the terabytes.
>> Yeah, you got all these network shuffles, and machines flooding and running out of memory, because all this information, which wasn't necessary for the join, was still part of the data frame being communicated. So I think the best way to see how your tool should grow is to get it out there and work together with companies that have cool problems.
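The conversation does not say how the join was optimized, so the following is only one possible approach, sketched in plain Python: keep the large binary column out of the join and re-attach it afterwards, only for rows that survive. All names here (`split_payload`, `hash_join`, `image_id`, `label_id`) are hypothetical.

```python
def split_payload(rows, id_col, payload_col):
    """Separate a wide payload column from the narrow columns needed to join."""
    narrow = [{k: v for k, v in row.items() if k != payload_col} for row in rows]
    payload = {row[id_col]: row[payload_col] for row in rows}
    return narrow, payload


def hash_join(left, right, on):
    """Inner join of two lists of dict rows on a single key column."""
    index = {}
    for r in right:
        index.setdefault(r[on], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[on], [])]


images = [
    {"image_id": 1, "label_id": 10, "image": b"\x00" * 1_000_000},
    {"image_id": 2, "label_id": 20, "image": b"\x01" * 1_000_000},
]
labels = [{"label_id": 10, "label": "cat"}, {"label_id": 30, "label": "dog"}]

# Only the narrow rows take part in the join (and, in a distributed
# setting, in the shuffle); the megabyte-sized blobs stay where they are.
narrow, payload = split_payload(images, "image_id", "image")
joined = hash_join(narrow, labels, on="label_id")

# Re-attach the binary column only for rows that matched.
result = [{**row, "image": payload[row["image_id"]]} for row in joined]
```

In this toy input only one of the two images matches a label, so only one blob is ever attached to the output.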
>> Yeah. Because you can't make some of this stuff up.
>> No. No. And we also have a whole roadmap of different kinds of optimizations that we can still do, and this helps us prioritize by seeing what will have the most effect in the real world.
>> Yeah. And I can also imagine you're going to get requests from people who want to do fancy things with embeddings, and people who say, no, but I want time series. Mountains and mountains of work.
>> Yeah. So we've also talked with people that work with matrices and sparse matrices, and, well, we're a data frame library. At one point we have to choose what kind of functionality we invest in and what goes into the binary, because we also want to keep that slim enough. That means we choose to stick to core data frame functionality, and if you want to go very deep in, for example, statistics or matrix calculations, there are usually other packages that already do that better. We do give room for building plugins that allow you...
>> A pretty elaborate plugin ecosystem at this point.
>> Yeah. Yeah. So especially for statistics there are already a couple of plugins out there that are sometimes even written natively in Rust, and when you load a plugin, it runs on the Polars engine, so you don't lose any performance. It's all Rust. You can also extend the Rust engine with Python logic, but then it has to do these round trips from Rust to Python with a...
>> And even if you have something like Numba, it's still all memory, like the... I had this one package the other day,
>> something that can help me generate random numbers, and it just works with a lazy frame. Amazing.
>> We can now do simulations. But I can also appreciate that it's not something you want to have in Polars core either.
>> Yeah, exactly.
>> Because who uses a data frame library to do simulations?
Well, not everyone.
>> Exactly.
>> And that's what we built the plugin space for.
>> Cool.
>> So, really cool demo. Um, we should stop at some point because we're already a little bit over. But there is always a question I want to ask in this case, because, you know, you're a cloud product and you're also exploring a little bit what marimo could do. So one question I would have is: if you could dream a little about a cool feature for us data-frame-minded folk, is there anything that you think is not in marimo quite yet but maybe should be? We have a cool data frame widget, we've got a few things. I'm just wondering: if you could dream and tell Vincent what to do, what would you tell Vincent to do?
Um, well, it's been a while since I've worked as a data scientist properly, just diving into marimo and these notebooks. I've seen some of the features appear where it's very easy to display the data set that you're working with and make slices in the data to keep working on further. I think that's already very useful, to get these visual ways of slicing your data set: this is the cluster of data I want to work on now, and you can just select it by clicking and dragging a box. That's super chill, and that's something that changes the way of working with this tool. So I think it already goes quite far in that direction. I don't really have something that pops up quickly.
>> Okay. No. So, I always like to ask these sorts of questions because, and we talked about this in the beginning, it does feel like there's a new iteration of, quote unquote, the data stack. And the awkward thing is that people have to sort of get used to the fact that the old way of working is not going to be the new way of working.
But it also means that sometimes you have to give yourself permission to be creatively free, if that makes sense, because you're still stuck in the old way of working.
>> Yeah.
>> Yeah, that makes sense.
>> So, that's why I always like to end with: if you could dream a little, what would you do? But I can also definitely imagine, because you're a little more on the product side of things, that the first thing you worry about is: okay, how can we handle all these different queries? And that's the main thing that's top of mind at the moment.
>> Yeah. Yeah. For me, it's mostly... we've got a lot of these new features coming out, and it's like, how can I display this in a nice way so it's visible? Because it's just a lot of engine stuff,
>> and oftentimes stuff is just faster.
But it's always fun to see how I can visualize a little more about what that actually means, or how you should change your code to make use of it. And for that, the basic functionality of marimo already covers it.
>> Okay. But one thing I can imagine is something that would make it easier or nicer to show a really cool query plan, like the little DAG thing, where maybe you could zoom in and out and group things. Something along those lines could still be a cool thing to explore, but that's the main thing.
>> Yeah.
>> Cool.
Nice. Um, yeah, pretty soon we're going to go backstage and we can talk Dutch again. But until then, thanks for showing up.
Again, Godspeed with everything you're doing, because Polars is a cool product. We love using it over on the marimo side of things as well. So yeah, we'll definitely be in touch. If people still have questions, I guess they can find you on GitHub, and you have a pretty active LinkedIn following, I suppose, with Polars.
>> Yeah, so the sources where we're mostly active: we give a lot of updates on LinkedIn. That's where we see the main audience being attached to us, and that's where we give the release highlights, new features, or blog posts. We have a Polars in Aggregate that we do quarterly, which keeps everyone up to date, because not everyone wants to dive into the release notes every time.
>> It's a mailing list people can sign up for?
>> Uh, yeah, that too. But that's also something we post on LinkedIn, with a blog post on the website. And we have a Discord that's very active; especially if you use Polars, I would recommend joining that one. There are a lot of very smart people in that Discord who know way more about how to use Polars in wacky ways than I do. And it's always nice to hang out there to see what's going on. A lot of people come there with showcases, benchmarks, and new plugins, for example.
That's usually where we see the most interaction.
>> Cool. Well, with that said, what I'm going to do now is hit stop on the live stream. Thanks everyone for tuning in as well. For people who want to tune in later today, I will be interviewing Hamel Husain. We're going to talk a little bit about notebooks in general and LLMs and AI and how all that stuff works. But thanks for tuning in to this one, and I'll see you all later. Once again, thank you as well.
Bye.
>> Okay.
best for nes. Uh