This video correctly shifts the focus from query engine hype to physical data layout as the true driver of performance. It provides a clear, actionable framework for mastering data pruning through partitioning, bucketing, and clustering.
Deep Dive
Prerequisite Knowledge
- No data available.
Where to go next
- No data available.
Deep Dive
Partitioning vs Bucketing vs Clustering: How to Make Queries 100x Faster
Added:Imagine getting a 100x performance increase just by changing a single aspect of how you organize your data.
So, you've got two tables with same rows, same columns, same exact data byte for byte. You're in the same query against both. One could come back in 2 seconds and cost pretty much nothing.
The other could take 3 minutes and scan 1.2 terabytes that I get billed for. And it's 100x difference in cost and speed with the same data and query. The only thing that's going to be different is how files are organized underneath the table. Um because in modern data systems, your query speed is decided a lot less by your engine and more by the physical layout of your data. Um so, you know, looking at kind of two query plans, right? On the left, we have where event date equals 2026 6 14. Engine does a full table scan. It reads every single file, all 1.2 terabytes, then it, you know, throws away 99% of it cuz only matches for that single day. On the right, same query, but the engine reads exactly one folder, which is, you know, maybe a day's worth of data. It ignores everything else and it never opens any other files. And that second behavior has a name. It's called pruning. Um the engine looking at your filter, it figures out what chunks of data it could possibly contain matching rows, and then it skips the rest that don't match those parameters. And the entire game of fast analytics is getting the engine to skip data instead of reading it. And then there are three levers that control how much the engine can skip that we're going to go through today: partitioning, bucketing, and clustering. And people will mix this up all the time and potentially use the wrong one, and you'll pay for it in slow dashboards and cloud bills. So, today I'm going to go through exactly what each one does, failure modes to avoid with each, and the decision guide, so you know what to reach for for your next data project. Um and let's start with the one that everyone probably knows, which is partitioning.
So, for partitioning, you essentially, you know, it's the simplest of the three. It's where most people will start. You know, you pick a column, you physically split your data into separate folders based on that column's value.
So, you know, in this example, say you partition a table by event date. On storage, you'll get one big pile of files and a directory structure like, you know, you have on the screen, right?
Event date and the parquet file for that event. So, every row physically lives in the folder for its date and the date isn't stored inside the files anymore.
It's encoded in the folder path as the partition key. Um and why that matters is that what that means is then when you run, you know, this query select count from events where event date is, you know, the 14th of June, the event's going to look at that where clause and then only look at folder names and say, "Hey, only need the event date, you know, 2026 0614."
Reads that one directory, never touches the others. Um so, it doesn't open them, doesn't download them, doesn't get billed for them. Um and that's partition pruning. So, on a system like BigQuery, Athena, Spark, or Hive, probably the biggest single win you can get cuz time series data, which is, you know, a lot of data, uh almost every query has a date range in. Um and then it also works for ranges, too, right? You could select, "Hey, when event date is between, you know, the 1st and the 7th of June." That'll prune it down now to seven folders out of how many, you know, hundreds or thousands you have. And the engine does the math on the folder names before it reads any, you know, any information. Um and one of the biggest mistakes here that you can make is, you know, partitions feels free. So, people will partition by everything, you know, by date, by hour, by country, by user ID. And then you can see partition kind of fall apart because of kind of two main traps. One, small files trap. Each partition is at least one file. So, if you partition by date, an hour, and country, you know, that's 365 by 24 by 100, nearly 900,000 partitions, right?
So, if partition only holds a few hundred rows, you now have a million tiny files and engines are really bad at tiny files cuz every file has a fixed overhead. You know, you have to open it, read its metadata, and close it. So, a million 50 KB files is dramatically slower than 1,000 50 MB files holding the same amount of data cuz your job will spend all of its time in file bookkeeping instead of actually reading data. Um and it's, you know, kind of referred to as death by 1,000 files.
Then, problem two, you have high cardinality file uh columns, right? So, cardinality means a number of distinct values. Data has low cardinality, a few thousand values ever. But, user ID might be 50 million distinct values. So, if you partition by user ID, you're not you're asking the engine to create 50 million folders. Again, it's not a table anymore. It's then just going to be a file system disaster.
So, make sure you're choosing the higher, you know, lowest cardinality columns as your kind of keys for partitioning. So, as a rule, you know, partition on low cardinality columns that show up in your where clauses most often. So, date classic, sometimes region, and then aim for partitions that are big enough to be efficient, you know, hundreds of megabytes to a gigabyte each, not kilobytes. And if a column has thousands or millions of distinct values, partitioning is not the right tool to use for that.
So, what do you do for those kind of high cardinality columns where you do need to filter and join on those all the time? That is where our second feature comes in, which is bucketing. So, bucketing solves the exact problem of partitioning can't, which is, you know, organizing a high cardinality column without blowing up into millions of folders. So, the trick is you don't make one folder per value. You decide up front on a fixed number of buckets, like 256, and then you assign every row to a bucket using a hash function. So, the mechanic is, you know, you take a value, run it through a hash, and then modulo the number of buckets. So, user ID equals 8675309 always ends in bucket 42. Every row for that user goes to bucket 42. And a different user might also land in that bucket because, you know, buckets will hold many users. But, the key thing is that all user all rows for a given user are guaranteed to be in the same bucket, and there's only ever 256 files, no matter whether you have a thousand users or 50 million. So, the physical layout is just a fixed set of files, you know, and that's kind of the whole point, right? So, there's no explosion of files with really large data sets.
And then the reason why you care and why you want to do this is there's two big wins with this approach. Joins first.
This is, you know, the kind of biggest killer feature. Normally, when you join two tables on user ID, the engine has to shuffle. It sends data across the network so matching keys end up on the same machine. And shuffle is the most expensive thing in distributed computing. Um, it's network and disk spill is something that'll make your Spark job sit at 99% for 20 minutes. Um, but if both tables are bucketed by user ID into the same number of buckets, then bucket 42 of table A only ever needs to join against bucket 42 of table B, and the matching key is already co-located, so the engine can do a bucket join, read bucket 42 from both sides, join them locally, and move on with no giant shuffle. And on large joins, this can turn into a 20, you know, turn a 20-minute job into a 2-minute one.
Um, and also, next benefit is aggregation second. Uh, if you group by user ID and the data is already bucketed by user ID, every user's rows are already gathered in one bucket, and the engine can aggregate each bucket independently without reshuffling everything first. Um, and just kind of a couple of things to note also with this approach is that bucketing does not give you pruning the same way partitioning does. So, if you filter where user ID equals X, a smart engine can compute the hash, and you know, jump to bucket 42.
Um, and that's bucket pruning, and some engines will do that, but a between range on the bucket column won't prune anything because hashing scatters nearby values across the buckets on purpose.
And bucketing is built for equality joins and grouping, not range scans.
Um, and then the number of buckets is also a commitment, right? You pick it when you write the table, match that number across tables when you what you want to join is what makes it the bucket join work. So, change your mind later, you're going to have to rewrite that data. So, pick a bucket count that keeps each bucket in that sweet spot, you know, like a few few hundred megabytes to a gig, that you can keep consistent across the tables you join together. Um, and so, and kind of, you know, summary there is you want to use partitioning for low cardinality filter columns, bucketing for high cardinality join and group by keys. Um, and then there's the third lever, which we'll go next, and is clustering and Z-ordering.
So, clustering is probably the most subtle approach. Um, it's probably the also the one that's getting the most traction these days because modern engines will lean on it hard. Um, you know, partitioning and bucketing decide which file a row goes into. Clustering decides the number of or the order of rows within inside the file. So, you sort the data so that related values physically sit next to each other on disk. Um, so why does ordering matter if the engine has to read the file anyway?
Because of metadata. Um, modern file formats, Parquet, ORC, and table formats, Delta, Iceberg, store the min and max value of each column for every chunk of data. A chunk might be a Parquet row group or the whole file's stats in the table manifest. So, when you filter where country equals Japan, engine looks at each chunk's min max for country before reading it. Chunk's range is Argentina to Brazil, Japan can't possibly be in there, so you skip it, you never read it. And this is data skipping, which is essentially pruning at the file block level instead of the folder level. And the catch to it is that it only works if the data is sorted. So, if country values are scattered randomly across every file, then every single chunk has a min of Argentina, max is Zimbabwe, and min max will tell you nothing cuz every chunk might contain Japan, so the engine reads all of them, and the metadata is useless. And then cluster, that is, you know, if you sort by country, now all the Japan rows are physically grouped into a handful of chunks, and every other chunk's range excludes Japan and gets skipped, which is the win. Um, so in BigQuery, this is literally a table option, cluster by country, device type, engine keeps the data sorted sorted for you. In Delta Lake and Iceberg, you have optimize Z-order by and sort order specs that will do the same job. Um, and then kind of a quick side on Z-ordering cuz it's clustering's clever cousin. Plain sorting works great for one column, but if you sort by country then device type, you get fantastic skipping on country and mediocre skipping on device type cuz it's only sorted within each country.
Z-ordering uses space-filling curve, a Z-shaped curve, to interweave multiple columns together so they're all clustered reasonably well at the same time. So, a query that filters on country, a different query that filters on device type, third that filters on both, will all get the same data skipping. Um, and you give up a little on any single column versus a pure sort, but you win across a set of filter columns. Um, and notice how this, you know, complements partitioning, right?
You partition by the data to prune at the folder level, and within each day's partition, you cluster or Z-order by country and device type to prune at the file block level. So, two layers of skipping stacked on top of each other, which is the combination I kind of want to land on. Um, and so, now you have all these different, you know, different you know, options, how do you choose the right one?
So, when it comes to actually using them, it's pretty simple. There's three levers. Um, you reach for partitioning when you got a low cardinality column that shows up in almost every where clause like most of the time will be date. It needs to be a manageable number of partitions, you know, hundreds or low thousands, and each partition should be hefty with hundreds of megs. Um, and then, you know, that's going to be date, maybe a coarse region or category. If the column has thousands of plus distinct values, don't partition on it. Then, the next lever is going to be bucketing. You have a high cardinality column that you repeatedly join or group by, you're going to want to use bucketing. Um, so you can colocate the keys, the engine skips that giant shuffle. Pick a fixed bucket count, keep it consistent across the tables you join, and remember it's equality, not ranges. And then finally, reach for clustering or Z-ordering when you have high cardinality columns you filter on, especially if it's several of them and you want the engine to skip file blocks using min-max stats. It's a lot lower maintenance than bucketing on modern engines. It's also what Big Query and Delta and Iceberg use, and it'll stack cleanly inside partitions. Um, and then finally, you know, this isn't either or, right? The best layouts you will use these in, you know, different strategies in conjunction with each other to achieve really good results. So, don't feel constrained that you only can use one or the other. Use them all when possible. Um, so, I hope you enjoyed this video. I hope you learned something. Hope you have a great rest of your day. Data Guy out.
Related Videos
This Machine Still Runs on Punch Cards
WaltersShortsChannel
6K views•2026-06-10
GitLab’s Manav Khurana: AI Agents, Orbit, and the Future of Coding
TechVoices-live
374 views•2026-06-10
I Made an Antivirus That Secretly Attacks Scammers
ScammerPayback
153K views•2026-06-13
Your Python loops are probably slower than they need to be. One simple decorator can instantly bo...
60SecondStack
236 views•2026-06-11
Leetcode Weekly Contest 506 | Life's boring these days
Pudeesht
2K views•2026-06-14
Programming in English
MattGodbolt
584 views•2026-06-14
I thought this feature would be easy to deploy... I was wrong.
dreamsofcode
815 views•2026-06-10
How to design a good function in Python
Indently
1K views•2026-06-13
Trending
This 80 year old corn is dangerous
NileBlue
1569K views•2026-06-10
Everyone around him is insane.
LeoinFrames-1
2406K views•2026-06-13
It does nothing, but men have worn it for 400 years. Behind the origin of the necktie
FineasJackson
1423K views•2026-06-12
Scientists Create Indestructible Medicine
DrBenMiles
628K views•2026-06-11











