Pruning in Snowflake: Working Smarter, Not Harder

Abstract

Modern cloud-based data analytics systems must efficiently process petabytes of data residing on cloud storage. A key optimization technique in state-of-the-art systems like Snowflake is partition pruning—skipping chunks of data that do not contain relevant information for computing query results.

While partition pruning based on query predicates is a well-established technique, we demonstrate that its scope can be extended to LIMIT, top-k, and JOIN operations, dramatically expanding the opportunities for pruning across diverse query types.

Our analysis of Snowflake's production workloads reveals that real-world analytical queries exhibit much higher selectivity than commonly assumed, yielding effective partition pruning. We show that we can harness high selectivity by utilizing only min/max metadata available in modern data analytics systems and data lake formats like Apache Iceberg, reducing the number of processed micro-partitions by 99.4%.

Highlights

We presented four data pruning techniques that are used in the Snowflake data platform and evaluated them based on their pruning effectiveness on real-world customer workloads. The pruning flow is as follows:

Filter Pruning: All queries containing a WHERE predicate are subject to filter pruning, which removes data partitions by identifying the ones that do not contain any rows that would match the filter condition in the WHERE clause.
LIMIT Pruning: If the query is a simple LIMIT query without an ORDER BY, the query engine tries to reduce the number of partitions to the minimally necessary set of partitions such that the query result can be computed. This works for queries with and without WHERE clause.
Join Pruning: While executing a join, the probe side of a join is subject to join pruning, i.e., removing partitions from the probe side that do not contain join partners based on the collected values from the build side.
Top-K Pruning: Lastly, if a query is a top-k query, i.e., it contains an ORDER BY ... LIMIT, a specialized top-k pruning technique attempts to reduce the number of processed partitions by pushing runtime information about the current top-k heap down to the table scan.

While the first two techniques can be applied during query optimization, the latter two techniques require runtime information about the current state of the query execution.

The pruning flow can be visualized as follows:

Pruning flow in Snowflake across all query types. 58.7% of all queries have partitions pruned by filter pruning. 0.2% of all queries are first subject to filter pruning, then LIMIT-pruning. There is another 0.0002% of queries that have no partitions pruned by filters, but only have partitions pruned by LIMIT-pruning. Overall, 0.2% of all queries are subject to LIMIT-pruning. Let's go to join pruning now. Overall, 10.7% of queries have partitions pruned by join-pruning. 4.0% of all queries have partitions pruned by join pruning but have neither filter pruning nor LIMIT-pruning applied. 0.0001% of queries saw successful LIMIT-pruning before having partitions pruned by join-pruning. And 6.7% of queries had partitions pruned by filter pruning before getting at least one partition pruned by join-pruning. Lastly, 0.1% of all queries across all query types have at least one partition pruned by top-k pruning. 0.03% of queries only saw successful topk-pruning without having any other partitions pruned by other techniques. 0.007% had at least one partition pruned by join-pruning before successful top-k pruning. And 0.08% of all queries first had partitions pruned by filter pruning before top-k pruning. — Pruning Flow in the Snowflake Data Platform across all Query Types

Note that the figure contains distributions across all query types, including DML queries. The distribution of SELECT-only queries might be different as certain operators might be seen more often—e.g., DML queries tend to not contain LIMIT clauses and might have fewer joins.

As in interesting side-remark, we also objectified that it makes sense to optimize for small LIMIT k by looking at the relative distribution of of k across production queries:

Follow-Up Work and Scientific Discourse

No follow-up work has been published yet. Please reach out to the main author of this paper if you think your work should be included here.

Pruning in Snowflake: Working Smarter, Not Harder

Abstract

Highlights

Related Links

Follow-Up Work and Scientific Discourse

Cite this Paper