This blog explores how Clustering Keys Data Pruning, and the Search Optimization Service (SOS) enhance query efficiency in Snowflake. It explains how clustering keys physically organize data into micro-partitions enabling faster queries by reducing unnecessary scans. Data pruning leverages metadata to skip irrelevant partitions further improving performance.
Clustering keys are a way to organize data within tables to optimize query performance. They determine how data is physically stored in micro-partitions by grouping similar values together.
Clustering keys are specific columns or expressions defined on a table that dictate how Snowflake organizes data into micro-partitions. By clustering, Snowflake ensures that rows with similar values in the clustering key columns are stored together, enabling efficient query execution.
When no clustering key is defined, data is stored in the order it is loaded. This can lead to scattered organization, making queries less efficient.
Example: Table with Random Data Distribution
Micro-Partition 1: [Jan, Mar, Dec, Aug]
Micro-Partition 2: [Feb, Sep, Nov, Apr]
Micro-Partition 3: [May, Oct, Jul, Jun]
Querying for data from “Jan” requires scanning multiple micro-partitions because the data is not grouped together.
When a clustering key is defined (e.g., CLUSTER BY (date)
), Snowflake organizes the data so that rows with similar clustering key values are stored together.
Example: Table Clustered by Date
Micro-Partition 1: [Jan, Feb, Mar]
Micro-Partition 2: [Apr, May, Jun]
Micro-Partition 3: [Jul, Aug, Sep]
Now, querying for “Jan” only scans one micro-partition instead of multiple, significantly improving performance.
Clustering keys improve query performance by:
CREATE TABLE sales_data (
transaction_date DATE,
region_id INT,
amount DECIMAL
) CLUSTER BY (transaction_date);
ALTER TABLE sales_data CLUSTER BY (transaction_date);
Query:
SELECT * FROM sales_data WHERE transaction_date = '2025-01-01';
Query:
SELECT * FROM sales_data WHERE transaction_date = '2025-01-01';
transaction_date
, Snowflake prunes irrelevant partitions and only scans those containing “2025-01-01.”date
or region_id
if these are common in WHERE clauses.region
) and then higher-cardinality ones (e.g., date
).2025-01-25 10:00:00
), truncate them to dates (2025-01-25
).Clustering keys are most effective when:
By defining appropriate clustering keys based on query patterns and table size, you can significantly reduce query times and optimize resource usage in Snowflake.
Data pruning is Snowflake’s way of skipping micro-partitions that have no relevant data for a given query. Each micro-partition in Snowflake stores metadata about the range of values it contains. When a query includes filters on clustering keys or other columns, Snowflake checks this metadata to determine which micro-partitions likely hold the required data and reads only those.
The Search Optimization Service (SOS) in Snowflake is a feature designed to improve the performance of highly selective queries, such as point lookups or queries with extensive filtering on large tables. It achieves this by creating a specialized data structure called a search access path, which helps Snowflake quickly locate relevant micro-partitions while skipping irrelevant ones.
SOS is most beneficial for:
WHERE id = 123
).LIKE
, ILIKE
, RLIKE
, etc.VARIANT
, OBJECT
, or ARRAY
columns with filters like ARRAY_CONTAINS
or equality predicates.ST_INTERSECTS
or ST_CONTAINS
.A query like:
SELECT * FROM CUSTOMER WHERE C_CUSTOMER_SK = '4722123';
After enabling SOS:
ALTER TABLE CUSTOMER ADD SEARCH OPTIMIZATION;
SELECT * FROM CUSTOMER WHERE C_CUSTOMER_SK = '4722123';
Enable SOS on a table:
ALTER TABLE my_table ADD SEARCH OPTIMIZATION;
Estimate costs before enabling SOS:
SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('my_table');
Disable SOS if needed:
ALTER TABLE my_table DROP SEARCH OPTIMIZATION;
Check enabled columns:
DESCRIBE SEARCH OPTIMIZATION ON my_table;
Feature | Clustering Keys | Search Optimization Service |
---|---|---|
Focus | Range-based queries (e.g., date ranges) | Point lookups and highly selective queries |
Data Organization | Physically reorganizes data | Adds metadata without reorganizing data |
Maintenance Overhead | Higher for frequent updates | Moderate; automatically maintained |
Use Case | Range filters, joins | Equality filters, substring searches |
By enabling SOS strategically, you can achieve significant performance gains for specific query patterns while avoiding unnecessary costs.