Optmizing Query Performance with Clustering Keys in Snowflake

This blog explores how Clustering Keys Data Pruning, and the Search Optimization Service (SOS) enhance query efficiency in Snowflake. It explains how clustering keys physically organize data into micro-partitions enabling faster queries by reducing unnecessary scans. Data pruning leverages metadata to skip irrelevant partitions further improving performance.

Akhil Gurrapu Snowflake Data Analyst Data Engineering DevOps

Dec 21, 2024

Clustering Keys in Snowflake

Clustering keys are a way to organize data within tables to optimize query performance. They determine how data is physically stored in micro-partitions by grouping similar values together.

What Are Clustering Keys?

Clustering keys are specific columns or expressions defined on a table that dictate how Snowflake organizes data into micro-partitions. By clustering, Snowflake ensures that rows with similar values in the clustering key columns are stored together, enabling efficient query execution.

How Clustering Works: Visualizing Micro-Partitions

Without Clustering Keys (Default Behavior)

When no clustering key is defined, data is stored in the order it is loaded. This can lead to scattered organization, making queries less efficient.

Example: Table with Random Data Distribution

Micro-Partition 1: [Jan, Mar, Dec, Aug]
Micro-Partition 2: [Feb, Sep, Nov, Apr]
Micro-Partition 3: [May, Oct, Jul, Jun]

Querying for data from “Jan” requires scanning multiple micro-partitions because the data is not grouped together.

With Clustering Keys

When a clustering key is defined (e.g., CLUSTER BY (date)), Snowflake organizes the data so that rows with similar clustering key values are stored together.

Example: Table Clustered by Date

Micro-Partition 1: [Jan, Feb, Mar]
Micro-Partition 2: [Apr, May, Jun]
Micro-Partition 3: [Jul, Aug, Sep]

Now, querying for “Jan” only scans one micro-partition instead of multiple, significantly improving performance.

Why Use Clustering Keys?

Clustering keys improve query performance by:

Data Pruning: Snowflake can skip irrelevant micro-partitions during queries (e.g., filtering by date).
Reduced I/O: Only relevant partitions are read, minimizing disk scans.
Better Compression: Grouped data often leads to higher compression ratios.
Optimized Query Execution: Queries on clustered columns execute faster due to co-located data.

Example of Defining a Clustering Key

Create a Table with Clustering Key

sql

CREATE TABLE sales_data (
    transaction_date DATE,
    region_id INT,
    amount DECIMAL
) CLUSTER BY (transaction_date);

Alter an Existing Table to Add Clustering Key

sql

ALTER TABLE sales_data CLUSTER BY (transaction_date);

Query Performance Comparison

Before Clustering

Query:

sql

SELECT * FROM sales_data WHERE transaction_date = '2025-01-01';

Without clustering, Snowflake scans all micro-partitions because the data is scattered.
Query time: High (e.g., scans hundreds of partitions).

After Clustering

Query:

sql

SELECT * FROM sales_data WHERE transaction_date = '2025-01-01';

With clustering by transaction_date, Snowflake prunes irrelevant partitions and only scans those containing “2025-01-01.”
Query time: Low (e.g., scans only a few partitions).

Best Practices for Clustering Keys

Choose Columns Frequently Used in Filters: For example, date or region_id if these are common in WHERE clauses.
Limit the Number of Keys: Use 3–4 columns at most to avoid high maintenance costs.
Order Columns by Cardinality: Start with low-cardinality columns (e.g., region) and then higher-cardinality ones (e.g., date).
Avoid High Cardinality Columns Directly: For example, instead of using raw timestamps (2025-01-25 10:00:00), truncate them to dates (2025-01-25).

When to Use Clustering Keys

Clustering keys are most effective when:

The table is large (multi-terabyte scale).
Queries frequently filter or join on specific columns.
The default organization of data (natural clustering) does not align with query patterns.

By defining appropriate clustering keys based on query patterns and table size, you can significantly reduce query times and optimize resource usage in Snowflake.

Data Pruning in Snowflake

Data pruning is Snowflake’s way of skipping micro-partitions that have no relevant data for a given query. Each micro-partition in Snowflake stores metadata about the range of values it contains. When a query includes filters on clustering keys or other columns, Snowflake checks this metadata to determine which micro-partitions likely hold the required data and reads only those.

How Data Pruning Works

Micro-Partition Metadata

Each micro-partition has min/max statistics for the columns it contains. If a filter value lies outside the min/max range of a micro-partition, Snowflake can safely skip scanning it.

Clustering Keys

By defining clustering keys, you physically group related data together. This organization improves pruning because micro-partitions become more uniform, making it easier for Snowflake to discard irrelevant segments.

Reduced I/O

Data pruning significantly reduces the volume of data read from disk. Fewer partitions scanned translates to faster queries and lower compute costs

Search Optimization Service (SOS) in Snowflake

The Search Optimization Service (SOS) in Snowflake is a feature designed to improve the performance of highly selective queries, such as point lookups or queries with extensive filtering on large tables. It achieves this by creating a specialized data structure called a search access path, which helps Snowflake quickly locate relevant micro-partitions while skipping irrelevant ones.

How Search Optimization Works

1. Search Access Path Creation

When enabled, Snowflake builds a search access path for the table in the background.
This path contains metadata about which values are present in each micro-partition.
Instead of scanning all micro-partitions, Snowflake uses this metadata to prune irrelevant partitions.

2. Bloom Filters

SOS uses Bloom filters, which help Snowflake identify where data isn’t located, further reducing unnecessary scans.
This mechanism is particularly effective for queries that return only a small subset of rows.

3. Automatic Updates

The search access path is maintained automatically whenever the table is updated (e.g., new data loads or DML operations).

When to Use Search Optimization

SOS is most beneficial for:

Point Lookup Queries: Queries that filter on specific values (e.g., WHERE id = 123).
Highly Selective Filters: Queries that retrieve only a small subset of rows from large tables.
Substring and Regular Expression Searches: Queries using LIKE, ILIKE, RLIKE, etc.
Semi-Structured Data: Queries on fields in VARIANT, OBJECT, or ARRAY columns with filters like ARRAY_CONTAINS or equality predicates.
Geospatial Queries: Queries using geospatial functions like ST_INTERSECTS or ST_CONTAINS.

Performance Example

Without Search Optimization

A query like:

sql

SELECT * FROM CUSTOMER WHERE C_CUSTOMER_SK = '4722123';

Scans all micro-partitions because Snowflake has no metadata to prune irrelevant ones.
Query time: 1.3 seconds (example).

With Search Optimization

After enabling SOS:

sql

ALTER TABLE CUSTOMER ADD SEARCH OPTIMIZATION;

Running the same query:

sql

SELECT * FROM CUSTOMER WHERE C_CUSTOMER_SK = '4722123';

Only relevant micro-partitions are scanned using the search access path.
Query time: Reduced significantly (e.g., from 1.3 seconds to milliseconds).

How to Enable Search Optimization

Enable SOS on a table:

sql

ALTER TABLE my_table ADD SEARCH OPTIMIZATION;

Estimate costs before enabling SOS:

sql

SELECT SYSTEM$ESTIMATE_SEARCH_OPTIMIZATION_COSTS('my_table');

Disable SOS if needed:

sql

ALTER TABLE my_table DROP SEARCH OPTIMIZATION;

Check enabled columns:

sql

DESCRIBE SEARCH OPTIMIZATION ON my_table;

Best Practices

Use SOS for tables with over 1,000 micro-partitions or large datasets.
Apply it only to queries with high selectivity (returning few rows).
Avoid enabling SOS on tables with frequent updates or deletes, as it increases maintenance costs.
Combine SOS with clustering keys for optimal performance when queries involve both selective lookups and range filters.

Comparison: Clustering vs Search Optimization

Feature	Clustering Keys	Search Optimization Service
Focus	Range-based queries (e.g., date ranges)	Point lookups and highly selective queries
Data Organization	Physically reorganizes data	Adds metadata without reorganizing data
Maintenance Overhead	Higher for frequent updates	Moderate; automatically maintained
Use Case	Range filters, joins	Equality filters, substring searches

By enabling SOS strategically, you can achieve significant performance gains for specific query patterns while avoiding unnecessary costs.