This article explores the integration of dbt with Snowflake, a comprehensive guide to using dbt with Snowflake. It covers the main concepts of dbt, including models, materializations, sources, tests, documentation, macros, and packages. It also discusses Snowflake-specific features like zero-copy cloning, time travel, and tasks. Additionally, it covers advanced concepts like incremental models, snapshots, seeds, and custom materializations. Finally, it provides best practices for using dbt with Snowflake.
dbt (Data Build Tool) is an open-source tool that enables data analysts and engineers to transform data in their warehouses more effectively. It’s designed to bring software engineering practices to the analytics workflow.
Key features:
Models are the core building blocks in dbt. They are essentially SELECT statements that transform your data.
Example:
-- models/customers.sql
SELECT
id,
first_name,
last_name,
email
FROM raw_customers
WHERE status = 'active'
Materializations define how your models are created in the data warehouse. Common types:
Example:
{{ config(materialized='table') }}
SELECT * FROM ...
Sources define the raw data in your warehouse that your models build upon.
Example:
# models/sources.yml
version: 2
sources:
- name: raw_data
database: raw_db
tables:
- name: customers
- name: orders
dbt allows you to define tests for your models to ensure data quality.
Example:
# models/schema.yml
version: 2
models:
- name: customers
columns:
- name: id
tests:
- unique
- not_null
- name: email
tests:
- unique
You can add descriptions to your models and columns, which dbt uses to generate documentation.
Example:
# models/schema.yml
version: 2
models:
- name: customers
description: "Cleaned customer data"
columns:
- name: id
description: "The primary key for customers"
- name: email
description: "The customer's email address"
Macros are reusable pieces of SQL that you can use across your project.
Example:
{% macro clean_stg_customers() %}
SELECT
id,
LOWER(email) as email,
COALESCE(first_name, 'Unknown') as first_name,
COALESCE(last_name, 'Unknown') as last_name
FROM {{ source('raw_data', 'customers') }}
{% endmacro %}
To connect dbt to Snowflake, you need to provide the following details:
Snowflake’s zero-copy cloning can be leveraged in dbt for creating dev environments:
{{ config(materialized='table', post_hook=[
"CREATE OR REPLACE TABLE {{ target.schema }}.{{ this.name }}_clone CLONE {{ this.name }}"
]) }}
SELECT * FROM ...
You can use Snowflake’s time travel feature in your dbt models:
SELECT * FROM {{ source('raw_data', 'customers') }} AT(OFFSET => -60*60)
This selects data as it was 1 hour ago.
While dbt doesn’t directly manage Snowflake tasks, you can create them as post-hooks:
{{ config(post_hook=[
"CREATE OR REPLACE TASK update_{{ this.name }} WAREHOUSE = transform_wh SCHEDULE = '60 MINUTE' AS CALL refresh_{{ this.name }}()"
]) }}
SELECT * FROM ...
For large tables, you can specify clustering keys:
{{ config(
materialized='table',
cluster_by=['date', 'customer_id']
) }}
SELECT * FROM ...
You can add query tags to help with monitoring and optimization:
{{ config(query_tag='daily_customer_update') }}
SELECT * FROM ...
Incremental models allow you to update your models with only the new or updated data, which can significantly improve performance for large datasets.
Example:
{{ config(materialized='incremental', unique_key='id') }}
SELECT * FROM {{ source('raw_data', 'customers') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
Snapshots are used to track slowly changing dimensions over time.
Example:
{% snapshot customers_snapshot %}
{{
config(
target_database='analytics',
target_schema='snapshots',
unique_key='id',
strategy='timestamp',
updated_at='updated_at',
)
}}
SELECT * FROM {{ source('raw_data', 'customers') }}
{% endsnapshot %}
Seeds are CSV files that you can include in your dbt project and reference in your models.
Example:
# data/country_codes.csv
country_code,country_name
US,United States
CA,Canada
UK,United Kingdom
You can then reference this in your models:
SELECT
c.*,
cc.country_name
FROM {{ ref('customers') }} c
LEFT JOIN {{ ref('country_codes') }} cc ON c.country_code = cc.country_code
dbt allows you to use and create packages to share code between projects.
To use a package, add it to your packages.yml
:
packages:
- package: dbt-labs/dbt_utils
version: 0.8.0
Then you can use macros from the package in your models:
SELECT
{{ dbt_utils.generate_surrogate_key(['order_id', 'customer_id']) }} as surrogate_key,
*
FROM {{ ref('orders') }}
You can create custom materializations for specific use cases.
Example of a custom materialization that creates a table and a backup:
{% materialization table_with_backup, adapter='snowflake' %}
{%- set backup_table = this.incorporate(path=["backup"]) -%}
{%- set target_relation = this %}
{%- set existing_relation = load_relation(this) %}
{%- set backup_relation = load_relation(backup_table) %}
{{ run_hooks(pre_hooks) }}
-- Main table
{% call statement('main') -%}
CREATE OR REPLACE TABLE {{ target_relation }} AS (
{{ sql }}
);
{%- endcall %}
-- Backup table
{% call statement('backup') -%}
CREATE OR REPLACE TABLE {{ backup_relation }} CLONE {{ target_relation }};
{%- endcall %}
{{ run_hooks(post_hooks) }}
{{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
This guide covers the main concepts of dbt and its integration with Snowflake. As you work more with dbt, you’ll discover more advanced features and techniques to optimize your data transformations.