Apache Iceberg is an open-source table format for data lakes. This guide covers core concepts, features such as ACID transactions and time travel, and implementation on platforms like Spark and Snowflake.
Apache Iceberg is an open-source table format that has revolutionized large-scale data management in data lakes. Initially developed at Netflix to handle massive datasets efficiently, it has grown into a robust Apache project widely adopted across industries. This comprehensive guide explores Iceberg’s core concepts, features, implementation across major platforms, and best practices for effective data management. Iceberg provides significant advantages over traditional approaches, offering ACID transactions, schema evolution, and time travel capabilities while integrating seamlessly with popular query engines like Spark, Snowflake, Trino, and Flink.
Apache Iceberg is a high-performance table format for huge analytical datasets that addresses many limitations of traditional data lake storage approaches. It was designed to improve reliability, performance, and manageability of data lakes while providing features typically associated with mature database systems.
At its core, Iceberg is not a database itself but rather a specification for how data should be organized, tracked, and managed in a data lake environment. It works with data stored in Parquet, ORC, or Avro formats and provides a layer of management on top of these files.
Unlike traditional data lake tables, Iceberg offers:

- ACID transactions with safe concurrent reads and writes
- Schema evolution without rewriting existing data
- Time travel across historical snapshots
- Partition evolution and hidden partitioning
- Rich file-level metadata that enables data skipping

Traditional data lakes often struggle with:

- Inconsistent results when readers and writers overlap
- Expensive, error-prone schema and partition changes
- No built-in way to audit or roll back to earlier versions of the data
- Query planning that slows down as the number of files grows

Iceberg resolves these pain points through a sophisticated table format that treats a table as a cohesive entity rather than a loose collection of files.
Iceberg provides full ACID (Atomicity, Consistency, Isolation, Durability) transaction support, ensuring data integrity even during concurrent operations. This prevents partial updates and guarantees that readers always see a consistent view of the data.
When multiple users make changes, Iceberg uses optimistic concurrency control to resolve conflicts, allowing operations to proceed in parallel without lengthy locks while still maintaining data consistency.
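To make this concrete, here is a minimal Spark SQL sketch (the table and source names are hypothetical, and the Iceberg SQL extensions must be enabled): the entire MERGE commits as one atomic snapshot, so concurrent readers see either the old table version or the new one, never a partially applied change.

-- Row-level upsert into an Iceberg table; the whole statement is a single atomic commit
MERGE INTO my_catalog.db.customers AS t
USING my_catalog.db.customer_updates AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (id, name, email) VALUES (s.id, s.name, s.email);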
One of Iceberg’s most powerful features is its ability to evolve table schemas without breaking existing queries. You can:

- Add, drop, and rename columns
- Reorder columns
- Widen column types (for example, int to bigint)
- Add fields to nested structs
This flexibility is critical for long-lived datasets that need to adapt to changing business requirements without disrupting analytics workflows or requiring expensive data migrations.
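As a brief sketch (Spark SQL against a hypothetical Iceberg table my_catalog.db.events), these schema changes are metadata-only operations that do not rewrite data files:

-- Add, rename, widen, and drop columns without touching existing data files
ALTER TABLE my_catalog.db.events ADD COLUMN email STRING;
ALTER TABLE my_catalog.db.events RENAME COLUMN email TO contact_email;
ALTER TABLE my_catalog.db.events ALTER COLUMN id TYPE BIGINT;
ALTER TABLE my_catalog.db.events DROP COLUMN contact_email;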
Iceberg’s time travel capability allows querying historical versions of data, enabling:

- Auditing how data looked at a specific point in time
- Reproducing reports or machine learning training runs against an earlier version of a table
- Debugging and recovering from accidental changes by rolling back to a prior snapshot
Unlike traditional databases where historical data is often lost, Iceberg preserves previous versions of the data through its snapshot-based architecture.
Traditional partitioning schemes often require rewriting all data when partition strategies change. Iceberg introduces partition evolution, which allows changing how data is partitioned without rewriting existing data.
This means you can:

- Start with coarse partitions (for example, daily) and move to finer ones (hourly) as data volume grows
- Apply the new partitioning strategy to new data while old data keeps its original layout
- Avoid costly full-table rewrites and backfills when query patterns change
Iceberg abstracts partition details from users, automating partition management and eliminating the need to specify partition values in queries. This “hidden partitioning” makes queries simpler and less error-prone while still benefiting from the performance advantages of partitioned data.
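For illustration, here is a hedged Spark SQL sketch (table and column names are hypothetical): the table is partitioned by a transform of event_ts, queries filter on the raw column without naming any partition, and the partition scheme can later be evolved without rewriting existing files:

-- Partition by day using a hidden transform of the timestamp column
CREATE TABLE my_catalog.db.logs (
    id BIGINT,
    message STRING,
    event_ts TIMESTAMP
) USING ICEBERG
PARTITIONED BY (days(event_ts));

-- Queries filter on event_ts directly; Iceberg prunes partitions automatically
SELECT * FROM my_catalog.db.logs
WHERE event_ts >= TIMESTAMP '2024-03-01 00:00:00';

-- Later, evolve the partition spec without rewriting existing data
-- (requires the Iceberg SQL extensions)
ALTER TABLE my_catalog.db.logs ADD PARTITION FIELD hours(event_ts);
ALTER TABLE my_catalog.db.logs DROP PARTITION FIELD days(event_ts);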
Through rich metadata, Iceberg maintains statistics about data files, enabling query engines to skip reading files that don’t contain relevant data. This data skipping dramatically improves query performance, especially for highly selective queries on large datasets.
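You can inspect the statistics that drive this pruning by querying the table's files metadata table (a small sketch against the hypothetical my_catalog.db.logs table above):

-- Per-file row counts and column bounds used for data skipping
SELECT file_path, record_count, lower_bounds, upper_bounds
FROM my_catalog.db.logs.files;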
Unlike some table formats that tie you to specific query engines, Iceberg works with multiple engines, including:

- Apache Spark
- Apache Flink
- Trino and Presto
- Apache Hive
- Snowflake
This flexibility ensures you’re not locked into a single technology and can choose the right tool for each job.
Iceberg tables consist of three primary layers that work together to provide its advanced capabilities:
The metadata layer maintains information about:

- The table schema and its history
- Partition specifications and sort orders
- Snapshots and the pointer to the current snapshot
- Table properties and configuration
This metadata is stored in JSON files within the table’s location and provides a comprehensive view of the table’s state and history.
Manifest lists track different versions of manifest files, which in turn track data files. This two-level approach allows Iceberg to efficiently manage large tables with thousands or millions of files.
Manifests contain metadata about individual data files, including:

- File paths and formats
- Partition values for each file
- Record counts and file sizes
- Column-level statistics such as min/max values and null counts
The actual data resides in the data layer, typically stored as Parquet, ORC, or Avro files in cloud storage (S3, ADLS, GCS) or HDFS. These files remain immutable, meaning they’re never modified in place—changes create new files instead, which enables reliable snapshots and time travel.
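A brief sketch of what this looks like in practice (hypothetical table, Spark SQL with the Iceberg extensions enabled): a row-level DELETE never edits existing Parquet files; it commits a new snapshot, which you can see in the table's snapshots metadata table:

-- Deleting rows writes new files and a new snapshot; old files are untouched
DELETE FROM my_catalog.db.logs
WHERE event_ts < TIMESTAMP '2023-01-01 00:00:00';

-- Each committed change appears as a snapshot
SELECT committed_at, snapshot_id, operation
FROM my_catalog.db.logs.snapshots
ORDER BY committed_at;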
When comparing Iceberg to other modern table formats like Delta Lake and Apache Hudi, several distinctions become apparent:
| Feature | Apache Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| ACID Support | ✅ Yes | ✅ Yes | ✅ Yes |
| Schema Evolution | ✅ Full Support | ✅ Full Support | ✅ Partial |
| Time Travel | ✅ Yes | ✅ Yes | ✅ Yes |
| Partition Evolution | ✅ Yes | ❌ No | ❌ No |
| Query Engines | Spark, Flink, Trino, Presto, Hive | Primarily Spark | Spark, Flink |
| Data Skipping | ✅ Yes | ✅ Yes | ✅ Yes |
| Developer | Netflix, Apache | Databricks | Uber, Apache |
| Metadata Management | Snapshot-based | Transaction log-based | Combination |
Delta Lake, developed by Databricks, is especially well-integrated with the Databricks ecosystem and Microsoft Fabric, while Iceberg offers more flexibility across different query engines and platforms.
The main differences in implementation are:

- Iceberg tracks table state through snapshot metadata files, manifest lists, and manifests stored alongside the data
- Delta Lake uses a transaction log directory (_delta_log/) to track changes, with Parquet files storing the actual data

Snowflake has added native support for Iceberg tables, allowing you to leverage Iceberg’s capabilities within the Snowflake ecosystem.
CREATE OR REPLACE ICEBERG TABLE my_db.iceberg_table (
    id INT,
    name STRING,
    age INT
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_external_volume'
BASE_LOCATION = 'my_iceberg_data/';
This creates a Snowflake-managed Iceberg table whose data and metadata are written to cloud storage (AWS S3, Azure Blob Storage, or GCS) through an external volume; external volumes are covered later in this guide.
SELECT * FROM my_db.iceberg_table AT (TIMESTAMP => '2024-03-01 10:00:00'::TIMESTAMP_LTZ);
This retrieves data from a past timestamp, showcasing Iceberg’s time travel capabilities.
ALTER TABLE my_db.iceberg_table ADD COLUMN email STRING;
This adds a new column without breaking existing queries or requiring data rewriting.
Microsoft Fabric, with its Lakehouse architecture based on OneLake, can integrate with Iceberg tables through Apache Spark.
from pyspark.sql import SparkSession
# Initialize Spark session (assumes the Iceberg Spark runtime is installed and an
# Iceberg-aware catalog is configured for the session; see the AWS example later)
spark = SparkSession.builder.appName("IcebergFabric").getOrCreate()
# Define table path (Microsoft OneLake)
table_path = "abfss://lakehouse@onelake.dfs.fabric.microsoft.com/iceberg_table/"
# Create Iceberg table
spark.sql(f"""
CREATE TABLE iceberg_table (
id INT,
name STRING,
age INT
) USING ICEBERG LOCATION '{table_path}'
""")
This creates an Iceberg table in Microsoft Fabric’s OneLake storage.
# Insert data
data = [(1, 'Alice', 30), (2, 'Bob', 28)]
df = spark.createDataFrame(data, ["id", "name", "age"])
df.write.format("iceberg").mode("append").save(table_path)
# Query data
df = spark.read.format("iceberg").load(table_path)
df.show()
This demonstrates basic data operations with Iceberg in Fabric.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder \
.config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.my_catalog.type", "hadoop") \
.config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/iceberg/") \
.getOrCreate()
# Create Iceberg table
spark.sql("""
CREATE TABLE my_catalog.db.iceberg_table (
id INT,
name STRING,
age INT
) USING ICEBERG
""")
This sets up an Iceberg table in AWS S3, with Spark as the query engine.
spark.conf.set("spark.sql.catalog.azure_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.azure_catalog.type", "hadoop")
spark.conf.set("spark.sql.catalog.azure_catalog.warehouse", "abfss://mycontainer@myaccount.dfs.core.windows.net/iceberg/")
# Create Iceberg table in ADLS
spark.sql("""
CREATE TABLE azure_catalog.db.iceberg_table (
id INT,
name STRING,
age INT
) USING ICEBERG
""")
This establishes an Iceberg table in Azure Data Lake Storage Gen2.
When working with cloud data lakes, three related concepts often come up: External Tables, External Stages, and External Volumes. These are particularly relevant when integrating Iceberg with platforms like Snowflake.
An External Table is a table that doesn’t store data inside a database but instead reads data directly from external storage like AWS S3, Azure Blob, or Google Cloud Storage.
Key characteristics:

- Data stays in external cloud storage; only metadata and the table definition live in the database
- Typically read-only from the database’s perspective
- Queried with standard SQL, just like regular tables
- The schema is defined on top of files that already exist in the lake
Example: Creating an External Table in Snowflake
CREATE OR REPLACE EXTERNAL TABLE my_db.external_table (
id INT AS (value:id::INT),
name STRING AS (value:name::STRING)
)
WITH LOCATION = @my_external_stage/data/
FILE_FORMAT = (TYPE = PARQUET);
Benefits of External Tables:

- No need to load or duplicate data into the warehouse
- Query data lake files in place with familiar SQL
- Lower storage costs, since data remains in object storage
- The same files can be shared across multiple tools
An External Stage is a location in cloud storage that acts as a bridge between your data lake and your database. It tells your database where to find external files.
Key purposes:

- Points the database at a specific cloud storage location
- Centralizes credentials and access control through a storage integration
- Serves as the source or target for bulk loading and unloading (COPY INTO)
Example: Creating an External Stage in Snowflake
CREATE OR REPLACE STAGE my_external_stage
URL = 's3://my-bucket/data/'
STORAGE_INTEGRATION = my_integration;
With an External Stage, you can:

-- List the files in the stage
LIST @my_external_stage;
-- Query staged files directly
SELECT $1 FROM @my_external_stage;
-- Load staged files into a table
COPY INTO my_table FROM @my_external_stage;
An External Volume is a higher-level storage concept in Snowflake that allows more control over external storage locations. Unlike External Stages, which are mainly for querying or loading data, External Volumes allow read/write access to cloud storage.
Key features:

- An account-level object that defines one or more cloud storage locations
- Supports both reads and writes from Snowflake
- Required for Snowflake’s Iceberg tables, which keep data and metadata in your own storage
Example: Creating an External Volume
CREATE OR REPLACE EXTERNAL VOLUME my_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/my-data/'
      -- placeholder: the IAM role Snowflake should assume to access the bucket
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
    )
  );
With External Volumes, Snowflake can write data back to cloud storage. When you create a Snowflake-managed Iceberg table that references the volume (as in the CREATE ICEBERG TABLE example earlier), Snowflake writes the table’s Parquet data files and Iceberg metadata under the volume’s storage location and updates them as the table changes.
This capability makes External Volumes particularly useful for integrating with Iceberg, as they provide the read/write access needed for maintaining Iceberg tables in external storage.
When you create an Iceberg table, it organizes data into a structured file hierarchy that might initially seem complex but serves important purposes for performance and functionality.
A typical Iceberg table in cloud storage (like AWS S3) has this structure:
s3://my-bucket/iceberg_table/
├── metadata/ # Table schema, snapshot history, manifest lists, and manifests
│   ├── v0001.metadata.json
│   ├── v0002.metadata.json
│   ├── snap-00001.avro # manifest list for a snapshot
│   ├── 00001-m0.avro # manifest tracking data files
├── data/ # Contains actual table data (Parquet files)
│   ├── 00001-abc.parquet
│   ├── 00002-def.parquet
Each directory serves a specific purpose:

- metadata/ holds the table’s schema, snapshot history, manifest lists, and the manifests that record which data files belong to each snapshot
- data/ holds the immutable Parquet (or ORC/Avro) files containing the table’s rows
A common source of confusion is seeing multiple Parquet files in the data folder when you’ve only created a single table. This happens because:

- Spark and other engines write in parallel, producing one file per task
- Each insert, update, or delete commits new files rather than modifying existing ones
- Data files are immutable, so every change adds files and a new snapshot
When you insert data into an Iceberg table, it creates new Parquet files in the data folder:
INSERT INTO my_catalog.db.iceberg_table VALUES
    (1, 'Alice', 30),
    (2, 'Bob', 28);
This might result in multiple files like:
s3://my-bucket/iceberg_table/data/
├── 00001-abc.parquet
├── 00002-def.parquet
Each time you make changes, more files may be added, and Iceberg tracks which files belong to each version of the table.
You don’t need to manage or access these files directly. Instead, use your query engine (Snowflake, Spark, etc.) to interact with the Iceberg table:
-- Query current data
SELECT * FROM my_catalog.db.iceberg_table;
-- See which data files back the current snapshot
SELECT file_path, record_count
FROM my_catalog.db.iceberg_table.files;
If the number of small files becomes excessive, you can compact them:
CALL my_catalog.system.rewrite_data_files(table => 'db.iceberg_table');
This operation merges smaller files into larger ones for better query performance while preserving the table’s logical structure.
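The procedure also accepts tuning arguments; for example, a hedged sketch of a sort-based rewrite with a larger target file size (the values are illustrative):

CALL my_catalog.system.rewrite_data_files(
    table => 'db.iceberg_table',
    strategy => 'sort',
    sort_order => 'id',
    -- write roughly 512 MB output files
    options => map('target-file-size-bytes', '536870912')
);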
Time travel is a powerful capability in Apache Iceberg, allowing you to query historical versions of data. This feature works differently than in some database systems like Snowflake.
| Feature | Snowflake Time Travel | Iceberg Time Travel |
|---|---|---|
| How It Works | Automatically retains past data versions for a configurable retention period (with a separate Fail-safe period afterward) | Uses metadata snapshots stored in object storage (S3, ADLS, GCS) |
| Default Retention Period | 1 day (standard), up to 90 days (Enterprise edition) | Retains snapshots forever unless manually expired |
| Storage Location | Inside Snowflake storage (costs extra) | Metadata files plus the actual data in cloud storage |
| Performance Impact | Fast but costs extra storage | Efficient, low cost, but requires cleanup |
| Best Used For | Short-term recovery, auditing, and rollbacks | Long-term versioning, ML model tracking, and large data lake queries |
| Automatic Cleanup? | Yes, Snowflake automatically deletes old versions after the retention period | No, you must manually expire old snapshots |
Unlike Snowflake, Iceberg doesn’t automatically delete historical data. Instead, it keeps all snapshots indefinitely until you explicitly expire them.
By default, Iceberg keeps all historical snapshots forever. To manage storage costs, you need to explicitly expire old snapshots:
-- Remove snapshots older than a specific date
CALL my_catalog.system.expire_snapshots(
    table => 'db.iceberg_table',
    older_than => TIMESTAMP '2024-01-01 00:00:00'
);
-- Keep only the last 30 days of snapshots by passing a cutoff timestamp
-- (for example, computed as "now minus 30 days" by your scheduler)
CALL my_catalog.system.expire_snapshots(
    table => 'db.iceberg_table',
    older_than => TIMESTAMP '2024-02-10 00:00:00'
);
This gives you complete control over how long history is retained.
To use time travel effectively, you need to understand snapshots—the mechanism by which Iceberg tracks table versions.
Listing available snapshots:
SELECT * FROM my_catalog.db.iceberg_table.snapshots;
This shows all snapshots with their IDs and timestamps.
Querying a table at a specific snapshot:
SELECT * FROM my_catalog.db.iceberg_table
FOR SYSTEM_VERSION AS OF 1234567890123;
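Spark also supports timestamp-based time travel on Iceberg tables (the timestamp below is illustrative):

SELECT * FROM my_catalog.db.iceberg_table
TIMESTAMP AS OF '2024-03-01 10:00:00';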
You can also roll back a table to a previous snapshot if needed:
CALL my_catalog.system.rollback_to_snapshot(table => 'db.iceberg_table', snapshot_id => 1234567890123);
This restores the table to the specified snapshot without having to restore from backups.
To get the most out of Apache Iceberg, consider these best practices:

- Compact small files regularly with rewrite_data_files() to merge them into larger ones
- Expire old snapshots on a schedule so metadata and storage do not grow without bound
- Choose partition transforms that match your most common query filters, and evolve them as access patterns change

Apache Iceberg represents a significant advancement in data lake management, bringing database-like reliability and performance to cloud storage. Its key strengths include ACID transactions, schema evolution, advanced partitioning, and time travel capabilities. These features make it an excellent choice for large-scale analytical workloads, especially when working across multiple query engines.
When deciding whether to use Iceberg, consider these guidelines:

- Iceberg is a strong fit for large, frequently changing analytical tables, especially when multiple query engines need to share the same data
- It pays off when you expect schemas or partitioning to evolve, or when ACID guarantees, time travel, and auditability matter
- For small, static datasets, or when a single warehouse platform already meets your needs, the extra table-format machinery may not be worth the operational overhead
As data lake technologies continue to evolve, Iceberg has emerged as a leader in providing a robust, open table format that bridges the gap between traditional data lakes and data warehouses. By understanding and implementing Iceberg effectively, organizations can build more reliable, performant, and flexible data architectures for their analytical needs.