Iceberg

Apache Iceberg is an open-source table format for data lakes. This guide covers core concepts, features like ACID transactions and time travel, and implementation on platforms like Spark and Snowflake.

Apache Iceberg is an open-source table format that has revolutionized large-scale data management in data lakes. Initially developed at Netflix to handle massive datasets efficiently, it has grown into a robust Apache project widely adopted across industries. This comprehensive guide explores Iceberg’s core concepts, features, implementation across major platforms, and best practices for effective data management. Iceberg provides significant advantages over traditional approaches, offering ACID transactions, schema evolution, and time travel capabilities while integrating seamlessly with popular query engines like Spark, Snowflake, Trino, and Flink.

What is Apache Iceberg?

Apache Iceberg is a high-performance table format for huge analytical datasets that addresses many limitations of traditional data lake storage approaches. It was designed to improve reliability, performance, and manageability of data lakes while providing features typically associated with mature database systems.

At its core, Iceberg is not a database itself but rather a specification for how data should be organized, tracked, and managed in a data lake environment. It works with data stored in Parquet, ORC, or Avro formats and provides a layer of management on top of these files.

Unlike traditional data lake tables, Iceberg offers:

  • A table-centric view of data rather than just a collection of files
  • Consistent snapshots of table data to ensure readers get a consistent view
  • ACID-compliant operations even when multiple users or processes access the data simultaneously
  • Advanced metadata management to track files, schema, and partitions efficiently

How Iceberg Differs from Traditional Approaches

Traditional data lakes often struggle with:

  • File management (discovering which files belong to a table)
  • Schema changes that break existing queries
  • Lack of transaction support, leading to incomplete or inconsistent reads
  • Manual partition management that becomes cumbersome at scale

Iceberg resolves these pain points through its sophisticated table format that treats data as a cohesive entity rather than a loose collection of files.

Key Features of Apache Iceberg

ACID Transactions

Iceberg provides full ACID (Atomicity, Consistency, Isolation, Durability) transaction support, ensuring data integrity even during concurrent operations. This prevents partial updates and guarantees that readers always see a consistent view of the data.

When multiple users make changes, Iceberg uses optimistic concurrency control to resolve conflicts, allowing operations to proceed in parallel without lengthy locks while still maintaining data consistency.
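
If writers collide often on the same table, the retry behavior of commits can be tuned through Iceberg table properties. A minimal Spark SQL sketch, assuming an illustrative table named my_catalog.db.events (names and values are examples, not defaults you must use):

SQL
-- Allow more automatic retries before a concurrent commit fails
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'commit.retry.num-retries' = '10',
    'commit.retry.min-wait-ms' = '250'
);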

Schema Evolution

One of Iceberg’s most powerful features is its ability to evolve table schemas without breaking existing queries. You can:

  • Add new columns without requiring data migration
  • Rename columns while preserving backward compatibility
  • Remove columns that are no longer needed
  • Change column types safely in compatible ways

This flexibility is critical for long-lived datasets that need to adapt to changing business requirements without disrupting analytics workflows or requiring expensive data migrations.
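
In engines that support Iceberg's SQL extensions, these changes are simple DDL statements. A sketch using Spark SQL, with an illustrative my_catalog.db.customers table:

SQL
-- Add a column without rewriting any data files
ALTER TABLE my_catalog.db.customers ADD COLUMN email STRING;

-- Rename a column; existing files are untouched
ALTER TABLE my_catalog.db.customers RENAME COLUMN fullname TO name;

-- Widen a type in a compatible way (INT to BIGINT)
ALTER TABLE my_catalog.db.customers ALTER COLUMN id TYPE BIGINT;

-- Drop a column that is no longer needed
ALTER TABLE my_catalog.db.customers DROP COLUMN legacy_flag;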

Time Travel

Iceberg’s time travel capability allows querying historical versions of data, enabling:

  • Point-in-time analysis
  • Auditing changes for compliance
  • Rolling back to previous states after erroneous updates
  • Comparisons across different time periods

Unlike traditional databases where historical data is often lost, Iceberg preserves previous versions of the data through its snapshot-based architecture.

Partition Evolution

Traditional partitioning schemes often require rewriting all data when partition strategies change. Iceberg introduces partition evolution, which allows changing how data is partitioned without rewriting existing data.

This means you can:

  • Add new partition fields as query patterns evolve
  • Remove unnecessary partitioning
  • Change partition granularity (e.g., from daily to monthly)
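
With Spark's Iceberg SQL extensions, partition evolution is a metadata-only change. A sketch against an illustrative my_catalog.db.events table:

SQL
-- Add a new partition field as query patterns evolve
ALTER TABLE my_catalog.db.events ADD PARTITION FIELD category;

-- Change granularity from daily to monthly for new writes
ALTER TABLE my_catalog.db.events DROP PARTITION FIELD days(event_ts);
ALTER TABLE my_catalog.db.events ADD PARTITION FIELD months(event_ts);

Existing files keep their original layout; only data written after the change uses the new partition spec.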

Hidden Partitioning

Iceberg abstracts partition details from users, automating partition management and eliminating the need to specify partition values in queries. This “hidden partitioning” makes queries simpler and less error-prone while still benefiting from the performance advantages of partitioned data.
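
For example, a table can be partitioned by a transform of a timestamp column, and queries simply filter on the column itself. A Spark SQL sketch with illustrative names:

SQL
-- Partition by day, derived automatically from event_ts
CREATE TABLE my_catalog.db.events (
    id BIGINT,
    event_ts TIMESTAMP,
    payload STRING
) USING ICEBERG
PARTITIONED BY (days(event_ts));

-- No partition column in the query; Iceberg prunes partitions from the filter on event_ts
SELECT * FROM my_catalog.db.events
WHERE event_ts BETWEEN '2024-03-01' AND '2024-03-02';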

Efficient Data Skipping

Through rich metadata, Iceberg maintains statistics about data files, enabling query engines to skip reading files that don’t contain relevant data. This data skipping dramatically improves query performance, especially for highly selective queries on large datasets.
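
Data skipping is most effective when per-file min/max statistics are narrow, which a declared sort order encourages. A sketch using Spark's Iceberg SQL extensions (table and column names are illustrative):

SQL
-- Cluster rows on write so file-level statistics on event_ts stay tight
ALTER TABLE my_catalog.db.events WRITE ORDERED BY event_ts;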

Multi-Engine Support

Unlike some table formats that tie you to specific query engines, Iceberg works with multiple engines, including:

  • Apache Spark
  • Snowflake
  • Flink
  • Trino
  • Presto
  • Hive
  • AWS Athena

This flexibility ensures you’re not locked into a single technology and can choose the right tool for each job.

How Apache Iceberg Works

Iceberg tables consist of three primary layers that work together to provide its advanced capabilities:

Metadata Layer

The metadata layer maintains information about:

  • Table versions (snapshots)
  • Schema definitions and evolution history
  • Partition specifications
  • File statistics and locations

This metadata is stored in JSON files within the table’s location and provides a comprehensive view of the table’s state and history.

Manifest Lists and Manifests

Manifest lists track different versions of manifest files, which in turn track data files. This two-level approach allows Iceberg to efficiently manage large tables with thousands or millions of files.

Manifests contain metadata about individual data files, including:

  • File path and format
  • Partition data
  • Record counts
  • Column-level statistics for data skipping
  • Delete file references for row-level deletes
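
In engines such as Spark, this metadata can be inspected directly through Iceberg's built-in metadata tables; a sketch with an illustrative table name:

SQL
-- One row per table version
SELECT * FROM my_catalog.db.events.snapshots;

-- Manifests referenced by the current snapshot
SELECT * FROM my_catalog.db.events.manifests;

-- Data files, with record counts and column statistics
SELECT * FROM my_catalog.db.events.files;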

Data Layer

The actual data resides in the data layer, typically stored as Parquet, ORC, or Avro files in cloud storage (S3, ADLS, GCS) or HDFS. These files remain immutable, meaning they’re never modified in place—changes create new files instead, which enables reliable snapshots and time travel.

Apache Iceberg vs. Other Table Formats

When comparing Iceberg to other modern table formats like Delta Lake and Apache Hudi, several distinctions become apparent:

| Feature             | Apache Iceberg                    | Delta Lake             | Apache Hudi  |
|---------------------|-----------------------------------|------------------------|--------------|
| ACID Support        | ✅ Yes                            | ✅ Yes                 | ✅ Yes       |
| Schema Evolution    | ✅ Full Support                   | ✅ Full Support        | ✅ Partial   |
| Time Travel         | ✅ Yes                            | ✅ Yes                 | ✅ Yes       |
| Partition Evolution | ✅ Yes                            | ❌ No                  | ❌ No        |
| Query Engines       | Spark, Flink, Trino, Presto, Hive | Primarily Spark        | Spark, Flink |
| Data Skipping       | ✅ Yes                            | ✅ Yes                 | ✅ Yes       |
| Developer           | Netflix, Apache                   | Databricks             | Uber, Apache |
| Metadata Management | Snapshot-based                    | Transaction log-based  | Combination  |

Delta Lake, developed by Databricks, is especially well-integrated with the Databricks ecosystem and Microsoft Fabric, while Iceberg offers more flexibility across different query engines and platforms.

The main differences in implementation are:

  • Delta Lake uses a transaction log (_delta_log/) to track changes, with Parquet files for storing the actual data.
  • Iceberg uses snapshot-based metadata files and manifest files to track versions and data files.
  • Hudi offers both copy-on-write and merge-on-read strategies for different use cases.

Implementing Apache Iceberg

Working with Snowflake

Snowflake has added native support for Iceberg tables, allowing you to leverage Iceberg’s capabilities within the Snowflake ecosystem.

Creating an Iceberg Table in Snowflake

SQL
CREATE OR REPLACE ICEBERG TABLE my_db.iceberg_table (
    id INT,
    name STRING,
    age INT
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_external_volume'
BASE_LOCATION = 'my_iceberg_data/';

This creates a Snowflake-managed Iceberg table whose data and metadata are stored on an external volume in cloud storage (AWS S3, Azure, or GCS). External volumes are covered later in this guide.

Querying with Time Travel in Snowflake

SQL
SELECT * FROM my_db.iceberg_table AT (TIMESTAMP => '2024-03-01 10:00:00'::TIMESTAMP_LTZ);

This retrieves data from a past timestamp, showcasing Iceberg’s time travel capabilities.

Schema Evolution in Snowflake

SQL
ALTER ICEBERG TABLE my_db.iceberg_table ADD COLUMN email STRING;

This adds a new column without breaking existing queries or requiring data rewriting.

Working with Microsoft Fabric

Microsoft Fabric, with its Lakehouse architecture based on OneLake, can integrate with Iceberg tables through Apache Spark.

Creating an Iceberg Table in Microsoft Fabric

Python
from pyspark.sql import SparkSession

# Initialize Spark session with Iceberg support
# (assumes the iceberg-spark-runtime package is available in the Fabric Spark environment)
spark = (
    SparkSession.builder.appName("IcebergFabric")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate()
)

# Define table path (Microsoft OneLake)
table_path = "abfss://lakehouse@onelake.dfs.fabric.microsoft.com/iceberg_table/"

# Create an Iceberg table stored at the OneLake location
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS iceberg_table (
        id INT,
        name STRING,
        age INT
    ) USING ICEBERG LOCATION '{table_path}'
""")

This creates an Iceberg table in Microsoft Fabric’s OneLake storage.

Inserting and Querying Data

Python
# Insert data (write by table name, since the table is registered in the catalog)
data = [(1, 'Alice', 30), (2, 'Bob', 28)]
df = spark.createDataFrame(data, ["id", "name", "age"])
df.writeTo("iceberg_table").append()

# Query data
spark.table("iceberg_table").show()

This demonstrates basic data operations with Iceberg in Fabric.

Cloud Implementations

AWS S3 + Iceberg (Apache Spark)

Python
from pyspark.sql import SparkSession

# Initialize Spark session with an Iceberg catalog backed by S3
# (requires the iceberg-spark-runtime package and S3 filesystem jars on the classpath)
spark = SparkSession.builder \
    .appName("IcebergOnS3") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "hadoop") \
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/iceberg/") \
    .getOrCreate()

# Create Iceberg table
spark.sql("""
    CREATE TABLE my_catalog.db.iceberg_table (
        id INT,
        name STRING,
        age INT
    ) USING ICEBERG
""")

This sets up an Iceberg table in AWS S3, with Spark as the query engine.

Azure Data Lake + Iceberg

Python
# Configure an Iceberg catalog backed by Azure Data Lake Storage Gen2
spark.conf.set("spark.sql.catalog.azure_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.azure_catalog.type", "hadoop")
spark.conf.set("spark.sql.catalog.azure_catalog.warehouse", "abfss://mycontainer@myaccount.dfs.core.windows.net/iceberg/")

# Create Iceberg table in ADLS
spark.sql("""
    CREATE TABLE azure_catalog.db.iceberg_table (
        id INT,
        name STRING,
        age INT
    ) USING ICEBERG
""")

This establishes an Iceberg table in Azure Data Lake Storage Gen2.

External Data Management Concepts

When working with cloud data lakes, three related concepts often come up: External Tables, External Stages, and External Volumes. These are particularly relevant when integrating Iceberg with platforms like Snowflake.

External Tables

An External Table is a table that doesn’t store data inside a database but instead reads data directly from external storage like AWS S3, Azure Blob, or Google Cloud Storage.

Key characteristics:

  • Data remains in external storage
  • Can be queried like a normal table
  • Doesn’t support full ACID transactions unless using a format like Iceberg
  • Good for large datasets you don’t want to move into a database

Example: Creating an External Table in Snowflake

SQL
CREATE OR REPLACE EXTERNAL TABLE my_db.external_table (
    id INT AS (value:id::INT),
    name STRING AS (value:name::STRING)
)
WITH LOCATION = @my_external_stage/data/
FILE_FORMAT = (TYPE = PARQUET);

Benefits of External Tables:

  • Avoid moving large datasets into the database
  • Save storage costs in platforms like Snowflake
  • Support fast querying of external data without loading it

External Stages

An External Stage is a location in cloud storage that acts as a bridge between your data lake and your database. It tells your database where to find external files.

Key purposes:

  • Define where external data is located
  • Enable loading data from cloud storage into a database
  • Support querying data directly from the stage

Example: Creating an External Stage in Snowflake

SQL
CREATE OR REPLACE STAGE my_external_stage
URL = 's3://my-bucket/data/'
STORAGE_INTEGRATION = my_integration;

With an External Stage, you can:

  • List files: LIST @my_external_stage;
  • Read raw data: SELECT $1 FROM @my_external_stage;
  • Load data into internal tables: COPY INTO my_table FROM @my_external_stage;

External Volumes

An External Volume is a higher-level storage concept in Snowflake that allows more control over external storage locations. Unlike External Stages, which are mainly for querying or loading data, External Volumes allow read/write access to cloud storage.

Key features:

  • Better security and access management
  • Both read and write access to cloud storage
  • Ideal for ETL and machine learning workloads

Example: Creating an External Volume

SQL
CREATE OR REPLACE EXTERNAL VOLUME my_volume
  STORAGE_LOCATIONS = ((
    NAME = 'my-s3-location' STORAGE_PROVIDER = 'S3'
    STORAGE_BASE_URL = 's3://my-bucket/my-data/'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_iceberg_role'
  ));

With an External Volume in place, Snowflake can write Iceberg table data and metadata back to cloud storage. For example, a Snowflake-managed Iceberg table created on the volume stores all of its files there:

SQL
CREATE ICEBERG TABLE my_db.managed_iceberg_table (id INT, name STRING)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_volume'
BASE_LOCATION = 'managed_iceberg_table/';

This capability makes External Volumes particularly useful for integrating with Iceberg, as they provide the read/write access needed for maintaining Iceberg tables in external storage.

Iceberg File Structure

When you create an Iceberg table, it organizes data into a structured file hierarchy that might initially seem complex but serves important purposes for performance and functionality.

Understanding the File Organization

A typical Iceberg table in cloud storage (like AWS S3) has this structure:

s3://my-bucket/iceberg_table/
 ├── metadata/          # Stores table schema, snapshots, manifest lists, and manifests
 │   ├── v0001.metadata.json
 │   ├── v0002.metadata.json
 │   ├── snap-00001.avro        # Manifest list for a snapshot
 │   ├── 00001-m0.avro          # Manifest tracking individual data files
 ├── data/              # Contains actual table data (Parquet files)
 │   ├── 00001-abc.parquet
 │   ├── 00002-def.parquet

(File names are illustrative.) Each directory serves a specific purpose:

  • metadata/: Contains JSON files with the table schema, partition specs, and snapshot information, plus Avro manifest lists and manifests that record which data files belong to each snapshot
  • data/: Stores the actual data in Parquet format (or ORC/Avro)

Why Multiple Files for a Single Table?

A common source of confusion is seeing multiple Parquet files in the data folder when you’ve only created a single table. This happens because:

  1. Performance Optimization: Iceberg splits data into multiple files for faster parallel processing
  2. Snapshot-Based Versioning: Each update creates new files rather than modifying existing ones
  3. Parallel Query Support: Multiple smaller files allow distributed query engines to scan data in parallel

When you insert data into an Iceberg table, it creates new Parquet files in the data folder:

SQL
INSERT INTO my_catalog.iceberg_table VALUES 
(1, 'Alice', 100.50, '2024-03-10'),
(2, 'Bob', 200.75, '2024-03-11');

This might result in multiple files like:

s3://my-bucket/iceberg_table/data/
 ├── 00001-abc.parquet
 ├── 00002-def.parquet

Each time you make changes, more files may be added, and Iceberg tracks which files belong to each version of the table.

Working with Multiple Files

You don’t need to manage or access these files directly. Instead, use your query engine (Snowflake, Spark, etc.) to interact with the Iceberg table:

SQL
-- Query current data
SELECT * FROM my_catalog.iceberg_table;

-- See which data files back the current snapshot (Iceberg's files metadata table)
SELECT * FROM my_catalog.iceberg_table.files;

If the number of small files becomes excessive, you can compact them:

SQL
CALL my_catalog.system.rewrite_data_files(table => 'iceberg_table');

This operation merges smaller files into larger ones for better query performance while preserving the table’s logical structure.

Time Travel in Apache Iceberg

Time travel is a powerful capability in Apache Iceberg, allowing you to query historical versions of data. This feature works differently than in some database systems like Snowflake.

How Iceberg Time Travel Works vs. Snowflake

| Feature                  | Snowflake Time Travel                                                | Iceberg Time Travel                                                   |
|--------------------------|----------------------------------------------------------------------|-----------------------------------------------------------------------|
| How It Works             | Uses Fail-safe storage and automatically retains past data versions  | Uses metadata snapshots stored in object storage (S3, ADLS, GCS)      |
| Default Retention Period | 1 day (Standard), up to 90 days (Enterprise edition)                 | Retains snapshots forever unless manually expired                     |
| Storage Location         | Inside Snowflake storage (costs extra)                               | Metadata files plus actual data in cloud storage                      |
| Performance Impact       | Fast but costs extra storage                                         | Efficient and low cost, but requires cleanup                          |
| Best Used For            | Short-term recovery, auditing, and rollbacks                         | Long-term versioning, ML model tracking, and large data lake queries  |
| Automatic Cleanup?       | Yes, Snowflake deletes old versions after the retention period       | No, you must manually expire old snapshots                            |

Unlike Snowflake, Iceberg doesn’t automatically delete historical data. Instead, it keeps all snapshots indefinitely until you explicitly expire them.

Managing Time Travel Retention

By default, Iceberg keeps all historical snapshots forever. To manage storage costs, you need to explicitly expire old snapshots:

SQL
-- Remove snapshots older than a specific date
CALL my_catalog.system.expire_snapshots(table => 'iceberg_table', older_than => TIMESTAMP '2024-01-01 00:00:00');

-- Keep only the last 30 days of snapshots
CALL my_catalog.system.expire_snapshots(table => 'iceberg_table', older_than => current_timestamp() - INTERVAL '30' DAY);

This gives you complete control over how long history is retained.

Working with Snapshots

To use time travel effectively, you need to understand snapshots—the mechanism by which Iceberg tracks table versions.

Listing available snapshots:

SQL
SELECT * FROM my_catalog.iceberg_table.snapshots;

This shows all snapshots with their IDs and timestamps.

Querying a table at a specific snapshot:

SQL
SELECT * FROM my_catalog.iceberg_table 
FOR SYSTEM_VERSION AS OF 1234567890123;

You can also roll back a table to a previous snapshot if needed:

SQL
CALL my_catalog.system.rollback_to_snapshot(table => 'iceberg_table', snapshot_id => 1234567890123);

This restores the table to the specified snapshot without having to restore from backups.

Best Practices and Optimization

To get the most out of Apache Iceberg, consider these best practices:

File Management

  1. Monitor file sizes: Too many small files can degrade performance.
  2. Regularly compact files: Use rewrite_data_files() to merge small files into larger ones.
  3. Balance file sizes: Aim for file sizes between 100MB and 1GB for optimal performance.
  4. Consider partitioning strategy: Choose partitioning based on query patterns, but don’t over-partition.
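
Compaction is typically run through Iceberg's Spark procedures. A minimal sketch, assuming an illustrative table and a 512 MB target file size:

SQL
CALL my_catalog.system.rewrite_data_files(
    table => 'db.events',
    options => map('target-file-size-bytes', '536870912')
);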

Query Optimization

  1. Leverage data skipping: Design queries that can benefit from Iceberg’s metadata-based filtering.
  2. Use appropriate predicate pushdown: Filter data as early as possible in the query.
  3. Monitor snapshot growth: Too many snapshots can slow down metadata operations.
  4. Expire old snapshots: Regularly clean up unneeded historical data.

Integration Best Practices

  1. Choose the right storage format: Parquet is generally recommended for most use cases.
  2. Consider compression: Use compression like Snappy or Zstd for better storage efficiency.
  3. Secure access properly: Set up appropriate access controls on your cloud storage.
  4. Implement monitoring: Track table metrics like file count, average file size, and query performance.
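
Format and compression defaults can be set as Iceberg table properties; a sketch with illustrative values:

SQL
ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'zstd'
);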

Conclusion

Apache Iceberg represents a significant advancement in data lake management, bringing database-like reliability and performance to cloud storage. Its key strengths include ACID transactions, schema evolution, advanced partitioning, and time travel capabilities. These features make it an excellent choice for large-scale analytical workloads, especially when working across multiple query engines.

When deciding whether to use Iceberg, consider these guidelines:

  • Use Iceberg when you need ACID transactions for data in cloud storage, frequent schema changes, historical data access, or multi-engine compatibility.
  • Consider alternatives like Delta Lake if you’re primarily working in the Databricks ecosystem or Microsoft Fabric.
  • Combine with external tables/volumes in platforms like Snowflake to get the best of both worlds—Snowflake’s performance and Iceberg’s flexibility.

As data lake technologies continue to evolve, Iceberg has emerged as a leader in providing a robust, open table format that bridges the gap between traditional data lakes and data warehouses. By understanding and implementing Iceberg effectively, organizations can build more reliable, performant, and flexible data architectures for their analytical needs.