Overview

Purpose and Scope

The iceberg-cpp library is a C++ implementation of the Apache Iceberg table format specification. It provides a native C++ API for reading, writing, and managing Iceberg tables with support for ACID transactions, schema evolution, time travel queries, and multiple storage backends.

This document provides a high-level overview of the library's architecture, major components, and design principles. For detailed information on specific topics:

Building and dependency management: see Getting Started and Dependency Management
Type system and schema definitions: see Core Concepts and Type System
Table operations and transactions: see Table Operations
Query planning and data access: see Query Planning and Data Access
File format support: see File Format Support

Sources: src/iceberg/table.h src/iceberg/table_metadata.h README.md

Key Features

The library implements the following Iceberg capabilities:

| Feature | Description | Key Components |
| --- | --- | --- |
| ACID Transactions | Optimistic concurrency control with atomic commits | `Transaction`, `TableMetadataBuilder`, `TableRequirement` |
| Schema Evolution | Add, drop, rename, and reorder columns without rewriting data | `UpdateSchema`, `Schema`, `SchemaField` |
| Time Travel | Query historical table snapshots | `Snapshot`, `SnapshotRef`, `TableScan` |
| Partition Evolution | Change partitioning scheme without data migration | `PartitionSpec`, `UpdatePartitionSpec` |
| Multiple File Formats | Read and write Avro and Parquet files | `Reader`, `Writer`, `ReaderFactory` |
| Predicate Pushdown | Filter data at partition, file, and row levels | `Expression`, `ManifestEvaluator`, `InclusiveMetricsEvaluator` |
| Hidden Partitioning | Partition transforms applied transparently | `Transform`, `PartitionField` |
| Sort Orders | Specify and maintain data ordering | `SortOrder`, `SortField` |
| Delete Files | Position and equality deletes for row-level operations | `DeleteFileIndex`, `ManifestEntry` |
| REST Catalog | Remote catalog protocol implementation | `RestCatalog`, `AuthManager` |

Sources: src/iceberg/table.h:37-169 src/iceberg/transaction.h:32-151 src/iceberg/table_metadata.h:64-165

Architecture

Layered Design

The library follows a four-layer architecture that separates concerns and enables modularity:

Figure 1: Four-Layer Architecture

Sources: src/iceberg/table.h src/iceberg/table_metadata.h src/iceberg/table_scan.h src/iceberg/file_io.h

Component Relationships

The following diagram shows how major components interact during typical operations:

Figure 2: Component Interaction Graph

Sources: src/iceberg/table.h:38-182 src/iceberg/transaction.h:33-151 src/iceberg/table_metadata.h:220-488 src/iceberg/update/pending_update.h:35-94

Major Components

Table and Catalog

The Table class is the primary entry point for interacting with Iceberg tables. It provides access to table metadata and factory methods for creating operations:

Location: src/iceberg/table.h:38-182 src/iceberg/table.cc:1-280
Key Methods:
- schema(), spec(), sort_order(): Access current table configuration
- NewScan(): Create query builders
- NewUpdateSchema(), NewFastAppend(), etc.: Create update operations
- Refresh(): Reload metadata from catalog

The Catalog interface abstracts table discovery and storage:

Implementations: RestCatalog for remote catalogs, InMemoryCatalog for testing
Operations: LoadTable(), CreateTable(), UpdateTable(), namespace management

Sources: src/iceberg/table.h:38-224 src/iceberg/catalog.h

Metadata Management

The metadata layer manages versioned table state through immutable snapshots:

TableMetadata

The TableMetadata struct (src/iceberg/table_metadata.h:72-165) contains all table configuration, including schemas, partition specs, sort orders, snapshots, and snapshot references.

Sources: src/iceberg/table_metadata.h:72-165 src/iceberg/table_metadata.cc:1-1679

TableMetadataBuilder

The TableMetadataBuilder class (src/iceberg/table_metadata.h:220-488) provides a fluent API for constructing new metadata versions:

Validates changes against Iceberg specification rules
Tracks changes as TableUpdate instances
Generates TableRequirement instances for optimistic concurrency control
Supports both creating new tables and evolving existing ones

Sources: src/iceberg/table_metadata.h:220-488 src/iceberg/table_metadata.cc:543-1679
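
The builder behaviour listed above (validate, record updates, derive requirements, emit a new immutable version) can be sketched in a few lines. This is an illustration only: `MiniMetadata`, `MiniBuildResult`, and `MiniMetadataBuilder` are hypothetical stand-ins, not the iceberg-cpp types, and the update and requirement strings are simplified labels rather than real `TableUpdate` or `TableRequirement` objects.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the iceberg-cpp types.
struct MiniMetadata {
  int current_schema_id = 0;
  std::vector<std::string> schemas;  // schema i is stored at index i
};

struct MiniBuildResult {
  std::shared_ptr<const MiniMetadata> metadata;  // new immutable version
  std::vector<std::string> updates;              // what changed
  std::vector<std::string> requirements;         // what must still hold at commit
};

class MiniMetadataBuilder {
 public:
  explicit MiniMetadataBuilder(std::shared_ptr<const MiniMetadata> base)
      : base_(std::move(base)), draft_(*base_) {}

  // Fluent style: each step returns *this so calls can be chained.
  MiniMetadataBuilder& AddSchema(std::string schema) {
    if (!error_.empty()) return *this;  // skip work after a failed step
    if (schema.empty()) {
      error_ = "schema must not be empty";
      return *this;
    }
    draft_.schemas.push_back(std::move(schema));
    updates_.push_back("add-schema");
    return *this;
  }

  MiniMetadataBuilder& SetCurrentSchema(int schema_id) {
    if (!error_.empty()) return *this;
    if (schema_id < 0 || schema_id >= static_cast<int>(draft_.schemas.size())) {
      error_ = "unknown schema id";
      return *this;
    }
    draft_.current_schema_id = schema_id;
    updates_.push_back("set-current-schema");
    return *this;
  }

  // Returns false and leaves `out` untouched if any step failed validation.
  bool Build(MiniBuildResult* out) const {
    if (!error_.empty()) return false;
    out->metadata = std::make_shared<const MiniMetadata>(draft_);
    out->updates = updates_;
    // The commit should only succeed if the base has not moved underneath us.
    out->requirements = {"assert-current-schema-id=" +
                         std::to_string(base_->current_schema_id)};
    return true;
  }

  const std::string& error() const { return error_; }

 private:
  std::shared_ptr<const MiniMetadata> base_;  // never mutated
  MiniMetadata draft_;                        // working copy
  std::vector<std::string> updates_;
  std::string error_;
};
```

The base metadata is held through a pointer-to-const and copied into a draft, so readers of the old version are unaffected while a new version is being assembled.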

Transaction System

The Transaction class (src/iceberg/transaction.h:33-151) coordinates atomic commits of multiple table changes:

Figure 3: Transaction Flow for Schema Update

Transaction modes:

Auto-commit: Each operation commits immediately (default for Table methods)
Explicit: Multiple operations are batched into a single atomic commit

Sources: src/iceberg/transaction.h:33-151 src/iceberg/transaction.cc:1-432
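
To make the explicit mode concrete, the following sketch shows optimistic concurrency as a compare-and-swap on a version number. It is a simplified model under stated assumptions: `MiniCatalog`, `MiniTableState`, and `MiniTransaction` are hypothetical names, and a single integer version stands in for the requirement checks the real library performs.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical catalog-side state; NOT the iceberg-cpp API.
struct MiniTableState {
  int version = 0;
  std::vector<std::string> applied;  // changes committed so far
};

class MiniCatalog {
 public:
  MiniTableState Load() const {
    std::lock_guard<std::mutex> lock(mu_);
    return state_;
  }

  // Atomic compare-and-swap: succeeds only if nobody has committed since
  // `expected_version` was read. All changes are applied, or none are.
  bool Commit(int expected_version, const std::vector<std::string>& changes) {
    std::lock_guard<std::mutex> lock(mu_);
    if (state_.version != expected_version) return false;  // conflict
    state_.applied.insert(state_.applied.end(), changes.begin(), changes.end());
    ++state_.version;
    return true;
  }

 private:
  mutable std::mutex mu_;
  MiniTableState state_;
};

class MiniTransaction {
 public:
  explicit MiniTransaction(MiniCatalog* catalog)
      : catalog_(catalog), base_version_(catalog->Load().version) {}

  // Staged changes are invisible to other readers until Commit() succeeds.
  MiniTransaction& Stage(std::string change) {
    pending_.push_back(std::move(change));
    return *this;
  }

  bool Commit() { return catalog_->Commit(base_version_, pending_); }

 private:
  MiniCatalog* catalog_;
  int base_version_;  // version observed when the transaction started
  std::vector<std::string> pending_;
};
```

When two transactions start from the same version, the first commit wins and the second is rejected. A caller would typically reload the table and retry at that point; the retry loop is omitted here.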

Schema and Type System

The type system is built on a hierarchy of Type classes:

Figure 4: Type System Hierarchy

A Schema (src/iceberg/schema.h:49-198) is a StructType with additional metadata:

Unique schema_id for tracking schema evolution
Identifier fields for row uniqueness constraints
Caching layer for fast field lookups by name or ID

Sources: src/iceberg/type.h:44-362 src/iceberg/type.cc:1-439 src/iceberg/schema.h:49-198 src/iceberg/schema.cc:1-307
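
One way to picture the lookup cache mentioned above is a pair of lazily built indexes over the field list. The sketch below assumes flat (non-nested) fields and is not thread-safe; `MiniField` and `MiniSchema` are hypothetical names chosen for illustration, not the library's classes.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, flattened field description; NOT the iceberg-cpp SchemaField.
struct MiniField {
  int id;
  std::string name;
  std::string type;
};

class MiniSchema {
 public:
  MiniSchema(int schema_id, std::vector<MiniField> fields)
      : schema_id_(schema_id), fields_(std::move(fields)) {}

  int schema_id() const { return schema_id_; }

  // Both lookups return nullptr when the field does not exist.
  const MiniField* FindById(int id) const {
    BuildIndexes();
    auto it = by_id_.find(id);
    return it == by_id_.end() ? nullptr : &fields_[it->second];
  }

  const MiniField* FindByName(const std::string& name) const {
    BuildIndexes();
    auto it = by_name_.find(name);
    return it == by_name_.end() ? nullptr : &fields_[it->second];
  }

 private:
  // Built on first use, then reused. This sketch is NOT thread-safe; a real
  // implementation would need synchronization (for example std::call_once).
  void BuildIndexes() const {
    if (indexed_) return;
    for (std::size_t i = 0; i < fields_.size(); ++i) {
      by_id_[fields_[i].id] = i;
      by_name_[fields_[i].name] = i;
    }
    indexed_ = true;
  }

  int schema_id_;
  std::vector<MiniField> fields_;
  mutable bool indexed_ = false;
  mutable std::unordered_map<int, std::size_t> by_id_;
  mutable std::unordered_map<std::string, std::size_t> by_name_;
};
```

Because the field list never changes after construction, the indexes can be cached without invalidation logic.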

Query Planning and Execution

The query path transforms high-level table scans into executable file-level tasks:

Figure 5: Query Execution Pipeline

Key optimizations:

Partition pruning: Filter files using partition predicates
File pruning: Use column statistics to skip files
Residual predicates: Push remaining filters to readers
Lazy manifest loading: Load manifests only when needed

Sources: src/iceberg/table_scan.h src/iceberg/manifest/manifest_group.h src/iceberg/file_reader.h
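
File pruning with column statistics can be illustrated with a single integer column and a closed range predicate. This is a minimal sketch, assuming per-file min/max bounds are available; `MiniDataFile`, `MiniRangePredicate`, `MightContainMatch`, and `PlanFiles` are hypothetical names and do not correspond to the library's evaluator classes.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-file statistics for one integer column.
struct MiniDataFile {
  std::string path;
  int64_t min_value;  // lower bound of the column in this file
  int64_t max_value;  // upper bound of the column in this file
};

// Represents the predicate: lower <= column <= upper.
struct MiniRangePredicate {
  int64_t lower;
  int64_t upper;
};

// Inclusive evaluation: returns true if the file MIGHT contain a matching
// row. A false result proves that no row can match, so the file is skipped.
inline bool MightContainMatch(const MiniDataFile& file,
                              const MiniRangePredicate& pred) {
  return file.max_value >= pred.lower && file.min_value <= pred.upper;
}

// Keeps only the files that cannot be ruled out by their statistics.
inline std::vector<std::string> PlanFiles(const std::vector<MiniDataFile>& files,
                                          const MiniRangePredicate& pred) {
  std::vector<std::string> selected;
  for (const auto& file : files) {
    if (MightContainMatch(file, pred)) selected.push_back(file.path);
  }
  return selected;
}
```

A file that survives pruning may still contain rows outside the range, which is why the remaining (residual) predicate is still applied by the reader.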

File Format Integration

The library supports multiple file formats through a plugin architecture:

Figure 6: File Format Architecture

Format implementations are registered at startup (src/iceberg/avro/avro_register.cc src/iceberg/parquet/parquet_register.cc) and selected based on file extension or explicit configuration.

Sources: src/iceberg/file_reader.h src/iceberg/file_writer.h src/iceberg/avro/avro_reader.h src/iceberg/parquet/parquet_reader.h
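
The registration step can be pictured as a registry that maps a format key to a factory function. The following sketch selects a reader by file extension; `MiniReader`, `MiniReaderRegistry`, and `RegisterAllMiniFormats` are hypothetical names, and the readers only report their format instead of reading data.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical reader interface; NOT the iceberg-cpp Reader class.
class MiniReader {
 public:
  virtual ~MiniReader() = default;
  virtual std::string format() const = 0;
};

class MiniReaderRegistry {
 public:
  using Factory = std::function<std::unique_ptr<MiniReader>()>;

  static MiniReaderRegistry& Instance() {
    static MiniReaderRegistry registry;
    return registry;
  }

  // A later registration for the same extension replaces the earlier one.
  void Register(const std::string& extension, Factory factory) {
    factories_[extension] = std::move(factory);
  }

  // Picks a factory from the text after the last '.', or returns nullptr.
  std::unique_ptr<MiniReader> OpenByPath(const std::string& path) const {
    const auto dot = path.rfind('.');
    if (dot == std::string::npos) return nullptr;
    auto it = factories_.find(path.substr(dot + 1));
    return it == factories_.end() ? nullptr : it->second();
  }

 private:
  std::map<std::string, Factory> factories_;
};

class MiniAvroReader : public MiniReader {
 public:
  std::string format() const override { return "avro"; }
};

class MiniParquetReader : public MiniReader {
 public:
  std::string format() const override { return "parquet"; }
};

// Called once at startup by the application in this sketch.
inline void RegisterAllMiniFormats() {
  auto& registry = MiniReaderRegistry::Instance();
  registry.Register("avro", [] { return std::make_unique<MiniAvroReader>(); });
  registry.Register("parquet",
                    [] { return std::make_unique<MiniParquetReader>(); });
}
```

Keeping the core registry unaware of concrete formats is what allows format support to live in a separate, optional library.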

Core Concepts

Snapshots and Time Travel

A Snapshot (src/iceberg/snapshot.h) represents an immutable point-in-time view of a table:

Contains the location of the manifest list, which indexes the manifests that track the snapshot's data files
Tracks summary statistics (added/deleted files, row counts)
Is assigned a unique snapshot_id and a sequence_number
Is referenced by branches and tags through SnapshotRef

Time travel is achieved by:

Specifying a snapshot ID or timestamp in TableScan
Loading the corresponding manifest list
Planning against the historical file set

Sources: src/iceberg/snapshot.h src/iceberg/table_scan.h
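
The first time-travel step, resolving a snapshot from an ID or a timestamp, can be sketched as follows. The sketch assumes the snapshot history is available as a simple list; `MiniSnapshot`, `SnapshotById`, and `SnapshotAsOf` are hypothetical names rather than library functions.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, reduced snapshot record; NOT the iceberg-cpp Snapshot struct.
struct MiniSnapshot {
  int64_t snapshot_id;
  int64_t timestamp_ms;       // commit time in milliseconds since the epoch
  std::string manifest_list;  // location of this snapshot's manifest list
};

// Looks up a snapshot by its exact ID.
inline std::optional<MiniSnapshot> SnapshotById(
    const std::vector<MiniSnapshot>& history, int64_t snapshot_id) {
  for (const auto& snapshot : history) {
    if (snapshot.snapshot_id == snapshot_id) return snapshot;
  }
  return std::nullopt;
}

// Returns the most recent snapshot committed at or before `as_of_ms`, or
// nothing if the table did not exist yet at that time.
inline std::optional<MiniSnapshot> SnapshotAsOf(
    const std::vector<MiniSnapshot>& history, int64_t as_of_ms) {
  std::optional<MiniSnapshot> best;
  for (const auto& snapshot : history) {
    if (snapshot.timestamp_ms > as_of_ms) continue;
    if (!best || snapshot.timestamp_ms > best->timestamp_ms) best = snapshot;
  }
  return best;
}
```

Once a snapshot is resolved, its manifest list location is the only input that scan planning needs to reconstruct the historical file set.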

Partition Specifications

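A PartitionSpec (src/iceberg/partition_spec.h) defines how data is organized into partitions. Each PartitionField in the spec applies a Transform to a source column to derive a partition value.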

Partition evolution allows changing specs without rewriting data:

The new spec is assigned the next available spec_id
Old data files retain their original spec_id
Query planning handles multiple specs transparently

Sources: src/iceberg/partition_spec.h src/iceberg/transform.h
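
As a concrete illustration of partition transforms, the sketch below derives partition values from integer source values. The `day` and `truncate` arithmetic follows the definitions in the Iceberg table specification (days since 1970-01-01 for microsecond timestamps, and truncation toward negative infinity for integers); the type and function names are hypothetical, and the hash-based `bucket` transform is deliberately left out.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, reduced model; NOT the iceberg-cpp Transform/PartitionField.
enum class MiniTransform { kIdentity, kDay, kTruncate };

struct MiniPartitionField {
  std::string name;
  MiniTransform transform;
  int64_t width;  // used by kTruncate only; must be positive
};

// Integer division that rounds toward negative infinity, so timestamps
// before the epoch land in the correct (negative) day.
inline int64_t FloorDiv(int64_t a, int64_t b) {
  const int64_t q = a / b;
  return (a % b != 0 && ((a < 0) != (b < 0))) ? q - 1 : q;
}

// Derives the partition value for one source value.
inline int64_t ApplyTransform(const MiniPartitionField& field, int64_t value) {
  constexpr int64_t kMicrosPerDay = 86400LL * 1000000LL;
  switch (field.transform) {
    case MiniTransform::kIdentity:
      return value;
    case MiniTransform::kDay:
      // `value` is a timestamp in microseconds since the epoch.
      return FloorDiv(value, kMicrosPerDay);
    case MiniTransform::kTruncate:
      // Subtract the non-negative remainder, per the Iceberg specification.
      return value - (((value % field.width) + field.width) % field.width);
  }
  return value;
}
```

Because the writer computes these values from the source column, queries can filter on the source column directly and still benefit from partition pruning.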

Manifests and Data Files

The manifest system tracks data file locations and statistics:

Figure 7: Manifest Structure

Each ManifestEntry contains:

File path and format
Partition values
Column-level statistics (min/max/null counts)
File size and row count

Sources: src/iceberg/manifest/manifest_reader.h src/iceberg/manifest/manifest_list.h

Design Principles

Error Handling

The library uses a Result<T> monad (src/iceberg/result.h) for functional error handling: a fallible operation returns either a value of type T or an error.

Error categories are defined in ErrorKind:

kNotFound: Resource doesn't exist
kInvalidArgument: Bad input parameters
kInvalidSchema: Schema validation failure
kCommitFailed: Optimistic concurrency conflict
kValidationFailed: Business rule violation

Sources: src/iceberg/result.h src/iceberg/exception.h
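
The value-or-error style can be sketched with `std::variant`. This is an illustration of the pattern only: `MiniResult`, `MiniError`, `MiniErrorKind`, and the `AndThen` helper are hypothetical, and the library's actual `Result<T>` definition may differ.

```cpp
#include <string>
#include <utility>
#include <variant>

// Hypothetical error model; NOT the iceberg-cpp ErrorKind/Result definitions.
enum class MiniErrorKind { kNotFound, kInvalidArgument };

struct MiniError {
  MiniErrorKind kind;
  std::string message;
};

template <typename T>
class MiniResult {
 public:
  MiniResult(T value) : state_(std::move(value)) {}          // success
  MiniResult(MiniError error) : state_(std::move(error)) {}  // failure

  bool ok() const { return std::holds_alternative<T>(state_); }
  const T& value() const { return std::get<T>(state_); }
  const MiniError& error() const { return std::get<MiniError>(state_); }

  // Monadic chaining: run `next` only on success, otherwise forward the
  // error unchanged. `next` must itself return a MiniResult.
  template <typename F>
  auto AndThen(F&& next) const -> decltype(next(std::declval<const T&>())) {
    if (!ok()) return error();
    return next(value());
  }

 private:
  std::variant<T, MiniError> state_;
};

// Parses a non-negative decimal integer (overflow is not checked here).
inline MiniResult<int> ParsePositive(const std::string& text) {
  if (text.empty()) {
    return MiniError{MiniErrorKind::kInvalidArgument, "empty input"};
  }
  int value = 0;
  for (char c : text) {
    if (c < '0' || c > '9') {
      return MiniError{MiniErrorKind::kInvalidArgument, "not a number: " + text};
    }
    value = value * 10 + (c - '0');
  }
  return value;
}

inline MiniResult<int> Doubled(int value) { return value * 2; }
```

Chaining calls such as `ParsePositive(text).AndThen(Doubled)` keeps the error path explicit in the return type without nested `if` checks at every step.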

Immutability and Builder Pattern

Core data structures are immutable once constructed:

TableMetadata: Modified through TableMetadataBuilder
Schema: Created through factory methods or builder
Snapshot: Never modified after creation

This design enables:

Safe concurrent reads
Clear change tracking
Reliable rollback on errors

Sources: src/iceberg/table_metadata.h:220-488 src/iceberg/schema.h:43-68

Arrow C ABI Integration

The library uses Apache Arrow's C Data Interface for zero-copy data exchange:

ArrowArrayStream: Streaming data interface
ArrowSchema: Schema description
ArrowArray: Columnar batch data

This enables interoperability with Arrow, DuckDB, Polars, and other systems without serialization overhead.

Sources: src/iceberg/arrow_c_data.h src/iceberg/file_reader.h
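
The consumer side of an ArrowArrayStream follows a fixed protocol: call `get_next` until it returns an array whose `release` callback is null, release every batch received, then release the stream. The sketch below copies the three struct layouts from the Arrow C data and C stream interface specifications; `CountRows`, `MakeNullStream`, and `MiniStreamState` are hypothetical helpers, and the toy producer emits arrays of the Arrow null type (format `"n"`), which carry no buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
#endif
#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
#endif

// Consumer: drains and releases the stream. Returns total rows, -1 on error.
inline int64_t CountRows(ArrowArrayStream* stream) {
  int64_t total = 0;
  while (true) {
    ArrowArray batch{};
    if (stream->get_next(stream, &batch) != 0) { total = -1; break; }
    if (batch.release == nullptr) break;  // end of stream
    total += batch.length;
    batch.release(&batch);  // the consumer owns every batch it receives
  }
  stream->release(stream);
  return total;
}

// Toy producer (hypothetical): one null-typed batch per requested length.
struct MiniStreamState {
  std::vector<int64_t> lengths;
  std::size_t next = 0;
};

inline ArrowArrayStream MakeNullStream(std::vector<int64_t> lengths) {
  ArrowArrayStream s{};
  s.get_schema = [](ArrowArrayStream*, ArrowSchema* out) -> int {
    *out = ArrowSchema{};
    out->format = "n";  // Arrow format string for the null type
    out->name = "";
    out->release = [](ArrowSchema* self) { self->release = nullptr; };
    return 0;
  };
  s.get_next = [](ArrowArrayStream* self, ArrowArray* out) -> int {
    auto* state = static_cast<MiniStreamState*>(self->private_data);
    *out = ArrowArray{};  // a null release callback signals end of stream
    if (state->next < state->lengths.size()) {
      out->length = state->lengths[state->next++];
      out->null_count = out->length;  // every value of the null type is null
      out->release = [](ArrowArray* a) { a->release = nullptr; };
    }
    return 0;
  };
  s.get_last_error = [](ArrowArrayStream*) -> const char* { return nullptr; };
  s.release = [](ArrowArrayStream* self) {
    delete static_cast<MiniStreamState*>(self->private_data);
    self->release = nullptr;
  };
  s.private_data = new MiniStreamState{std::move(lengths)};
  return s;
}
```

Only plain C structs and function pointers cross the boundary, which is why a producer and a consumer built against different libraries can exchange batches without copying.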

Build System

The library supports both the CMake and Meson build systems and is split into modular libraries:

| Library | Contents | Dependencies |
| --- | --- | --- |
| `libiceberg` | Core metadata and planning | nanoarrow, nlohmann-json, CRoaring, zlib |
| `libiceberg_bundle` | File format support (optional) | Arrow, Avro, Parquet, Zstd |
| `libiceberg_rest` | REST catalog (optional) | cpr, libcurl |

Build flags control which features are included:

ICEBERG_BUILD_BUNDLE: Enable Avro/Parquet support
ICEBERG_BUILD_REST: Enable REST catalog client
ICEBERG_BUILD_TESTS: Build test suite

Sources: src/iceberg/CMakeLists.txt:1-250 src/iceberg/meson.build:1-230 cmake_modules/IcebergThirdpartyToolchain.cmake

Next Steps

For detailed information on specific aspects of the library:

Installation and setup: Building from Source and Dependency Management
Using the library: Quick Start Examples and Table Operations
Understanding internals: Core Concepts and Library Architecture
Advanced features: Query Planning and File Format Support
Contributing: Development and Code Quality Standards