Medallion Architecture

Build your medallion architecture in pure Python. No vendor lock-in, no infrastructure management.

Quote from Davide, Senior Data Scientist at Mediaset, explaining how Bauplan’s medallion architecture eliminated outages during traffic spikes and improved system efficiency.
01

What is the Medallion Architecture?

The medallion architecture is the industry-standard pattern for organizing your data in a reliable and scalable way by moving it through three successive layers of validation and transformation in the lakehouse.

Bronze

Raw, immutable data: your source of truth for audits and reprocessing.

Silver

Cleaned, flattened, validated, and standardized data, used to build downstream pipelines.

Gold

Business-ready metrics, aggregates, and ML features optimized for consumption.
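
To make the flow concrete, here is a minimal, self-contained sketch of the three layers expressed as DuckDB SQL inside Python (the table names and data are made up for illustration):

import duckdb

con = duckdb.connect()

# Bronze: raw, immutable landing data, kept exactly as it arrived
con.execute("""
    CREATE TABLE bronze_events AS
    SELECT * FROM (VALUES
        (1, 'click', '2022-03-01'),
        (2, 'view',  NULL)
    ) AS t(user_id, event, event_date)
""")

# Silver: cleaned and validated, with types standardized
con.execute("""
    CREATE TABLE silver_events AS
    SELECT user_id, event, CAST(event_date AS DATE) AS event_date
    FROM bronze_events
    WHERE event_date IS NOT NULL
""")

# Gold: a business-ready aggregate, optimized for consumption
print(con.execute(
    "SELECT event, COUNT(*) AS n FROM silver_events GROUP BY event"
).fetchall())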

This separation of layers delivers clear benefits for you and your team:

Quality: Data quality issues are caught early, before they cascade.

Performance: Heavy transformations run once in Silver, so Gold only serves lightweight, targeted aggregations.

Governance: Complete audit trail with documented transformation rules.

Team velocity: Engineers own Bronze/Silver, analysts and business users work directly with Gold.

Medallion architecture diagram
02

Why Teams Choose Bauplan for their Medallion Architecture

Pure Python, No Heavyweight Stack

Build the entire medallion flow in pure Python and SQL using Pandas, Polars and DuckDB. No Spark, no JVM, no cluster overhead. Bauplan handles execution with Arrow under the hood, so you focus purely on transformations and logic.
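
As a concrete sketch, here is what a single Bauplan model step can look like: a plain Python function that receives the upstream table and returns an Arrow table (the table and column names here are illustrative):

import bauplan


@bauplan.model()
@bauplan.python("3.11", pip={"pandas": "2.2.0"})
def clean_orders(data=bauplan.Model("raw_orders")):
    # Runs as plain Python: no Spark session, no JVM, no cluster to manage
    import pyarrow as pa

    # The input arrives as an Arrow-backed table; work on it with pandas
    df = data.to_pandas()
    # Drop incomplete rows and deduplicate on the business key
    df = df.dropna(subset=["order_id"]).drop_duplicates(subset=["order_id"])
    return pa.Table.from_pandas(df)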

Data Quality Built In

Define data quality tests as code right next to your pipeline steps. Checks like null detection and uniqueness run directly on in-memory tables, failing fast if issues appear. No extra pipelines, no waiting to validate data, no extra compute wasted.
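
As a sketch, a uniqueness check can be written directly against the in-memory table, in the same style as the null check shown later on this page (the table and column names are illustrative, and the input is assumed to behave like a pyarrow Table):

import bauplan


@bauplan.expectation()
@bauplan.python("3.11")
def test_unique_order_ids(data=bauplan.Model("silver_orders")):
    import pyarrow.compute as pc

    # The column is unique iff its distinct count equals the row count
    is_passed = len(pc.unique(data.column("order_id"))) == data.num_rows
    assert is_passed, "❌ duplicate order_id values found"
    return is_passed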

Integrated Versioning and Orchestration

Bauplan unifies your Iceberg lakehouse, data branching, and execution in one simple system. Spin up branches for medallion layers, run DAGs in isolation, and merge back seamlessly: no catalog wrangling, no custom glue code.
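
In practice, the whole branch-run-merge loop is a few client calls (the branch and project names below are illustrative):

import bauplan

client = bauplan.Client()

# Work in isolation: branch off main, run the DAG there, merge back on success
client.create_branch(branch="dev.silver_refactor", from_ref="main")
run_state = client.run(project_dir="./my_pipeline", ref="dev.silver_refactor")
if run_state.job_status.lower() != "failed":
    client.merge_branch(source_ref="dev.silver_refactor", into_branch="main")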

Namespace Separation for Medallion Layers

Bronze, Silver, and Gold datasets live in distinct namespaces, making lineage and promotion clear. Combined with branch-test-merge, this ensures production remains stable and reproducible as data progresses through each stage.
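
Because every table is addressable by a fully qualified, namespaced name, promotion and lineage stay explicit. A hedged query sketch (the table name is illustrative):

import bauplan

client = bauplan.Client()

# Read a Gold table by its namespaced name; results come back as an Arrow table
table = client.query("SELECT * FROM gold.daily_metrics LIMIT 10", ref="main")
print(table.to_pandas())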

03

How Bauplan Makes Medallion Simple

Import data into the bronze layer

Raw data lands in the Bronze layer. Bauplan handles ingestion into Iceberg tables on isolated branches, so you can safely load from S3 without touching production.

import bauplan


def import_data_in_bronze_layer(table_name, import_branch, source_s3):
    client = bauplan.Client()
    # Assumes `import_branch` already exists, e.g. created with
    # client.create_branch(branch=import_branch, from_ref="main")
    try:
        # 1) Create the empty Iceberg table, then import the S3 files into it
        client.create_table(
            table=table_name, 
            search_uri=source_s3, 
            branch=import_branch, 
            namespace="bronze"
        )
        client.import_data(
            table=table_name, 
            search_uri=source_s3, 
            branch=import_branch, 
            namespace="bronze"
        )
    except bauplan.exceptions.BauplanError as e:
        raise Exception(f"🔴 Bronze import failed: {e}")
    
    # 2) Merge into main
    if not client.merge_branch(source_ref=import_branch, into_branch="main"):
        raise Exception("🔴 Merge failed.")
    print(f"✅ Bronze '{import_branch}' merged into main.")

Transform and check for data quality

Silver tables are created with Python models and SQL. Transformations live alongside expectations, so data quality checks run as part of the pipeline. Fail fast, fix early.

import bauplan
from bauplan.standard_expectations import expect_column_no_nulls

@bauplan.model()
@bauplan.python('3.11', pip={'polars': '1.33.1'})
def silver_table(data=bauplan.Model('bronze_table')):
    import polars as pl
    
    from datetime import datetime, timezone

    # Convert input data into Polars and keep only events from 2022 onwards
    # (a plain timezone-aware datetime compares cleanly against a Datetime column)
    df = data.to_polars()
    time_filter_utc = datetime(2022, 1, 1, tzinfo=timezone.utc)
    df = df.filter(pl.col("timestamp") >= time_filter_utc)
   
    return df.to_arrow()


@bauplan.expectation()
@bauplan.python("3.11")
def test_nulls(data=bauplan.Model('bronze_table')):
    col_to_check = "timestamp"    
    # Run expectation: Check for nulls and fail if nulls are found
    is_passed = expect_column_no_nulls(data, col_to_check)
    assert is_passed, f"❌ nulls found in {col_to_check}"
    
    return is_passed

Validate tables and promote to Gold

Validated data moves into the Gold layer. Aggregations, joins, and business-ready tables are promoted only after tests succeed, keeping production consistent and reliable.

import bauplan


def promote_to_gold(
    client: bauplan.Client,
    pipeline_dir: str,
    branch: str,
):
    # 1) Run the pipeline on a separate branch
    run_state = client.run(
        project_dir=pipeline_dir, 
        ref=branch, 
        namespace="gold"
    )
    # 2) Check for failures
    if run_state.job_status.lower() == "failed":
        raise Exception(f"{run_state.job_id} failed: {run_state.job_status}")
        
    # 3) Merge the silver branch into main (publish to gold)
    assert client.merge_branch(source_ref=branch, into_branch="main"), (
        "Something went wrong while merging into main."
    )
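
Putting it together, a hedged usage sketch of the function above (the directory and branch names are illustrative):

import bauplan

client = bauplan.Client()
promote_to_gold(
    client=client,
    pipeline_dir="./gold_pipeline",
    branch="dev.gold_release",
)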
04

Integrations

Prefect
Orchestra
DBOS
Apache Airflow
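
For example, a hedged sketch of driving a Bauplan run from a Prefect flow (the project and branch names are illustrative):

import bauplan
from prefect import flow, task


@task
def run_medallion(branch: str) -> str:
    # Run the Bauplan DAG on an isolated branch and report its status
    client = bauplan.Client()
    run_state = client.run(project_dir="./medallion_pipeline", ref=branch)
    return run_state.job_status


@flow
def nightly_refresh():
    status = run_medallion("dev.nightly_refresh")
    if status.lower() == "failed":
        raise RuntimeError("Bauplan run failed")


if __name__ == "__main__":
    nightly_refresh()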
05

Case Study: Mediaset

Challenge: Europe's second-largest broadcaster serves 65M daily viewers. Breaking news events overloaded their Spark-based medallion architecture, causing dashboard failures during critical moments.
Solution: Rebuilt the medallion architecture with Bauplan:

Bronze

Raw event ingestion from 50+ sources

Silver

Standardized events with business validation

Gold

Real-time metrics and aggregations

Results:

Dashboard response: 45 seconds → 2 seconds

New data source integration: 2 weeks → 2 hours

Infrastructure cost reduction: 85%

System stability during peak events: Achieved

06

Built with Bauplan

Iceberg Lakehouse and WAP (Prefect, Pandas, Iceberg)
Orchestrated Write-Audit-Publish pattern for ingesting parquet files into Iceberg tables.
Chris White, CTO @Prefect

RAG system with Pinecone and OpenAI (RAG, Pinecone, OpenAI)
Build a RAG system with Pinecone and OpenAI over StackOverflow data.
Ciro, CEO @bauplan

Data Quality and Expectations (PyArrow, Pandas, DuckDB)
Implement data quality checks using expectations.
Jacopo, CTO @bauplan

PDF analysis with OpenAI (PDF, OpenAI, Pandas)
Analyze PDFs using Bauplan for data preparation and OpenAI's GPT for text analysis.
Patrick Chia, Founding Eng

Near Real-time Analytics (DuckDB, Prefect, Streamlit)
Build a near real-time analytics pipeline with the WAP pattern and visualize metrics with Streamlit.
Sam Jafari, Dir. Data and AI

dbt-style Pipelines with CI/CD and Version Control (dbt, CI/CD, marts)
dbt workflows vs. Bauplan pipelines with branching, testing, and CI/CD.
Yuki Kakegawa, Staff Data Eng