Data Lakehouse

Unify your data for analytics, machine learning, and AI on a single scalable, open foundation. Without the complexity of a platform team.

Testimonial from Carlos, Director of Data Engineering @Moffin.
01

What is a Data Lakehouse?

A data lakehouse brings the scale of a data lake together with the reliability of a warehouse.
With open formats and cloud object storage, it’s the most cost-effective way to move beyond your operational database and start building real analytics and AI.

Storage

Raw files (Parquet, JSON, CSV) live directly in object storage.

Data

Table formats like Apache Iceberg add transactions, schema evolution, and time travel.

Compute

Engines and frameworks stay separate from storage, so you can scale and adapt easily.
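
To make the separation of compute and storage concrete, here is a minimal, hypothetical sketch: a single-node engine (DuckDB) queries Parquet files directly in object storage, with no warehouse in between. The bucket and path are placeholders, and reading from S3 requires DuckDB's httpfs extension plus credentials in your environment.

import duckdb

# Compute and storage stay decoupled: DuckDB reads Parquet files
# straight out of object storage (bucket and path are placeholders).
con = duckdb.connect()
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")  # S3 support; credentials come from the environment

top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_parquet('s3://my-bucket/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""").df()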

02

Why do I need a Lakehouse?

If your company has grown beyond what Postgres can handle for analytics, it’s time to unify your data. With a lakehouse, you can bring together data from your CRM, payments, and application databases, then join, standardize, and transform it for analytics and AI, all without jumping straight into an enterprise-scale stack.

Benefits of the data lakehouse

Unified and future-proof

Consolidate scattered databases into one system for analytics and AI on open formats in object storage.

Built to grow

Start small and scale to larger data volumes as your needs evolve, without getting locked into a vendor.

AI-ready

Combine SQL for queries and joins with Python for custom logic, feature engineering, and agent-driven workflows.

03

Bauplan is the simplest way to build a data lakehouse in this region of the multiverse

No infrastructure fragmentation

Bauplan unifies Iceberg tables, branching, and execution in one system. Spin up branches for medallion layers, run DAGs in isolation, and merge back seamlessly. No catalog wrangling, no custom glue code.

Pure Python, no heavyweight stack

Use pure Python and SQL. No Spark, no JVM, no cluster overhead. Bauplan handles execution with Arrow under the hood, so you can focus on transformations and logic.

Git-for-data

Bauplan brings Git-style workflows to your data. Branch, commit, and merge tables and pipelines with the same ergonomics developers already know. Every change is versioned, isolated, and reversible, so you can experiment safely and deploy with confidence.
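
As a rough sketch of what that workflow looks like with the Python SDK (branch and table names are placeholders, and the merge call is an assumption modeled on the branching API shown further down this page; check the SDK reference for exact signatures):

import bauplan

client = bauplan.Client()

# Create an isolated branch off main to experiment safely
client.create_branch(branch='my_feature_branch', from_ref='main')

# ... import or transform data on the branch, run checks ...

# Merge the branch back into main once everything passes
# (method name assumed; see the Bauplan SDK docs)
client.merge_branch(source_ref='my_feature_branch', into_branch='main')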

It’s just code

No proprietary DSLs, no siloed UIs. Your Bauplan project is just a Python repo: data infrastructure is declared directly in code, so it is reproducible, testable, and automation-friendly. No Dockerfiles, no Terraform scripts, no divergence between dev and prod.

You can’t scale complexity

Current Lakehouse platforms split workflows across different runtimes, interfaces and abstractions.
Bauplan unifies them into one.
A typical lakehouse stack, stage by stage:

Explore
What you do: Ad hoc queries, dashboards
Runtimes you manage: Warehouse SQL engine, BI server, semantic layers, JDBC/ODBC gateways
Interfaces you use: BI UI, SQL editor, JDBC/ODBC drivers, Excel connectors
Abstractions you juggle: Tables, views, materialized views, metrics layer, UDFs

Build
What you do: Develop pipelines, train and test models
Runtimes you manage: Python envs and package managers, Docker, single-node Spark or Ray or Dask, object storage, Hive/Glue Metastore
Interfaces you use: Notebooks, IDE, SDKs and CLIs, Docker Compose
Abstractions you juggle: DataFrames, DAGs, models, notebooks, schemas, feature tables

Run
What you do: Schedule, scale, and monitor jobs
Runtimes you manage: Orchestrators (Airflow/Prefect), Kubernetes, Spark or Ray clusters, object storage, Kafka/Kinesis, secrets manager, monitoring stack
Interfaces you use: Airflow UI and YAML, Spark Submit, kubectl, Terraform or Helm, CI/CD UI, VS Code
Abstractions you juggle: Tasks, schedules, triggers, retries, resource configs, deployments, run IDs, SLAs, backfills

With Bauplan, Explore, Build, and Run share one stack:

What you do: Explore data and build dashboards; build data pipelines, train and test models; run pipelines and models reliably, at scale, on a schedule
Infrastructure: Functions as a service
Developer interfaces: SQL editor and IDE
Abstractions: Functions, Tables, Git
04

Build your Lakehouse in one day

Flow diagram showing data moving from Postgres through CDC tools into object storage, managed by Bauplan, and then powering BI dashboards and AI/ML applications.

Put data in object storage (CDC)

Capture change data from your source systems directly into S3, GCS, or Azure Blob. Learn More

Data pipeline from Postgres to Bauplan through CDC tools like Kafka, Airbyte, Fivetran, and dltHub.
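
In the simplest case, landing change data is just writing batches of records as Parquet files into a bucket; in production, a CDC tool like those in the diagram above (Kafka, Airbyte, Fivetran, dltHub) does this continuously. A toy, hypothetical sketch with placeholder bucket and key names:

import io
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of change records (illustrative schema)
changes = pa.table({
    "id": [1, 2, 3],
    "op": ["insert", "update", "delete"],
    "updated_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

# Serialize to Parquet in memory and upload to object storage
buf = io.BytesIO()
pq.write_table(changes, buf)

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-bucket",                  # placeholder bucket
    Key="cdc/orders/batch_001.parquet",  # placeholder key
    Body=buf.getvalue(),
)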
import bauplan


client = bauplan.Client()

import_branch = 'your_import_branch'   
new_table = 'your_table'               
s3_uri = 's3://<bucket-name>/<object-key>'

# 1. Create a new data branch off "main" to isolate the import
client.create_branch(branch=import_branch, from_ref="main")

try:
    # 2. Create a new Iceberg table in the import branch from the S3 source
    client.create_table(table=new_table, search_uri=s3_uri, branch=import_branch)

    # 3. Import the data into the newly created table
    client.import_data(table=new_table, search_uri=s3_uri, branch=import_branch)
    print(f"✅ Data imported into '{new_table}'.")

except bauplan.exceptions.BauplanError as e:
    # 4. Surface Bauplan-specific errors with a clear message
    raise Exception(f"🔴 The import did not work correctly: {e}") from e

Import data

Ingest into Iceberg tables with Write-Audit-Publish semantics: isolate in a branch, run validations, then merge. Simple, robust, and safe.
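
Continuing the import snippet above, the audit and publish steps can be sketched like this. The validation query is illustrative (it assumes an id column), and the merge call is an assumption modeled on the branching API; check the SDK reference for the exact method.

# Audit: validate the data on the isolated import branch
checks = client.query(
    query=f"SELECT COUNT(*) AS null_ids FROM {new_table} WHERE id IS NULL",
    ref=import_branch,
)
assert checks.to_pylist()[0]["null_ids"] == 0, "Found NULL ids, refusing to publish"

# Publish: merge the audited branch back into main
# (method name assumed; see the Bauplan SDK docs)
client.merge_branch(source_ref=import_branch, into_branch="main")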

Transform data

Write Python functions and SQL queries. Use Polars, Pandas, or DuckDB for fast, expressive transformations. No Spark, no DSL, no Docker, no Kubernetes.

@bauplan.model()
@bauplan.python('3.11', pip={'polars': '1.33.1'})
def silver_table(data=bauplan.Model('bronze_table')):
    import polars as pl
    from datetime import datetime, timezone

    # Convert the input Arrow table to Polars and keep only rows after the cutoff
    df = data.to_polars()
    time_filter_utc = datetime(2022, 1, 1, tzinfo=timezone.utc)
    df = df.filter(pl.col("timestamp") >= time_filter_utc)

    return df.to_arrow()
import bauplan


client = bauplan.Client()

# Query the table and return the result set as an Arrow Table
my_table = client.query(
    query="SELECT avg(age) AS average_age FROM titanic_dataset",
    ref="main"
)
# Efficiently cast the table to other formats, e.g. a pandas DataFrame
df = my_table.to_pandas()

Query data

Run both synchronous queries and asynchronous jobs on Bauplan’s runtime. Develop interactively, then scale up without changing your code.

Integrate

Run pipelines in Bauplan, then expose curated Iceberg tables to warehouses, lakehouses, and SQL engines; connect tables directly to BI tools and query them with Bauplan; or use our PySDK to work in your favorite notebook platform.

Diagram of Bauplan architecture showing Runtime, Git-for-Data, and Iceberg Tables on top of a lakehouse with object storage, connecting to BI tools (Looker, Metabase, Superset), data apps (Jupyter, Streamlit, Hex, Marimo), and warehouses (Snowflake, Databricks, Amazon Redshift).
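
Because the curated tables are plain Iceberg in your object storage, other engines can read them without copies. A hypothetical sketch using DuckDB's iceberg extension; the table path is a placeholder, and production setups typically go through a catalog rather than a raw path.

import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg;")
con.sql("LOAD iceberg;")  # reading from S3 also needs httpfs and credentials

# Read a curated Iceberg table directly from object storage
df = con.sql(
    "SELECT * FROM iceberg_scan('s3://my-bucket/warehouse/curated_orders')"
).df()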
05

Integrations

Prefect
Orchestra
DBOS
Apache Airflow
06

Built with Bauplan

Prefect
Pandas
Iceberg

Iceberg Lakehouse and WAP

An orchestrated Write-Audit-Publish pattern for ingesting Parquet files into Iceberg tables.

Chris White
CTO @Prefect
RAG
Pinecone
OpenAI

RAG system with Pinecone and OpenAI

Build a RAG system with Pinecone and OpenAI over StackOverflow data.

Ciro
CEO @bauplan
PyArrow
Pandas
DuckDB

Data Quality and Expectations

Implement data quality checks using expectations.

Jacopo
CTO @bauplan
PDF
OpenAI
Pandas

PDF analysis with OpenAI

Analyze PDFs using Bauplan for data preparation and OpenAI’s GPT for text analysis.

Patrick Chia
Founding Eng
DuckDB
Prefect
Streamlit

Near Real-time Analytics

Build a near real-time analytics pipeline with the WAP pattern and visualize metrics with Streamlit.

Sam Jafari
Dir. Data and AI
dbt
CI/CD
marts

dbt-style Pipelines with CI/CD and Version Control

dbt workflows vs. Bauplan pipelines, with branching, testing, and CI/CD.

Yuki Kakegawa
Staff Data Eng