Data Lakehouse

Unify your data for analytics, machine learning, and AI on a single scalable, open foundation. Without the complexity of a platform team.

Testimonial from Carlos, Director of Data Engineering @Moffin.
01

What is a Data Lakehouse?

A data lakehouse brings the scale of a data lake together with the reliability of a warehouse.
With open formats and cloud object storage, it’s the most cost-effective way to move beyond your operational database and start building real analytics and AI.

Storage

Raw files (Parquet, JSON, CSV) live directly in object storage.

Data

Table formats like Apache Iceberg add transactions, schema evolution, and time travel.

Compute

Engines and frameworks stay separate from storage, so you can scale and adapt easily.
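
To make the separation of compute and storage concrete, here is a minimal, hypothetical sketch: a single-node engine (DuckDB) queries Parquet files directly in object storage, with no warehouse in between. The bucket and path are placeholders, and reading from S3 requires DuckDB's httpfs extension plus credentials in your environment.

import duckdb

# Compute and storage stay decoupled: DuckDB reads Parquet files
# straight out of object storage (bucket and path are placeholders).
con = duckdb.connect()
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")  # S3 support; credentials come from the environment

top_customers = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_parquet('s3://my-bucket/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""").df()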

02

Why do I need a Lakehouse?

If your company has grown beyond what Postgres can handle for analytics, it’s time to unify your data. With a lakehouse, you can bring together data from your CRM, payments, and application databases, then join, standardize, and transform it for analytics and AI, all without jumping straight into an enterprise-scale stack.

Benefits of the data lakehouse

Unified and future-proof

Consolidate scattered databases into one system for analytics and AI on open formats in object storage.

Built to grow

Start small and scale to larger data volumes as your needs evolve, without getting locked into a vendor.

AI-ready

Combine SQL for queries and joins with Python for custom logic, feature engineering, and agent-driven workflows.

03

Bauplan is the simplest way to build a data lakehouse in this region of the multiverse

No infrastructure fragmentation

Bauplan unifies Iceberg tables, branching, and execution in one system. Spin up branches for medallion layers, run DAGs in isolation, and merge back seamlessly. No catalog wrangling, no custom glue code.

Pure Python, no heavyweight stack

Use pure Python and SQL. No Spark, no JVM, no cluster overhead. Bauplan handles execution with Arrow under the hood, so you can focus on transformations and logic.

Git-for-data

Bauplan brings Git-style workflows to your data. Branch, commit, and merge tables and pipelines with the same ergonomics developers already know. Every change is versioned, isolated, and reversible, so you can experiment safely and deploy with confidence.
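
As a rough sketch of what that workflow looks like with the Python SDK (branch and table names are placeholders, and the merge call is an assumption modeled on the branching API shown further down this page; check the SDK reference for exact signatures):

import bauplan

client = bauplan.Client()

# Create an isolated branch off main to experiment safely
client.create_branch(branch='my_feature_branch', from_ref='main')

# ... import or transform data on the branch, run checks ...

# Merge the branch back into main once everything passes
# (method name assumed; see the Bauplan SDK docs)
client.merge_branch(source_ref='my_feature_branch', into_branch='main')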

It’s just code

No proprietary DSLs, no siloed UIs. Your Bauplan project is just a Python repo: data infrastructure is declared directly in code, so it is reproducible, testable, and automation-friendly. No Dockerfiles, no Terraform scripts, no divergence between dev and prod.

You can’t scale complexity

Current Lakehouse platforms split workflows across different runtimes, interfaces and abstractions.
Bauplan unifies them into one.
A typical lakehouse stack, stage by stage:

Explore
What you do: Ad hoc queries, dashboards
Runtimes you manage: Warehouse SQL engine, BI server, semantic layers, JDBC/ODBC gateways
Interfaces you use: BI UI, SQL editor, JDBC/ODBC drivers, Excel connectors
Abstractions you juggle: Tables, views, materialized views, metrics layer, UDFs

Build
What you do: Develop pipelines, train and test models
Runtimes you manage: Python envs and package managers, Docker, single-node Spark or Ray or Dask, object storage, Hive/Glue Metastore
Interfaces you use: Notebooks, IDE, SDKs and CLIs, Docker Compose
Abstractions you juggle: DataFrames, DAGs, models, notebooks, schemas, feature tables

Run
What you do: Schedule, scale, and monitor jobs
Runtimes you manage: Orchestrators (Airflow/Prefect), Kubernetes, Spark or Ray clusters, object storage, Kafka/Kinesis, secrets manager, monitoring stack
Interfaces you use: Airflow UI and YAML, Spark Submit, kubectl, Terraform or Helm, CI/CD UI, VS Code
Abstractions you juggle: Tasks, schedules, triggers, retries, resource configs, deployments, run IDs, SLAs, backfills

With Bauplan, Explore, Build, and Run share one stack:

What you do: Explore data and build dashboards; build data pipelines, train and test models; run pipelines and models reliably, at scale, on a schedule
Infrastructure: Functions as a service
Developer interfaces: SQL editor and IDE
Abstractions: Functions, Tables, Git
04

Build your Lakehouse in one day

Flow diagram showing data moving from Postgres through CDC tools into object storage, managed by Bauplan, and then powering BI dashboards and AI/ML applications.

Put data in object storage (CDC)

Capture change data from your source systems directly into S3, GCS, or Azure Blob. Learn More

Data pipeline from Postgres to Bauplan through CDC tools like Kafka, Airbyte, Fivetran, and dltHub.
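
In the simplest case, landing change data is just writing batches of records as Parquet files into a bucket; in production, a CDC tool like those in the diagram above (Kafka, Airbyte, Fivetran, dltHub) does this continuously. A toy, hypothetical sketch with placeholder bucket and key names:

import io
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of change records (illustrative schema)
changes = pa.table({
    "id": [1, 2, 3],
    "op": ["insert", "update", "delete"],
    "updated_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

# Serialize to Parquet in memory and upload to object storage
buf = io.BytesIO()
pq.write_table(changes, buf)

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-bucket",                  # placeholder bucket
    Key="cdc/orders/batch_001.parquet",  # placeholder key
    Body=buf.getvalue(),
)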
import bauplan


client = bauplan.Client()

import_branch = 'your_import_branch'   
new_table = 'your_table'               
s3_uri = 's3://<bucket-name>/<object-key>'

# 1. Create a new data branch off "main" to isolate the import
client.create_branch(branch=import_branch, from_ref="main")

try:
    # 2. Create a new Iceberg table in the import branch from the S3 source
    client.create_table(table=new_table, search_uri=s3_uri, branch=import_branch)

    # 3. Import the data into the newly created table
    client.import_data(table=new_table, search_uri=s3_uri, branch=import_branch)
    print(f"✅ Data imported into '{new_table}'.")

except bauplan.exceptions.BauplanError as e:
    # 4. Surface Bauplan-specific errors with a clear message
    raise Exception(f"🔴 The import did not work correctly: {e}") from e

Import data

Ingest into Iceberg tables with Write-Audit-Publish semantics: isolate in a branch, run validations, then merge. Simple, robust, and safe.
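
Continuing the import snippet above, the audit and publish steps can be sketched like this. The validation query is illustrative (it assumes an id column), and the merge call is an assumption modeled on the branching API; check the SDK reference for the exact method.

# Audit: validate the data on the isolated import branch
checks = client.query(
    query=f"SELECT COUNT(*) AS null_ids FROM {new_table} WHERE id IS NULL",
    ref=import_branch,
)
assert checks.to_pylist()[0]["null_ids"] == 0, "Found NULL ids, refusing to publish"

# Publish: merge the audited branch back into main
# (method name assumed; see the Bauplan SDK docs)
client.merge_branch(source_ref=import_branch, into_branch="main")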

Transform data

Write Python functions and SQL queries. Use Polars, Pandas, or DuckDB for fast, expressive transformations. No Spark, no DSL, no Docker, no Kubernetes.

@bauplan.model()
@bauplan.python('3.11', pip={'polars': '1.33.1'})
def silver_table(data=bauplan.Model('bronze_table')):
    import polars as pl
    from datetime import datetime, timezone

    # Convert the input Arrow table to Polars and keep only rows after the cutoff
    df = data.to_polars()
    time_filter_utc = datetime(2022, 1, 1, tzinfo=timezone.utc)
    df = df.filter(pl.col("timestamp") >= time_filter_utc)

    return df.to_arrow()
import bauplan


client = bauplan.Client()

# Query the table and return the result set as an Arrow Table
my_table = client.query(
    query="SELECT avg(age) AS average_age FROM titanic_dataset",
    ref="main"
)
# Efficiently cast the table to other formats, e.g. a pandas DataFrame
df = my_table.to_pandas()

Query data

Run both synchronous queries and asynchronous jobs on Bauplan’s runtime. Develop interactively, then scale up without changing your code.

Integrate

Run pipelines in Bauplan, then expose curated Iceberg tables to warehouses, lakehouses, and SQL engines; connect tables directly to BI tools and query them with Bauplan; or use our PySDK to work in your favorite notebook platform.

Diagram of Bauplan architecture showing Runtime, Git-for-Data, and Iceberg Tables on top of a lakehouse with object storage, connecting to BI tools (Looker, Metabase, Superset), data apps (Jupyter, Streamlit, Hex, Marimo), and warehouses (Snowflake, Databricks, Amazon Redshift).
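
Because the curated tables are plain Iceberg in your object storage, other engines can read them without copies. A hypothetical sketch using DuckDB's iceberg extension; the table path is a placeholder, and production setups typically go through a catalog rather than a raw path.

import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg;")
con.sql("LOAD iceberg;")  # reading from S3 also needs httpfs and credentials

# Read a curated Iceberg table directly from object storage
df = con.sql(
    "SELECT * FROM iceberg_scan('s3://my-bucket/warehouse/curated_orders')"
).df()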
05

Integrations

Prefect
Orchestra
DBOS
Apache Airflow
06

Built with Bauplan

Prefect
Pandas
Iceberg

Iceberg Lakehouse and WAP

An orchestrated Write-Audit-Publish pattern for ingesting Parquet files into Iceberg tables.

Chris White
CTO @Prefect
RAG
Pinecone
OpenAI

RAG system with Pinecone and OpenAI

Build a RAG system with Pinecone and OpenAI over StackOverflow data.

Ciro
CEO @bauplan
PyArrow
Pandas
DuckDB

Data Quality and Expectations

Implement data quality checks using expectations.

Jacopo
CTO @bauplan
PDF
OpenAI
Pandas

PDF analysis with OpenAI

Analyze PDFs using Bauplan for data preparation and OpenAI’s GPT for text analysis.

Patrick Chia
Founding Eng
DuckDB
Prefect
Streamlit

Near Real-time Analytics

Build a near real-time analytics pipeline with the WAP pattern and visualize metrics with Streamlit.

Sam Jafari
Dir. Data and AI
dbt
CI/CD
marts

dbt-style Pipelines with CI/CD and Version Control

dbt workflows vs. Bauplan pipelines, with branching, testing, and CI/CD.

Yuki Kakegawa
Staff Data Eng