The lakehouse for AI-native data engineering

Build data pipelines from your repo with AI coding assistants. Bauplan turns data changes into branch-isolated runs and atomic publishes, all exposed as simple APIs that your IDE, CLI, and code reviews can reason about.

Trusted by Mediaset, scops.ai, Moffin, Trust & Will, and Intella.
01

Everything is code, including the data plane.
Your data platform is your repo.

API-first data infrastructure for IDE-first workflows

Define transformations and runtime environments in code. Run from your IDE using the SDK and CLI with the same surface your AI assistant can call.
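
For example, a minimal sketch of querying a table from your IDE with the SDK (the table name and query are illustrative):

import bauplan

client = bauplan.Client()

# Run a SQL query against a branch and get the results as a DataFrame
df = client.query('SELECT COUNT(*) AS n FROM my_table', ref='main').to_pandas()
print(df)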

Safe, fast iteration on production data

Every run happens on an isolated, zero-copy data branch. Publish with atomic data merges, keeping production unchanged and preserving artifacts for inspection and reruns.

Production ready with no migration

Bauplan runs production-grade workloads without the overhead of traditional platforms. Deploy with single-tenant, Private Link, and BYOC options. Your data stays in object storage, with no data movement needed. SOC 2 Type 2 compliant, with built-in isolation and access controls.

02

Integrations

Read more about Bauplan integrations in our docs.
03

Built With Bauplan

Examples from the field. Real data applications built with Bauplan.
SEE ALL EXAMPLES
Prefect · Pandas · Iceberg

Iceberg Lakehouse and WAP

Orchestrated Write-Audit-Publish pattern for ingesting Parquet files into Iceberg tables.

Chris White, CTO @Prefect
RAG · Pinecone · OpenAI

RAG system with Pinecone and OpenAI

Build a RAG system with Pinecone and OpenAI over Stack Overflow data.

Ciro, CEO @bauplan
PyArrow · Pandas · DuckDB

Data Quality and Expectations

Implement data quality checks using expectations.

Jacopo, CTO @bauplan
PDF · OpenAI · Pandas

PDF analysis with OpenAI

Analyze PDFs using Bauplan for data preparation and OpenAI's GPT for text analysis.

Patrick Chia, Founding Eng
DuckDB · Prefect · Streamlit

Near Real-time Analytics

Build a near real-time analytics pipeline with the WAP pattern and visualize metrics with Streamlit.

Sam Jafari, Dir. Data and AI
dbt · CI/CD · marts

dbt-style Pipelines with CI/CD and Version Control

dbt workflows vs. Bauplan pipelines with branching, testing, and CI/CD.

Yuki Kakegawa, Staff Data Eng
04

A whole data platform in your repo

Git-style control for data changes

Branch your data instantly

Create a zero-copy branch instantly. Use branches for development, experimentation, and safe backfills while main stays protected.

Transactional and declarative pipelines

A pipeline run behaves like a database transaction. Outputs merge on success. On failure, production stays unchanged.

Roll back anytime

Revert a bad publish in seconds by rolling back to the last known-good commit, with a full history of what changed (see the sketch after the snippet below).

import bauplan

client = bauplan.Client()

# Create a new branch
my_b = client.create_branch(
    branch='import_branch',
    from_ref='main'
)

# Create a new table in the branch
new_table = client.create_table(
    table='your_table_name',
    search_uri='s3://bucket/*.parquet',
    branch=my_b
)
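
To make the rollback step concrete, a minimal sketch, assuming commits expose a ref you can branch from (the commit selection and attribute names are illustrative; check the SDK docs):

import bauplan

client = bauplan.Client()

# List recent commits on main and pick the last known-good one
# (the selection criterion here is illustrative)
commits = list(client.get_commits(ref='main'))
last_good = commits[1]

# Recreate that state on a recovery branch
# (assumes from_ref accepts a commit ref)
fix_b = client.create_branch('rollback_branch', from_ref=last_good.ref)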

Declarative environments, zero infrastructure

import bauplan

# Define the Python env, with pinned package versions, in code
@bauplan.model()
@bauplan.python(pip={'pandas': '2.2.0'})
def clean_data(data=bauplan.Model('my_data')):
    import pandas as pd

    # Materialize the input table as a DataFrame and drop null rows
    df = data.to_pandas()
    return df.dropna()

Pythonic and fully managed

Write transformations as Python functions and SQL. Bauplan handles execution, scaling, and table I/O. No configs, containers, or runtime plumbing.
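
As an illustration, a minimal sketch of chaining models by name ('clean_data' is the model defined above; the column names are hypothetical):

import bauplan

@bauplan.model()
@bauplan.python(pip={'pandas': '2.2.0'})
def top_customers(data=bauplan.Model('clean_data')):
    import pandas as pd

    # The aggregation and column names are illustrative
    df = data.to_pandas()
    return df.groupby('customer_id', as_index=False)['amount'].sum()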

Run from your IDE or your AI assistant

The workflow is code-addressable end to end: branch, run, validate, publish. You can drive it from your IDE or let an assistant execute it safely.

Code-first, tool-callable by design

A programmable data lakehouse

Write every step of your pipeline as code. Version everything — business logic, data, environments — just like software.

A small, typed API surface

A few predictable primitives (branch, query, run, commit, merge, inspect) give agents a reliable loop for iteration, validation, and publishing.

Safer defaults for AI-assisted work

Branch by default, protect prod. Each write produces immutable references that capture code, inputs, outputs, and environment, so rollbacks and replay are straightforward.

import bauplan

client = bauplan.Client()

# Create a development branch
_b = client.create_branch('my_b', from_ref='main')

# Run the pipeline on it
client.run('./my_project', ref=_b)

# Inspect recent commits
for commit in client.get_commits(ref=_b):
    print(commit.message)

# Merge changes into main
client.merge_branch(_b, into_branch='main')
05

Latest from our blog

AI-first data engineering, Git-for-data semantics, and serverless execution over Iceberg
READ ALL POSTS
06

FAQs

What if I already have Databricks or Snowflake?

Great! Bauplan is built to be fully interoperable. All tables produced with Bauplan are persisted as Iceberg tables in your S3, making them accessible to any engine and catalog that supports Iceberg. Our clients use Bauplan together with Databricks, Snowflake, Trino, AWS Athena and AWS Glue, Kafka, SageMaker, etc.

What does Bauplan replace in my AWS data stack?

Bauplan consolidates pipeline execution and data versioning into one workflow: branch, run, validate, merge. You can keep S3 and your orchestrator; you remove a lot of cluster complexity and glue. For example, an Airflow DAG that spins up an EMR cluster, submits Spark steps, then runs an AWS Glue crawler to refresh the Glue Data Catalog before triggering downstream jobs becomes: Airflow triggers a Bauplan run on an isolated branch that writes Iceberg tables directly to S3.
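
As a hedged sketch of that pattern, using Airflow's TaskFlow API (the DAG id, project path, and branch name are illustrative):

from airflow.decorators import dag, task
import pendulum

@dag(schedule='@daily', start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def bauplan_daily_run():
    @task
    def run_on_branch():
        import bauplan
        client = bauplan.Client()
        # Run the pipeline on an isolated branch, then publish atomically
        b = client.create_branch('airflow_run', from_ref='main')
        client.run('./my_project', ref=b)
        client.merge_branch(b, into_branch='main')
    run_on_branch()

bauplan_daily_run()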

How do you keep my data secure?

Your data stays in your own S3 bucket at all times. Bauplan processes it securely using either Private Link (connecting your S3 to your dedicated single-tenant environment) or entirely within your own VPC using Bring Your Own Cloud (BYOC).

Do I need to learn a new data framework or DSL?

No. Bauplan is just Python (and SQL for queries). That's why your AI assistant can write correct Bauplan code immediately.

What does Git-for-Data mean?

Bauplan lets you use Git abstractions, like branches, commits, and merges, to work with your data. You can create data branches on your data lake to isolate changes safely and enable experimentation without affecting production, and use commits to time-travel to previous versions of your data, code, and environments in one line of code. All this while ensuring transactional consistency and integrity across branches, updates, merges, and queries. Learn more in our docs.
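
For instance, a minimal sketch of time travel with the SDK (the commit selection and ref attribute are assumptions; check the SDK docs for supported ref formats):

import bauplan

client = bauplan.Client()

# Pick an earlier commit on main and query a table as of that ref
commits = list(client.get_commits(ref='main'))
old_ref = commits[1].ref  # illustrative choice of commit

table = client.query('SELECT COUNT(*) AS n FROM my_table', ref=old_ref)
print(table.to_pandas())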