
Bauplan. A year in review

And our (Bau)plan for the next one
Ciro Greco
Jacopo Tagliabue
Dec 30, 2025

The days are long, the years are short

2025 was the year Bauplan stopped being an idea and became a system that people rely on to run mission-critical data pipelines and analytics.

After we publicly unveiled the company at Data Council and announced our seed round, we moved from design partners to customers running real production workloads. Today, Bauplan runs more than 200,000 jobs per week.

Bauplan is adopted where the cost of a bad data change is high and the need to ship confidently is mission critical. We support small teams that need to move fast without incidents, large organizations that require strict guarantees, and increasingly autonomous agents that must operate safely without constant human supervision. Our customers use Bauplan to build, test, and publish Silver and Gold layer pipelines on their lakehouse with software-grade safety.

We are grateful to the many people who interviewed us, invited us to events and conferences, wrote reviews and tutorials, used our product to build something, cited our work, and gave us unfiltered feedback. You are too many to name individually, but thank you.

To celebrate the end of the year we want to close it with something old, something borrowed and something new.

Something old: Python + Open Formats

When we started Bauplan, we bet on a Python-centric data landscape built on open formats backed by object storage. That felt niche at the time; in 2025, a consensus emerged at the intersection of two big waves.

Python. Python has become the user-facing layer for the next generation of non-JVM data engines. Its momentum is being pulled by two unstoppable forces: AI workloads are overwhelmingly written in Python, and AI coding tools operate best on code-first systems where Python is the primary surface. In 2024 Python overtook JavaScript on GitHub, and its rise is explicitly tied to AI.

Now, because Python is everywhere in open source, notebooks, and AI repos, it is heavily represented in the code corpora LLMs learn from. This makes Python one of the preferred media for AI coding assistants like Claude Code and Cursor, which in turn pushes data engineering toward code-defined pipelines and away from UIs. More on this below.

Iceberg. On the open format front, Apache Iceberg moved from being a promising open-source project to becoming the lingua franca for interoperability over data that lives in the customer’s own bucket, with no strings attached. Today, all major data platforms support Iceberg, which reflects the ecosystem’s shift toward a common, vendor-neutral format.

Our early commitment to Iceberg paid off handsomely: it made it possible for teams to adopt Bauplan without migrating data into a new proprietary store, and to stay compatible with existing platforms like Snowflake, ClickHouse, Dremio, and Databricks.
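
To make the "no strings attached" point concrete, here is a minimal sketch of reading an Iceberg table in place with PyIceberg; the catalog name, URI, warehouse, and table identifier are placeholders, not a real deployment.

```python
# Minimal sketch: read an Iceberg table in place with PyIceberg.
# Catalog name, URI, warehouse, and table identifier are illustrative placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

# Load the table and scan only the columns we need into an Arrow table.
table = catalog.load_table("marts.orders")
arrow_table = table.scan(selected_fields=("order_id", "amount")).to_arrow()
print(arrow_table.num_rows)
```

The table stays where it is, in your bucket, and any engine that speaks Iceberg can read it, which is exactly what makes side-by-side adoption possible.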

The Open Lakehouse vision is now live with our customers and a growing set of zero-copy integrations to our left and to our right. For instance, Trust & Will, a once warehouse-or-nothing shop, is now an open, thriving lakehouse.


Something borrowed: from DuckDB to DataFusion

Bauplan’s vision of re-imagining the data lifecycle “as code” would not have been possible without a pragmatic, composable engineering philosophy: we focus on the verticalized experience and a FaaS runtime, and we lean on “spare parts” for capabilities that are not truly differentiating (or not yet).

Following the first-ever reference implementation of an ephemeral query engine on FaaS compute over object storage, we evolved a custom SQL processing pipeline built around a novel I/O cache and a DuckDB fork that planned over Iceberg tables and executed over Arrow streams. As pioneering as it was, last November we decommissioned the DuckDB-based code path and moved our SQL planning and execution to Apache DataFusion, because the cost of maintaining a fork outweighed the benefits.
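
If you have not played with DataFusion yet, its Python bindings give a quick feel for the engine we now build on. This is a generic illustration, not our internal code path; the table name and file path are placeholders.

```python
# Generic DataFusion illustration (not Bauplan's internal code path):
# register a Parquet file as a table and run SQL over it, getting Arrow batches back.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("orders", "data/orders.parquet")  # placeholder path

df = ctx.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
batches = df.collect()  # list of pyarrow.RecordBatch
```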

The Bauplan stack that enters 2026 is considerably simpler, more Rust-y, and full of exciting possibilities now that we have embraced one more (after PyIceberg, Kuzu, Nessie etc.) community to be part of and contribute to. Working within the DataFusion ecosystem also strengthens our relationships with companies, institutions and people we look up to, such as Paul at InfluxDB, Remzi and Andrea’s group at the University of Wisconsin-Madison (Tyler, Xiangpeng), Federico at Together AI and James at Stanford.

We look forward to sharing with the data community how our stack continues to evolve as more customers and new workloads ramp up on the platform.


Something new: AI for data engineers

2025 was the year of agents. On one side there were "LinkedIn agents", polished demos that stop at query or dashboard generation. On the other side, coding agents started interacting with real systems: repositories, APIs, and production environments, in ways that would have felt miraculous just 18 months ago.

In data engineering, this immediately exposed a generational problem. Existing data platforms are designed around shared state, implicit side effects, and long human-in-the-loop workflows, which makes them fragile when autonomous systems enter the picture.

In 2025, we leaned into this reality by shipping our MCP server as a fully supported, code-addressable interface. In the past months, we have seen remarkable adoption (even to our own surprise!) among our customers, who started hooking Bauplan MCP into their Cursor / Nao / Claude Code workflows.

Customers like Trust & Will use Bauplan MCP on a daily basis, alongside other MCP servers like dbt's, to explore data, reason about lineage, and genuinely generate and run new pipelines. The team at Moffin is doing the same; moreover, they are rolling out Bauplan MCP to provide a unified interface to their own customers, who interact with their platform in plain English.

The unifying thread is that Bauplan plus MCP provides a remarkably simple interface for the entire analytics lifecycle (authoring queries, running them, validating results, running pipelines, and publishing changes), so teams can use AI not only to generate business logic but also to manage the underlying data infrastructure.


The future is not read-only

The success of Bauplan + AI agents comes down to the ergonomics of the platform. Agents do not reason about dashboards, UIs, or implicit state; they reason about APIs. Bauplan is designed as a full data platform as code, where every operation is explicit and programmable: branching, execution, validation, and publishing.
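
To give a flavor of what "explicit and programmable" looks like in practice, here is a hedged sketch of a branch, run, validate, and merge loop with the Python SDK. Treat the method names and signatures as illustrative rather than as a reference, and check the current SDK docs for the exact API.

```python
# Hedged sketch of a branch -> run -> validate -> merge workflow.
# Method names and signatures are illustrative; consult the Bauplan SDK docs.
import bauplan

client = bauplan.Client()

# Work in an isolated branch, never directly on main.
branch = "ciro.fix_orders"
client.create_branch(branch, from_ref="main")

# Run the pipeline project against the branch.
state = client.run(project_dir="./orders_pipeline", ref=branch)

# Validate the results on the branch before publishing anything.
check = client.query(
    "SELECT COUNT(*) AS bad_rows FROM orders_gold WHERE amount < 0",
    ref=branch,
)

# Publish atomically only if the run succeeded and the data looks right.
if state.job_status == "SUCCESS" and check.to_pylist()[0]["bad_rows"] == 0:
    client.merge_branch(source_ref=branch, into_branch="main")
```

The point is less the specific calls and more that every step an agent (or a human) takes is an API call that can be reviewed, retried, or thrown away along with the branch.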

Even if 2025 was the year of agents, something fundamental is still missing for data engineering. Most progress has focused on the read path: querying data, generating SQL, and explaining results. The write path is largely untouched.

This matters because data engineering is really about ops-y kind of stuff: writing new transformations that can affect dozens of tables at once, running backfills, changing schemas safely, fixing broken pipelines, publishing new artifacts without breaking production, and so on. If those actions are not expressed explicitly as code and executed in isolated environments, agents cannot be trusted to perform them autonomously. Without a safe write path, we are not really doing data engineering.

That is the harder problem, and the one we are explicitly solving: automation on the write path. Bauplan is designed for untrusted, asynchronous actors by default. That is why we invested early in Git-for-Data semantics, branch-based isolation, and multi-table commits. It is also why we continue to formalize these guarantees, from Alloy models to MVCC-style correctness, as prerequisites for trustworthy automation rather than optional best practices.

Over the past year, we tested these ideas in practice. We worked with AI companies to ship agent-driven ETL code running in real cloud environments. We introduced self-repairing data pipelines as a concrete benchmark: if an agent can diagnose a failed run, fix it, and safely publish the result without corrupting production, then automation is doing something meaningful.
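
As a rough illustration of that benchmark, the control loop fits in a few lines; run_on_branch, agent_propose_fix, and merge_if_green below are hypothetical placeholders standing in for the platform and the coding agent, not a real API.

```python
# Rough illustration of a self-repairing pipeline loop.
# run_on_branch, agent_propose_fix, and merge_if_green are hypothetical
# placeholders standing in for the platform and the coding agent.
MAX_ATTEMPTS = 3

def self_repair(pipeline_dir: str, branch: str) -> bool:
    for _ in range(MAX_ATTEMPTS):
        result = run_on_branch(pipeline_dir, branch)  # isolated: never touches main
        if result.ok:
            return merge_if_green(branch)             # publish atomically, or not at all
        # Hand the failure back to the agent: logs in, patched code out.
        patch = agent_propose_fix(pipeline_dir, result.logs)
        patch.apply(pipeline_dir)
    return False  # give up and leave production untouched
```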

Given the success of the "AI-assisted Bauplan" mode, we will soon release Agent Skills as well. As it turns out, the uniform, self-documented API surface of the platform plays well with lightweight abstractions for LLMs: the mapping between APIs and capabilities is as simple for LLMs to write down as code as it is for humans to interpret and oversee.

Onto the next year

In September, we met Aditya Parameswaran and collaborators at VLDB, where they argued that today’s data infrastructure is built for small groups of trusted humans, not for autonomous agents that operate asynchronously and without alignment guarantees.

That raises a broader question. Is this the time to rethink data infrastructure from the ground up, the same way we had to rethink it for the cloud? To revisit assumptions that made sense for human-only workflows and replace them with systems designed for autonomous actors?

Yes, we think so. That is why we welcomed Aditya as a special advisor, to help us build toward that vision.

2025 made one thing extremely clear: the limiting factor for AI in data engineering is not intelligence, it is infrastructure. Coding agents can already generate SQL and Python, inspect schemas, and propose fixes, but most data stacks still cannot let them execute changes end-to-end in a safe way because the write path is fragile or implicit.

Bauplan is built to close that gap. We are taking this work into 2026, starting with Data Day Texas and two talks at AAAI. Come find us there.
