Git-for-data: Formal Semantics Part 2: Branching, Merging, and Rollbacks

Bringing Database-level Guarantees to the Lakehouse
Ciro Greco
October 24, 2025

See the repository on GitHub.

Transactions and Schrödinger pipelines

Imagine we run a data pipeline that transforms raw events into customer-level aggregates. Step one finishes, but step two crashes, so our intermediate tables exist while the final ones do not. We have just created what we call a Schrödinger pipeline: a workflow suspended between success and failure, half-written yet already visible. Its results are both real and not until someone runs a query.

That partial visibility has practical consequences. Dashboards may show wrong KPIs. Feature stores may serve contaminated inputs to models, degrading predictions and causing silent drift. Alerts can fire or fail for the wrong reasons. Billing, compliance, and audit systems can record inconsistent states, making reconciliation expensive. Engineers waste time debugging transient errors instead of shipping features. In short, partial writes propagate quietly but widely.

In database systems this class of failure is solved by transactions, which guarantee that either all writes happen or none do. In a lakehouse, where compute and storage are separate, table-level atomicity cannot provide the same safety. A pipeline can update one table successfully and leave another untouched, producing a view of the world that never actually existed.
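As a point of reference, here is a minimal sketch of the all-or-nothing behavior a database gives us, using Python's built-in `sqlite3` module. The table names and the `version` column are illustrative assumptions, not part of any real schema.

```python
import sqlite3

def make_db():
    # In-memory database with two tables, both at version 1.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE int_customers (version INTEGER)")
    conn.execute("CREATE TABLE fact_revenue (version INTEGER)")
    conn.execute("INSERT INTO int_customers VALUES (1)")
    conn.execute("INSERT INTO fact_revenue VALUES (1)")
    conn.commit()
    return conn

def run_pipeline(conn, crash_midway):
    # `with conn` commits if the block succeeds and rolls back if it raises.
    try:
        with conn:
            conn.execute("UPDATE int_customers SET version = 2")
            if crash_midway:
                raise RuntimeError("step two crashed")
            conn.execute("UPDATE fact_revenue SET version = 2")
    except RuntimeError:
        pass  # the first UPDATE has been rolled back

def versions(conn):
    a = conn.execute("SELECT version FROM int_customers").fetchone()[0]
    b = conn.execute("SELECT version FROM fact_revenue").fetchone()[0]
    return a, b
```

Because both updates run inside one engine-managed transaction, a crash between them leaves neither table changed. This is the property the rest of the article tries to recover for pipelines.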

Some modern catalogs and table formats advertise features that sound transactional, like “atomic multi-table commits”. These claims are often true in a narrow sense: they guarantee that changes to table metadata are applied together, to one or more tables. This is not the same as a transaction in the database sense. Unless we update the rest of the toolchain, the burden of correctness may still fall on the user. Transactions are no joke: a recent study of 91 popular open source projects found that half of the ad hoc transactions built by application developers were faulty!

A key difference between your vertically integrated database (e.g. Postgres) and a lakehouse is architectural. In a database, compute and storage live inside one coordinated engine that can move all changes forward or roll them all back. In a lakehouse, the catalog only tracks metadata while the actual computation happens on decoupled processes, for instance on Spark or a SQL engine. The two layers speak different languages and move on separate clocks.

A pipeline connects those layers. It reads from storage, executes logic through ephemeral compute, and writes back new table versions. As long as those layers remain independent, no single mechanism can guarantee atomicity across the entire process.

This leaves an open question: can a data pipeline running on a lakehouse behave transactionally, so that downstream systems either see all its results or none?

Clone the repo and tag along.

Previously on Bauplan…

In Part 1, we used formal modeling to prove that a lakehouse built on immutable table snapshots, commits, and branches can model predictable merges similar to what happens in Git with code (i.e. “Git-for-Data”).

This model, however, describes the system at rest, when commits already exist and we reason only about their relationships. What is not captured is the execution phase, when those commits are being created by running pipelines, which are often modeled as DAGs of transformations that read existing tables and produce new ones.

In this article, we add pipelines to our model to capture the dynamic aspect of writing data. In particular, we want to find out whether the primitives used so far are enough to guarantee that a pipeline runs transactionally, or whether something else is needed. By using Alloy, we get useful counterexamples for free every time our concepts are not strict enough, which in turn forces a new iteration on the conceptual model.

Why Per-Table Atomicity Isn’t Enough

Table formats such as Apache Iceberg or Delta Lake guarantee that each individual table update is atomic. That protection stops the moment a pipeline spans multiple tables or steps. A failed transformation, a retry, or a partial write can still leave the system in an inconsistent state.

Consider a simple pipeline:

Read raw events → Create an intermediate table int_customers → Build a final table fact_revenue

If step two finishes but step three crashes, the data lake now contains a new version of int_customers and an old version of fact_revenue. To downstream consumers the system looks consistent (it runs, returns data, and even passes basic checks). However, the numbers no longer align, because we now have a new version of an intermediate table but an old version of the final one.
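The failure above can be replayed with a toy catalog. This is a sketch under simplifying assumptions: a catalog is a plain dictionary from table name to snapshot id, and `write_table`, `run_pipeline`, and `is_consistent` are hypothetical helpers, not the API of Iceberg, Delta Lake, or Bauplan.

```python
def write_table(catalog, table, snapshot):
    # Stand-in for a per-table atomic commit: one pointer swap per table.
    catalog[table] = snapshot

def run_pipeline(catalog, run_id, crash_before_final=False):
    # Step two: materialize the intermediate table.
    write_table(catalog, "int_customers", run_id)
    if crash_before_final:
        raise RuntimeError("crash while building fact_revenue")
    # Step three: materialize the final table.
    write_table(catalog, "fact_revenue", run_id)

def is_consistent(catalog):
    # Here "consistent" means both tables were produced by the same run.
    return catalog["int_customers"] == catalog["fact_revenue"]
```

Each individual write is atomic, yet after a crash the catalog holds `int_customers` from the new run and `fact_revenue` from the old one, so `is_consistent` returns `False` even though no single table is corrupted.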

This kind of failure appears in many real situations. For example, a pipeline that recreates partitions incrementally within a run may leave some tables rewritten and others in their old state.

The core point here is that pipelines rarely touch a single table. Every production DAG combines many tables and transformations, often executed across distributed compute engines, so when something fails mid-run, atomicity at a table level does nothing to preserve global consistency.

As we discussed above, the consequences can be subtle but very expensive. Even outside of catastrophic scenarios, data engineering teams might end up wasting a lot of time rerunning pipelines and manually reconciling states that should never have diverged.

Modeling transactions with branches

Branches as multi-table isolation

In Part 1, we introduced the concept of a data branch, which is a pointer to a chain of commits representing the full state of the data lake at a moment in time. Branches allow multiple users or processes to work safely on the same data without interfering with each other.

Their ability to isolate multiple tables makes branches the natural starting point for thinking about multi-table consistency. If atomic tables are not enough to offer transactional guarantees for pipelines, maybe multi-table Git-style branching can.

The answer is: not entirely. Even with multi-table data branches, the system can still record a series of valid commits that together describe an invalid state of the data lake as a whole.

Imagine a team running a pipeline on a feature branch derived from main. The first step aggregates raw events into int_customer_activity, the second step produces the final table fact_customer_metrics. When the first step finishes, the branch now points to a new snapshot of int_customer_activity. Before the second step completes, a failure stops the run. Not ideal, but nothing catastrophic: from the branch / table perspective, there is nothing inconsistent here.

The problem appears if someone merges. If that branch were merged into main, production would move to a state that never existed in the pipeline logic: a mix of new intermediate results and old final outputs.

A branch isolates concurrent work, but it does not know whether a pipeline is finished or in progress. A pipeline, by contrast, has a beginning, an ordered set of steps, and a defined success condition. So, unless the branch and the pipeline run are connected, the system cannot tell the difference between “work in progress” and “work ready to publish.”

Coupling Branches and Execution

The missing ingredient is a link between branching and execution: the branch should be merged back into production automatically, and only if the pipeline completes successfully.

In other words, our data lakehouse must understand that a pipeline run is not just a chain of writes but a transaction in progress and treat an entire pipeline run as a single atomic event: either fully applied or not visible at all.

If every pipeline run automatically operates within its own isolated branch, and if that branch merges atomically into production only when the run succeeds, then the system can emulate transactional behavior at the pipeline level. The mechanics are fairly straightforward:

  1. A run always happens on a temporary branch (from the current one, e.g., main), automatically created by the system at the start.
  2. Each step in the pipeline writes on that temporary branch.
  3. On success, the system performs an atomic merge into the current branch.
  4. On failure, the branch remains unmerged, leaving the current branch untouched and the temporary branch available for debugging.

Downstream readers never see intermediate results, because nothing is published until the final merge completes. We retain the isolation of failures provided by the branches while gaining full reproducibility of intermediate commits for debugging.

This pattern converts a multi-step workflow into a single observable event from the point of view of the lakehouse: either a merge happens or it does not.
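The four steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Bauplan's implementation: a branch is a dictionary from table name to snapshot id, `Lakehouse` and `run_transactionally` are hypothetical names, and the merge assumes no concurrent writers on the target branch.

```python
class Lakehouse:
    """Toy model: each branch name points at a {table: snapshot} mapping."""

    def __init__(self, tables):
        self.branches = {"main": dict(tables)}

    def create_branch(self, name, from_ref):
        self.branches[name] = dict(self.branches[from_ref])

    def write(self, branch, table, snapshot):
        self.branches[branch][table] = snapshot

    def merge(self, source, target):
        # One pointer swap: readers of `target` see all changes or none.
        self.branches[target] = {**self.branches[target], **self.branches[source]}


def run_transactionally(lake, run_id, steps, target="main"):
    # 1. The run always happens on a temporary branch created from the target.
    temp = f"run-{run_id}"
    lake.create_branch(temp, from_ref=target)
    try:
        # 2. Each step writes on the temporary branch only.
        for step in steps:
            table, snapshot = step()
            lake.write(temp, table, snapshot)
    except Exception:
        # 4. On failure nothing is merged; the branch is kept for debugging.
        return False
    # 3. On success, publish everything with a single merge.
    lake.merge(temp, target)
    return True
```

Each step is a zero-argument callable returning a `(table, snapshot)` pair. If any step raises, `main` is untouched and the partial results remain inspectable on the `run-<id>` branch.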

In Bauplan, this principle is implemented automatically. When a user runs bauplan run, the platform creates a dedicated branch for that execution and all transformations write to that branch. If the run goes well, and expectations and validation tests pass, Bauplan merges the data branch back into main. If any test fails, the merge is suspended, leaving the production environment in a consistent state and preserving the failed run for inspection.

By wrapping compute (the pipeline run) inside data versioning semantics (branches and merges), our lakehouse gets the same consistency that databases provide internally through transactions, while maintaining the decoupling of compute and storage.

In the next section, we examine where this analogy holds and where it breaks, because branches may look like transactions, but they do not always behave like them.

Lightweight modeling of isolation with Alloy

Databases enforce consistency because they control data and compute together through transactions; catalogs do not. However, not all hope is lost. If compute and data are managed together during pipeline execution, and we control concurrent changes through branches, then we have replicated a database through Git abstractions and arbitrary, decoupled compute, haven’t we?

Well, let’s test it before jumping to hasty conclusions. To verify whether the analogy truly holds, we use the same tool as in Part 1, Alloy, a lightweight formal modeling language that lets us describe system rules precisely and automatically search for counterexamples.

Our goal is to test a specific guarantee: no half-written pipeline should ever become visible, even when we use temporary branches. The broader question that we want to answer here is: are there things users are allowed to do with branches that they are not allowed to do with transactions?

What Alloy Shows

The model reveals a subtle counterexample:

  1. User 1 runs a pipeline (on a temporary branch), which fails half-way, with only some tables successfully materialized.
  2. User 2 creates a branch from the temporary branch right after the run.
  3. User 2 merges their branch into main.
  4. main now exposes only part of the changes intended by User 1’s pipeline.

This means that branches can be “nested”, even by multiple users, while transactions cannot: User 2 cannot really “pick up” a transaction started by User 1. In the model, a “temporary” branch is just a branch (visible, forkable, and mergeable), so users can leak partial results. The fact that a “temporary” branch is not special is elegant, but it does come at a cost.

This counterexample shows that the analogy fails here and that branches can imitate transactions, but they are not exactly the same.
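To make the trace concrete, here is a small Python replay of the four steps. It is not the Alloy model itself, only an illustration under the same simplifying assumption that a branch is a mapping from table name to snapshot id; the branch and table names are made up for the example.

```python
def create_branch(branches, name, from_ref):
    # Any branch can be forked, including a "temporary" one.
    branches[name] = dict(branches[from_ref])

def merge(branches, source, target):
    # Simplified merge: the source's table versions win.
    branches[target] = {**branches[target], **branches[source]}

def replay_counterexample():
    branches = {"main": {"int_customers": "v1", "fact_revenue": "v1"}}

    # 1. User 1's run starts on a temporary branch and fails half-way.
    create_branch(branches, "tmp_run_user1", from_ref="main")
    branches["tmp_run_user1"]["int_customers"] = "v2"  # step one succeeded
    # Step two crashed, so fact_revenue is never written on this branch.

    # 2. User 2 creates a branch from the temporary branch.
    create_branch(branches, "user2_dev", from_ref="tmp_run_user1")

    # 3. User 2 merges their branch into main.
    merge(branches, "user2_dev", "main")

    # 4. main now exposes only part of User 1's intended changes.
    return branches["main"]
```

Every operation in the replay is individually legal. The leak comes from the combination: nothing in the branch model records that `tmp_run_user1` belongs to an unfinished run.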

What This Means in Practice

A real system can contain this problem with guardrails and automation, e.g.:

  • Block merges from branches created from a temporary branch.
  • Mark temporary branches as “special”: users cannot start a new branch out of them.

These are operational rules that block the unintended scenario by effectively changing the definition of our primitives, in this case by limiting the concept of branching in one way or another.
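Both guardrails can be expressed as checks on branch metadata. The sketch below is hypothetical: `Catalog`, `GuardrailError`, and the `temporary` flag are names invented for illustration, and a real system would enforce these rules inside the catalog service rather than in client code.

```python
class GuardrailError(Exception):
    pass


class Catalog:
    """Hypothetical branch metadata: parent pointers plus a 'temporary' flag."""

    def __init__(self, forbid_fork_from_temp=False):
        self.parent = {"main": None}
        self.temporary = set()
        self.forbid_fork_from_temp = forbid_fork_from_temp

    def create_branch(self, name, from_ref, temporary=False):
        # Second rule: temporary branches are special and cannot be forked.
        if self.forbid_fork_from_temp and from_ref in self.temporary:
            raise GuardrailError(f"cannot branch from temporary branch {from_ref!r}")
        self.parent[name] = from_ref
        if temporary:
            self.temporary.add(name)

    def check_merge(self, source):
        # First rule: refuse to merge any branch descended from a temporary one.
        # The walk starts at the parent, so the system can still merge the
        # temporary branch itself when its run succeeds.
        ref = self.parent[source]
        while ref is not None:
            if ref in self.temporary:
                raise GuardrailError(f"{source!r} descends from temporary branch {ref!r}")
            ref = self.parent[ref]
```

The first rule keeps branching unrestricted and rejects the leak at merge time; the second prevents the nested branch from existing at all. Either one rules out the counterexample, at the price of making temporary branches different from ordinary ones.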

Alloy exposes the exact limits of our guarantees, showing that branches can simulate transactional behavior only when execution and publishing are coupled, and it pinpoints where that simulation breaks. Instead of vague confidence, we now have a defined boundary and a design space we can engineer safely within.

See you soon

Obviously, these models are far from a full model of our system, or any other lakehouse built on similar premises. However, even simple models that make very few, very general assumptions may be surprisingly useful in keeping our informal reasoning in check, and keeping our claims honest.

We are now bringing these lessons back into product design (we are a company, after all). The next stage is expanding capabilities into stronger guarantees in our branching APIs, richer pipeline recovery semantics, and deeper validation of runtime behavior against the formal spec.

On top of that, fear not, we also look forward to sharing our results with the community, with new proofs and much larger models.

Acknowledgments

Our formal work is being carried out with the help of Manuel Barros (CMU), Jinlang Wang (University of Wisconsin, Madison), and Weiming Sheng (Columbia University). In particular, Jinlang and Manuel devised the Alloy model used for this blog post.
