
Duck Hunt: moving Bauplan from DuckDB to DataFusion

Arrow-Native and Community-Driven: Why DataFusion Won
Jacopo Tagliabue
November 5, 2025
TL;DR: we migrated our ephemeral SQL engine from DuckDB to DataFusion. We explain our decision process (no, it’s not because “it’s Rust”), what went well and what went not-so-well along the way, in the hope that it helps other system builders with their own decisions and trade-offs.

Last week, we defaulted the Bauplan feature flag enable-df-query to true. One small change for our CLI, one giant change for our backend. Since then, all incoming SQL operations have been handled by Apache DataFusion instead of DuckDB.

While not generally available yet, Bauplan already runs hundreds of thousands of data pipelines for early customers across four AWS regions, making the switch an important milestone for both our small but mighty team and our customers. This is the story behind the change: after some background on our setup and the initial DuckDB fork, we unpack the motivations behind our choice, our switching protocol, and how we see ourselves in this new ecosystem - what we hope to get, and what we want to give.

Building an ephemeral query engine

In 2023, Bauplan set out to build a lakehouse based on the insight that synchronous queries and asynchronous data pipelines (the two most common OLAP workloads) could be efficiently expressed and executed within a unified Function-as-a-Service (FaaS) model.

As DuckDB began to hit the mainstream, and running queries on your laptop was suddenly cool again, we were tinkering with a different idea: what if we ran the engine on ephemeral functions in the cloud, and provided the input data “in-flight” through a decoupled storage service? We bet early on Apache Iceberg and our own flavor of table versioning, and the blueprint for building a “deconstructed warehouse” promised a lot of flexibility (VLDB slides available on request):

  • by running the functions close to object storage, we could achieve lower latency and higher throughput;
  • by decoupling data management (Iceberg scan, data branches, cache) and execution, we could go to market faster, focusing on differentiating features as SQL became commoditized;
  • by using Apache Arrow as the lingua franca for data in-flight, we could compose different technologies that would otherwise be incompatible.

As we prototyped the system, we hit one main limitation: DuckDB assumes full control of I/O. Since we needed to handle storage separately, we forked it and added a new API, EXPLAIN SCANS. This API splits a query into two parts: the S3 scans (with as much predicate pushdown as possible) and the rest of the SQL logic.

For example, a query like SELECT SUM(c1) FROM t WHERE f > 5 AND c LIKE '%ciao%' is broken down by our planner into three functions:

  1. I/O scan, which reads from S3. It takes input parameters (Table=t, Projections=[c1, c], Filter=f>5) and outputs Arrow record batches for table t. Only part of the filter can be pushed down, so the column c needed by the residual predicate travels along with the scan. The I/O runs in Bauplan’s “platform space,” outside of DuckDB, which at the time struggled with object storage quite a lot. (For the db-nerds: EXPLAIN SCANS returns a semi-optimized plan, sitting somewhere between the logical and physical representations you would get through EXPLAIN.)
  2. Query operation. A DuckDB function that takes the Arrow output from step 1 and runs the remaining SQL (SELECT SUM(c1) FROM scan WHERE c LIKE '%ciao%').
  3. Arrow Flight server. Returns the final tuples to the client.

EXPLAIN SCANS gave us a clean way to separate data access from computation, letting ephemeral functions handle each piece independently.
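
To make the split concrete, here is a minimal sketch of the two halves in Python: pyarrow.dataset stands in for our platform-space scan, and DuckDB runs the residual SQL. The bucket path and schema are hypothetical, and in the real system the two stages are separate ephemeral functions exchanging Arrow record batches, with a third function serving results over Arrow Flight.

import duckdb
import pyarrow.dataset as ds

# Stage 1: platform-space I/O scan. Project only the columns the rest of
# the query needs, and push down the range filter (f > 5); the LIKE
# predicate stays behind for the engine.
dataset = ds.dataset("s3://customer-bucket/t/", format="parquet")
scan = dataset.to_table(columns=["c1", "c"], filter=ds.field("f") > 5)

# Stage 2: the engine runs the residual SQL over the Arrow scan.
# DuckDB's replacement scans let us reference the in-memory table by name.
result = duckdb.sql("SELECT SUM(c1) FROM scan WHERE c LIKE '%ciao%'").arrow()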


In hindsight, EXPLAIN SCANS was pioneering in its own weird way. Well before iceberg-DuckDB was a thing (as of June 2025, it was still, according to our tests, very buggy), and even before DuckDB’s S3 performance became competitive, this deconstructed design allowed us to go live with a large customer and, for the very first time, realize the vision of “ephemeral SQL on open formats”: data lives in the customer’s bucket, and capabilities are provided by ephemeral functions built on demand to fulfill user requests.

Is that the end of the story? Of course not. The “composable data system” approach was elegant in its way, and it fostered some novel ideas around multi-language smart caching. However, as our customer base grew and use cases multiplied, cracks started appearing. Some issues were specific to our own choices (“It’s not you, it’s me”): for example, the cost of giving up interleaved I/O and processing started to outweigh the benefits of caching, custom S3 readers, and pluggability.

Other issues were more DuckDB-specific (“It’s definitely you”), e.g. weird memory spikes when dealing with Arrow objects and the growing cost of keeping our fork up to date. Worse, we slowly came to realize that DuckDB is more of an open-source product than an open-source project: even as DuckDB improved (which, in fairness, it did: it now has better extension support and improved S3 performance compared to 2023), it never had the flexibility we needed to hack on the problems we wanted to.

By open-source product we don’t mean a commercial database with an open-source distribution (e.g. Influx), but something slightly less orthodox: an open-source project whose roadmap, focus, and governance are closer to those of a product. The zero-dependency approach makes it hard for the roadmap to be truly community-driven (DuckDB now even has its own lakehouse format!). Issues like the mysterious Arrow behavior mentioned above were DuckDB-specific, so we couldn’t lean on the existing Arrow code or the knowledge of the very broad and diverse Arrow community. The product mindset also made it harder to pitch changes that we thought would make the system friendlier for builders, as opposed to users: understandably, users don’t care about fine-grained control over the optimizer, and they won’t naturally push for a super-customizable parser. The “pay-per-feature” option was hard to justify given the evolving nature of our needs, so it was time to look elsewhere for a more stable solution: not a product we could modify at the margins, but a project we could mold, re-use, and contribute back to.

Picking the successor of DuckDB

We have been friends with Paul Dix at Influx and the Remzi and Andrea Arpaci-Dusseau group at UW-Madison for a while. When Xiangpeng Hao (Remzi’s Ph.D. student) and Colin Marc (Bauplan’s founding engineer) entered our lives at approximately the same time, it looked like the sky was sending us a message: DataFusion.

After a year of production workloads at scale, the promised advantages of DataFusion were pretty clear. By switching, we would get an Arrow-first, embeddable engine that has comparable performance, but also one that’s:

  • hackable: not just flexible, but easy to extend, even for a company of our size;
  • community-driven: not just open-source, but built in a community-driven way, actively fostering a plurality of views and engagement from companies of all shapes and sizes, without financial commitments upfront (as is often the case, we did end up spending money, but in a way that was more in line with our values).

While DataFusion is arguably less polished than DuckDB, we are pretty comfortable running experimental software as long as there is long-term alignment on where the project is headed (e.g. before forking, we ran the largest Nessie deployment on the planet, and we were among Kuzu’s first and largest customers). Switching the engine would also give us a much-needed opportunity to fix the issues that, independently of DuckDB, had plagued our design from the first days with Iceberg.

Our query path is now much more streamlined: we run the DataFusion parser directly in our planner against our branching Iceberg catalog. When the ephemeral worker (with the DataFusion engine) receives the resulting plan, it can “just” execute it as-is and then pass the tuples to the Arrow Flight function downstream. The new flow costs us some modularity, but gives us an out-of-the-box competitive parser / planner without custom code to maintain; not to mention access to some fantastic performance improvements the community has in the pipeline. It’s also useful across the stack, not just for SQL queries. As a concrete example, let’s look at the following Bauplan Python function, reading the taxi_fhvhv dataset and applying some transformation:


import bauplan

@bauplan.model(materialization_strategy='REPLACE')
@bauplan.python('3.11', pip={'polars': '1.33.0'})
def my_taxi_transformation(
    trips=bauplan.Model(
        'taxi_fhvhv',
        columns=[
            'pickup_datetime',
            'PULocationID',
            'DOLocationID',
            'trip_miles',
        ],
        filter="pickup_datetime >= '2023-02-15T00:00:00-05:00'"
    )
):
    import polars

    # do something here, e.g. wrap the incoming Arrow table in a Polars frame
    df = polars.from_arrow(trips)

    return df

The declarative input to this function – fetch all the taxi trips after February 15, and project to four columns – needs to be turned into an actual Arrow input at execution time. With DataFusion, we’ve replaced the old custom S3 read path with the same function we use for queries, i.e., by mapping the input to SELECT pickup_datetime, PULocationID … FROM taxi_fhvhv WHERE …. In other words, since all our I/O ops are relational-algebra-ish operations, a performant runtime for SQL can be re-used for Python pipelines as well.
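
For illustration, here is a minimal sketch of that mapping, using the datafusion Python bindings; the model_to_sql helper and the local Parquet path are hypothetical (in production, the scan goes through our branching Iceberg catalog):

from datafusion import SessionContext

def model_to_sql(table: str, columns: list[str], filter_expr: str | None) -> str:
    # Quote identifiers so mixed-case columns like PULocationID survive
    # DataFusion's identifier normalization (more on this below).
    cols = ", ".join(f'"{c}"' for c in columns)
    sql = f'SELECT {cols} FROM "{table}"'
    return f"{sql} WHERE {filter_expr}" if filter_expr else sql

ctx = SessionContext()
# Hypothetical local copy; the real input is an Iceberg table on S3.
ctx.register_parquet("taxi_fhvhv", "data/taxi_fhvhv.parquet")

sql = model_to_sql(
    "taxi_fhvhv",
    ["pickup_datetime", "PULocationID", "DOLocationID", "trip_miles"],
    "pickup_datetime >= '2023-02-15T00:00:00-05:00'",
)
# The collected Arrow record batches are what gets handed to the user function.
batches = ctx.sql(sql).collect()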

Finally, we were always pretty nervous about our DuckDB fork being the only C++ dependency, while the rest of Bauplan was in Go and Python. The switch to DataFusion came as a company-wide transition to Rust (at least for some parts of the stack) was already in progress, and has also helped to accelerate it.

Pulling the trigger

The bulk of the work was extricating EXPLAIN SCANS from the planner; the new code was relatively simple in comparison. The first step was making the new query path available for internal tests under a feature flag. By tracking the SQL queries running every day in production, we could test customer queries and build confidence in the correctness of the new path on a representative sample. Once we built enough conviction, we established a shared timeline with our customers and prepared for migration day, i.e., doing all the usual work that's involved in a significant, client-visible transition.
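
For a flavor of what such a flag-gated correctness check can look like, here is a simplified sketch that runs one query on both engines over the same Parquet file and compares the Arrow results; the shadow_check helper is hypothetical, and the real harness replays production queries across the two full paths:

import duckdb
import pyarrow as pa
from datafusion import SessionContext

def shadow_check(sql: str, path: str, table: str = "t") -> bool:
    # Old path: DuckDB over the same Parquet file.
    con = duckdb.connect()
    con.execute(f"CREATE VIEW {table} AS SELECT * FROM read_parquet('{path}')")
    old = con.sql(sql).arrow()

    # New path: DataFusion over the same file.
    ctx = SessionContext()
    ctx.register_parquet(table, path)
    new = pa.Table.from_batches(ctx.sql(sql).collect())

    # Sort both sides so row order doesn't cause false negatives; real
    # comparisons also need to normalize types across the two engines.
    keys = [(c, "ascending") for c in old.column_names]
    return old.sort_by(keys).equals(new.sort_by(keys))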

Pulling the trigger got us every benefit we expected – that was the easy prediction! Many queries are now faster because of interleaving and better use of metadata, and the planning code is more streamlined and easier to debug. From a human-centric point of view, the migration was also a success. Organization-wise, the switch from a siloed dependency for database nerds to a shared team effort in a common language is already noticeable: contributing to and around the SQL engine now feels (as it should) like working on a shared building block instead of hacking an obscure DuckDB command we invented. Community-wise, we are thrilled by the very fast community feedback on our issues, and by the opportunity to make an immediate impact on the project.

Query latency for our public users before and after the switch (approx. twice as fast for p50).

Because we operate at the frontier of Iceberg-based data engineering, we were able to surface several upstream bugs from day one. Since our customers have pretty diverse workloads, they implicitly explore the query space quickly, surfacing bugs that we’ve been able to report (and often fix) for the community as a whole.

Alas, there is no migration without issues, and ours was no exception. Notable rough edges include the following:

  • DataFusion has odd behavior when it comes to the case-sensitivity of table and column names: by default, it “normalizes” unquoted identifiers in SQL to lowercase, even when that causes the catalog lookup to fail, and we found that it wasn’t always possible to turn off this behavior (see the sketch after this list).
  • While the Rust implementation of Iceberg is coming along nicely, the included DataFusion integration doesn’t support output partitioning, suffers from deadlocks, and has performance issues. We are maintaining a fork of iceberg-rust while fixes for these issues make it upstream, and we currently use iceberg-rust only to read metadata, relying instead on DataFusion’s robust Parquet support for the actual scans.
  • We found and fixed one nasty (if slightly niche) correctness bug.
  • We also needed to bring a few in-house optimizations along for the ride. We had already introduced an optimization in our DuckDB fork to simplify expressions involving dates: given the prevalence of date-based filters in BI-generated queries, even “simple” transformations can account for up to a 20x performance improvement. Porting the same optimization to DataFusion was a lot more straightforward than it was in DuckDB, thanks to DataFusion’s “bring your own optimization rules” approach.
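
The identifier quirk from the first bullet is easy to reproduce with the datafusion Python bindings; here is a minimal sketch with a hypothetical in-memory table and a mixed-case column:

import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
# Hypothetical table with a mixed-case column, as in the taxi example.
batch = pa.RecordBatch.from_pydict({"PULocationID": [1, 2, 3]})
ctx.register_record_batches("taxi", [[batch]])

# Unquoted identifiers are normalized (lowercased) before the lookup,
# so this fails even though the column exists:
# ctx.sql("SELECT PULocationID FROM taxi")
# Quoting the identifier preserves its case and succeeds:
ctx.sql('SELECT "PULocationID" FROM taxi').show()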

So much to do and so little time

With the main thrust of the migration behind us, it’s time to think about what’s next. Two areas where we see a lot of opportunity for improvement are Iceberg compatibility (delete file support, better performance) and metadata caching (which promises to be extremely fruitful).

Caching is generally very dear to our heart. Since we got to know Xiangpeng quite well while hacking DataFusion for an optimization prototype (stay tuned for an upcoming blog post!), we decided to financially support his PhD project, LiquidCache, in partnership with InfluxDB. Not only is LiquidCache built on principles we are keen to adopt ourselves (we’re very interested in “active” caching, i.e. performing some computation at the caching layer to exploit the underlying semantics of Arrow tables), it’s also a chance to keep working at the intersection of academia and industry concerns.

Acknowledgements

“Database research is like Russia: full of swamps, and often invaded by the Germans.” – (Almost) Roger Nimier

We’d like to thank the DuckDB inventors and contributors for the incredible product they built, and the great code base they maintain: Bauplan would not have been possible without its “spare parts”, and DuckDB was an important one.

Thanks to Ryan Curtin for explaining scans, go-karts, and many other things to us; to Paul Dix for setting Williamsburg’s shirt-fashion trends and for his pioneering support of open-source engines; to Andrew Lamb for the stewardship of a thriving, welcoming open-source community.

Thanks to Stephanie Wang and Ciro for their feedback on a preliminary version of this draft.

Finally, thanks to Colin Marc for giving us the last nudge we needed, and to all the Bauplaners who build and ship while I write and talk.
