Rethinking Data Pipelines in Python

Eliminating cross-runtime complexity in your data stacks
Ciro Greco
Dec 5, 2025

Intro

Python became the default language for ML and AI because it gave developers a single, expressive environment for modeling, experimentation, and deployment.

Data engineering never made the same transition. Most teams still run pipelines where the orchestration is Python-friendly, but the actual data processing still depends on JVM engines, Spark clusters, container fleets, or warehouse-specific SQL dialects. The result is a split-brain model: Python for the control plane, something entirely different for the data plane.

This is what we are going to talk about.

Python Everywhere

Over the past 15 years, Python evolved from a scripting tool into the most widely used programming language, especially in AI, machine learning, and data science (2024 GitHub Octoverse).

Besides Python’s syntax being more concise and approachable than many other languages used for scientific computing or production infrastructure (C++, Java, Scala), I think one of the main reasons Python has been so successful is this: you can do pretty much everything in it, from data cleaning to model building to deployment. You can use NumPy and SciPy for vectorized numerics, Pandas and Polars for tabular data, Scikit‑learn for ML, and PyTorch for deep learning.

Orchestration followed a similar trajectory. As the industry moved beyond cron jobs and GUI-based ETL tools, developers gravitated toward workflow systems that treated pipelines as code. Python became the natural choice for this style of orchestration because it was flexible, general-purpose, and already widely adopted. Airflow was the first major example, and later Prefect pushed the idea further by letting teams express scheduling, retries, and dependencies as ordinary Python functions.
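
To make that concrete, here is a minimal, hypothetical sketch of what "pipelines as code" looks like in Prefect (the task and flow names are invented for the example): retries, dependencies, and control flow are just decorated Python functions.

```python
# A minimal, hypothetical Prefect pipeline: names and data are invented for the example.
from prefect import flow, task


@task(retries=3, retry_delay_seconds=30)
def extract() -> list[dict]:
    # In a real pipeline this would pull from an API or object storage.
    return [{"order_id": 1, "amount": 42.0}]


@task
def validate(rows: list[dict]) -> list[dict]:
    # Drop malformed rows; Prefect records this as a dependent task run.
    return [r for r in rows if r["amount"] > 0]


@flow
def daily_orders():
    # Calling tasks inside a flow defines the dependency graph implicitly.
    rows = extract()
    return validate(rows)


if __name__ == "__main__":
    daily_orders()
```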

The result was that, by the time organizations began operationalizing ML workloads, the control plane already lived in the same environment as their experimentation. Modeling and orchestration spoke the same language.

But Not for Data

The data layer, however, did not. Even though Python is the go-to language for so many data-related jobs, developers were pushed toward JVM-based engines like Spark, Presto, or Trino as data volume and workload complexity grew. Why, you may ask? There are a few reasons.

First, when JVM-based big-data engines were first developed, machines were nowhere near as large as they are today. A lot of data workloads (especially those involving heavy transformations) that today fit comfortably on a single large machine required distributed processing back then. Python’s runtime and concurrency model made it a poor fit for these early distributed-compute systems: it is interpreted, dynamically typed, and slower on CPU-bound workloads than JVM languages, which benefit from JIT compilation and predictable memory management.

Second, the way big-data engines interact with storage makes Python inefficient at scale. Most data catalogs and table formats were designed for those engines: Hive Metastore, Glue Catalog, Apache Iceberg, and Delta Lake perform metadata planning, partition pruning, predicate pushdown, and snapshot management inside JVM-native engines.

Without all these nice things, the Python developer is left reading raw Parquet or CSV files from S3 with no optimizations. As a result, operations that should touch only a few partitions end up scanning far more data than necessary, a tax that grows with scale: more I/O, slower performance, and wasted compute.
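
To see what that tax looks like, here is a hypothetical PyArrow example (the bucket, partition scheme, and column names are made up): even with hand-rolled Hive-style partitioning, a naive read scans everything, and you only avoid it if you remember to push filters and column selection down yourself; the catalog-level planning (snapshots, manifests, statistics) that JVM engines get for free is still missing.

```python
# Hypothetical example: the bucket, partition scheme, and columns are made up.
import pyarrow.dataset as ds

# Hive-partitioned Parquet laid out as s3://my-bucket/events/dt=YYYY-MM-DD/...
dataset = ds.dataset(
    "s3://my-bucket/events/",
    format="parquet",
    partitioning="hive",
)

# Naive read: materialize everything, then filter in pandas.
# Every partition and every column gets scanned over the network.
all_rows = dataset.to_table().to_pandas()
one_day = all_rows[all_rows["dt"] == "2024-01-01"]

# Pruned read: push the filter and column selection into the scan,
# so only the matching partition and columns are fetched from S3.
pruned = dataset.to_table(
    filter=ds.field("dt") == "2024-01-01",
    columns=["user_id", "amount"],
).to_pandas()
```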

Third (a less data-specific reason), using Python in large or cloud-based data pipelines often runs into dependency-management problems. The broader Python ecosystem has thousands of packages, and different pipelines or services need different versions of the same libraries, which increases friction when deploying Python pipelines in shared or cloud environments (“it works on my machine!”).

So in the end, the conventional wisdom is that when it comes to data processing, Python is for boys and the JVM is for men.

The great divide

The result is that data pipelines become hybrid systems by necessity: Python for orchestration, Spark or Trino for the actual execution, and a JVM-native catalog to manage schema evolution and table metadata. Python APIs like PySpark might give the impression that you can stay within a single language, but they are thin wrappers over a JVM runtime. When something goes wrong, you still end up debugging the engine beneath them, not the Python layer above.

The deeper problem is that engineers spend more time navigating boundaries than building. Control flow, retries, and error handling live in Python, but the core transformations live in distributed JVM or SQL engines. Storage and metadata add another runtime entirely. That forces developers to switch mental models constantly.

In practice, a “Python pipeline” is usually Python orchestrating code whose real behavior, performance, and failure modes come from a completely different system. As soon as something breaks, the abstraction leaks and debugging means chasing logs across runtimes. Performance tuning, or understanding how the catalog interprets snapshots and manifests, requires non-trivial JVM knowledge.

This fragmentation is one of the primary reasons pipelines become fragile. Pipelines break not because orchestration tools are weak, but because engineers lack visibility into what’s happening inside the data itself (late-arriving records, partial batches, silent schema drift, column-level inconsistencies, file skew, or misaligned metadata).

It’s Python all the way down

If the core problem is that Python became the control plane but never became the data plane, the natural question is: what would it take to make data processing itself behave like a first-class Python system? Not a wrapper over a JVM engine or a client sprinkled around object storage, but a full Lakehouse engine that speaks Python natively while retaining the performance, correctness, and guarantees of modern table formats.

Prefect already gives teams a clean, Python-native orchestration model. Pipelines are ordinary functions instead of DSLs. Triggers, retries, and transactional flow execution run inside the same language people use for analysis, modeling, and experimentation. Developers stay in Python because orchestration does not require anything else.

Bauplan provides a Lakehouse engine that exposes table operations, data branching, versioning, processing optimization, and scalable execution through a Function-as-a-Service Python runtime.

Instead of treating Python as a thin wrapper around Spark or Trino, Bauplan executes transformations, metadata operations, and isolation semantics natively, inside a cloud runtime that is purpose-built for Python.

The consequence is that the same pipeline that Prefect orchestrates in Python is also executed in Python with no language boundary. There is no hidden engine, no serialization overhead, and no need to mentally translate between Spark operators, Iceberg manifests, and orchestration. Developers write transformations, quality checks, table mutations, and metadata operations as ordinary Python functions. Bauplan handles optimization under the hood (pushdowns, partition pruning, metadata planning) in a way that matches what JVM systems do, but without exposing the runtime boundary to the user.
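
As a sketch of what this looks like (based on Bauplan’s decorator-style Python API; the table, columns, and exact parameters here are illustrative and may differ from the current SDK), a transformation is just a decorated function whose inputs declare what to scan:

```python
# A sketch based on Bauplan's decorator-style API; the table, columns, and exact
# parameters are illustrative and may differ from the current SDK.
import bauplan


@bauplan.model()
@bauplan.python("3.11", pip={"pandas": "2.2.0"})  # per-function dependencies
def clean_orders(
    # Declaring columns and a filter on the input lets the engine plan the scan
    # (column pruning, filter pushdown) before any Python code runs.
    data=bauplan.Model(
        "raw_orders",
        columns=["order_id", "amount", "dt"],
        filter="amount > 0",
    ),
):
    # The input arrives as an Arrow table; returning a DataFrame materializes
    # the model (assuming default materialization settings).
    df = data.to_pandas()
    df["amount_usd"] = df["amount"].round(2)
    return df
```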

This eliminates the need for hybrid systems. The “outer loop” of scheduling and control (flows, retries, rollbacks, dynamic triggers) runs in Prefect. The “inner loop” of data manipulation (ingestion, transforms, validation, table updates) runs directly in Bauplan. Both are Python. Both share the same semantics. Both operate on the same versioned state.

What used to require orchestration in Python and execution in a JVM stack becomes a coherent, single-language workflow. Instead of pipelines that scatter logic across Python files, Spark notebooks, SQL scripts, and JVM-based catalogs, you get a pipeline whose control and data plane are unified. This solves two structural problems at once:

  1. Data manipulation becomes safer because Bauplan’s engine isolates every operation on its own branch, versioning all changes and providing point-in-time visibility for debugging.
  2. Operational workflows become simpler because Prefect can coordinate the entire lifecycle without crossing runtimes.

The surprising thing is how little code is required once both layers speak the same language.

You can easily build an end-to-end data pipeline with data quality testing and guarantees (e.g. write-audit-publish patterns): something that would normally span multiple tools becomes a small, ordinary Python script.

Task-level transactions in Prefect complement data-level transactions in Bauplan: when a task fails, Prefect manages retries or rollback, and when a data transformation fails, Bauplan preserves the branch and its state.
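
Here is a condensed, hypothetical sketch of that interplay (the method names follow Bauplan’s documented Python client: create_branch, run, query, merge_branch; exact signatures may differ, so check the example repo and integration guide linked below, and the branch and table names are invented):

```python
# Hypothetical write-audit-publish sketch: Prefect handles the control plane,
# the Bauplan client handles the data plane. Branch and table names are invented;
# check the linked repo and guide for exact client signatures.
import bauplan
from prefect import flow, task

client = bauplan.Client()


@task(retries=2)
def write_on_branch(branch: str):
    # Write: run the pipeline on an isolated branch, never on main.
    client.create_branch(branch, from_ref="main")
    client.run(project_dir=".", ref=branch)


@task
def audit(branch: str) -> bool:
    # Audit: a data-quality check against the branch (the query is illustrative).
    result = client.query(
        "SELECT COUNT(*) AS bad_rows FROM clean_orders WHERE amount < 0",
        ref=branch,
    )
    return result.to_pylist()[0]["bad_rows"] == 0


@flow
def wap(branch: str = "ingest_2024_01_01"):
    write_on_branch(branch)
    if audit(branch):
        # Publish: merge the audited branch into main.
        client.merge_branch(source_ref=branch, into_branch="main")
    else:
        # Failure: the branch and its state are preserved for inspection.
        raise ValueError(f"Audit failed; branch {branch} left in place for debugging")
```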

See you, Python cowboy

In the end, the Bauplan–Prefect combination collapses a split that has shaped data engineering for a decade. Instead of orchestrating Python around a JVM engine, you build pipelines where orchestration and execution live in the same language, follow the same semantics, and operate on a shared, versioned state. You keep the ergonomics and accessibility of Python, without giving up metadata planning, pushdowns, or the performance expectations of a modern lakehouse. It is the first time Python can behave not just as the control plane for data systems, but as the data system itself.

If you want to see what this looks like in practice, the full example pipeline is available in our open-source repo: https://github.com/BauplanLabs/wap_with_bauplan_and_prefect

And for production integration, the Prefect–Bauplan guide walks through configuration, deployment, and best practices: https://docs.bauplanlabs.com/integrations/orchestrators/prefect
