Launching Bauplan MCP Server: the First Step towards the Agentic Lakehouse

Launching Bauplan MCP Server

Research

Reproducible data science over data lakes: data pipelines with Bauplan and Nessie

Paper presented at SIGMOD/PODS 2024. Awarded best presentation DEEM@SIGMOD.
Ciro Greco
Jacopo Tagliabue
June 9, 2024

Abstract

As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data practitioners. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to decouple compute from data management, by leveraging a cloud runtime alongside Nessie, an open-source catalog with Git semantics. Demonstrating the system's capabilities, we showcase its ability to offer time-travel and branching semantics on top of object storage, and offer full pipeline reproducibility with a few CLI commands.


Read the full paper (released at SIGMOD/PODS 2024, awarded best presentation DEEM@SIGMOD)

Share on

More From Our Blog

Love Python and Go development, serverless runtimes, data lakes and Apache Iceberg, and superb DevEx? We do too! Subscribe to our newsletter.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.