From Postgres to your first data lakehouse
Most companies start their data journey with a transactional database like Postgres, collecting data from their applications and serving it back to users. But before long, new needs emerge: embedded analytics inside the product, dashboards for business teams, data pipelines for reporting, or preparing features for AI models. The emerging standard for these workloads is the Data Lakehouse — object storage paired with efficient, scalable compute — which delivers the flexibility and cost profile modern teams need.
The challenge is that lakehouses are notoriously hard to set up. Standing one up means wrangling multiple tools — ingestion frameworks, table formats, catalogs, orchestrators, and distributed runtimes. For a startup with a lean team, that’s a lot of moving parts and engineering effort. Many companies decide they “aren’t ready” for a lakehouse yet, believing it’s only for mature data teams with big budgets and complex pipelines.
That’s not true. A lakehouse is not just for advanced use cases. It’s the best foundation for analytics, BI, and AI workloads from day one and has the virtue of growing with the needs of your team.
Bauplan is the simplest way to get a Lakehouse up and running. If your data lives in Postgres and you are ready to take on analytical workloads and data pipelines, it is the right moment to implement Change Data Capture (CDC) from Postgres into your first Lakehouse.
From Postgres to your first Lakehouse
Postgres is a fantastic transactional database, but it is not a data platform. Once you start building data pipelines, analytical (OLAP) workloads, and AI features, it is a good idea to graduate from Postgres and start thinking about a Data Lakehouse.
- Scale and cost. Analytical reads and feature engineering scan large ranges, join across big tables, and require historical context. That punishes your OLTP database and drives costs up as soon as volume grows. A data lakehouse on object storage gives you low-cost, virtually unlimited capacity, so you can retain history and run heavy analytics without stressing Postgres.
- Analytics and ML are column-oriented; Postgres is not. Columnar workloads (long scans, aggregations, selective projections) are faster and cheaper when data is stored in a columnar format and processed with vectorized engines. Keep Postgres as your system of record, and land analytics in S3 as Apache Iceberg tables. You read only the columns you need, benefit from columnar compression, and unlock vectorized execution engines.
- Separation of storage and compute. In Postgres, storage and compute are tied to one database server. With a lakehouse, data sits once in S3 and multiple compute engines attach on demand. You can scale up for backfills, spin down when idle, isolate BI from ETL, and run ML feature jobs without impacting transactions. This gives you elastic performance, clear cost control, and safer concurrency.
- Interoperability and openness. A solid data platform should support Python, SQL, BI, machine learning, and AI as you grow. A lakehouse based on Apache Iceberg lets multiple engines operate on the same governed tables. Even if you do not need everything on day one, moving data to object storage future-proofs the stack by avoiding warehouse lock-in and keeping data in open formats. A lakehouse can start small and grow with you.
- Immutability and replayability. In analytics and ML you often need to reprocess history: backfills after logic changes, feature recomputation for model re-training, and audit scenarios. You must be able to re-run pipelines against a fixed snapshot of inputs and compare results without corrupting production tables. Postgres mutates data in place and relies on VACUUM, which makes exact historical replays brittle, even with read replicas. Lakehouses are better suited for this because data is stored as immutable files on object storage, and Iceberg tables provide ACID snapshots for time travel.
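To make replayability concrete, here is a minimal sketch (assuming the Bauplan Python client; the table and ref names are hypothetical) of re-running a read against a pinned ref so the same inputs are seen every time:

```python
import bauplan

# Minimal sketch: read against a pinned ref/branch instead of "whatever main
# is right now", so a backfill or audit can be replayed on the same snapshot.
# Method and parameter names follow Bauplan's Python client but treat them as
# illustrative; the table and ref names are hypothetical.
client = bauplan.Client()

rows = client.query(
    "SELECT order_id, amount FROM orders",  # hypothetical table
    ref="audit-2024-06-backfill",           # hypothetical branch/ref
)
print(rows.num_rows)                         # assuming an Arrow table comes back
```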
So why not run everything on Postgres? Because using Postgres for analytics creates lock contention, noisy neighbors, long-running queries, and risky replicas. There’s no native time travel for analytical debugging, no table versioning for experiments, and CDC becomes both your source and your sink. Keep Postgres clean as the source of truth. Land analytics in S3 as Iceberg tables, then build pipelines, BI, and AI on top.
The simplest path to the Lakehouse
Day one: full snapshots in S3
If your Postgres instance is modest in size (≤20 GB) and you don’t need low-latency replication, skip the complexity of CDC.
A good option for that is to use dltHub to take periodic full snapshots: configure Postgres and S3 credentials, then run a pipeline that copies your tables into S3 as Parquet files.
- dltHub supports both single-table extraction and full-schema export, so you can decide whether to replicate everything or just what you need.
- Once the files land in S3, use Bauplan to register them as Iceberg tables with a simple create_table and import_data step (see the sketch below).
- You now have a reproducible, lakehouse-ready foundation — with schema evolution, branching, and versioning — without the operational overhead of CDC.
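As a sketch of this day-one flow, assuming dlt's sql_database source and filesystem destination plus Bauplan's Python client (connection strings, bucket paths, and table names are placeholders):

```python
import dlt
from dlt.sources.sql_database import sql_database
import bauplan

# 1) Full snapshot: copy Postgres tables to S3 as Parquet with dlt.
#    The S3 bucket_url and credentials for the filesystem destination are read
#    from dlt's config/secrets (e.g. .dlt/secrets.toml or environment variables).
pipeline = dlt.pipeline(
    pipeline_name="pg_snapshot",
    destination="filesystem",      # S3-compatible destination
    dataset_name="raw_postgres",
)
source = sql_database(
    credentials="postgresql://user:password@host:5432/appdb",  # placeholder DSN
).with_resources("orders", "customers")  # omit to export the whole schema
pipeline.run(source, loader_file_format="parquet")

# 2) Register the Parquet files as an Iceberg table in Bauplan.
#    Parameter names follow Bauplan's Python client; paths are placeholders.
client = bauplan.Client()
client.create_table(
    table="orders",
    search_uri="s3://my-bucket/raw_postgres/orders/*.parquet",
    branch="main",
)
client.import_data(
    table="orders",
    search_uri="s3://my-bucket/raw_postgres/orders/*.parquet",
    branch="main",
)
```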
Day two: Change Data Capture (CDC)
The goal is to continuously mirror Postgres changes into Iceberg tables on S3 with correctness and low toil. In plain terms, CDC captures row-level changes from Postgres and commits them into Iceberg so analytics engines can read fresh, governed tables.
- Operability: fully managed vs self-hosted open source. If you want minimal ops and are cloud-agnostic, use Fivetran Managed Data Lakes or Estuary Flow; both land directly in Iceberg.
- Freshness: near real time (seconds to minutes) or micro-batch (every few minutes). For sub-minute latency and streaming transforms, use Debezium to Kafka with the Iceberg Sink; add Flink or RisingWave for SQL transforms in flight. For micro-batches, use Airbyte Postgres to S3 Data Lake (Iceberg). If you already run Flink or prefer SQL over streams, use Flink CDC to Iceberg sink. RisingWave also offers a simpler SQL-first streaming DB with an Iceberg sink.
- Transform timing: transform before landing (stream SQL) or land raw and transform in pipelines. Managed streaming tools like Estuary can merge updates as they arrive; OSS stacks often land change events first, then upsert/compact in Iceberg.
- Iceberg catalog: choose one and standardize. Iceberg REST maximizes engine portability; AWS Glue integrates tightly with AWS; Nessie gives Git-like branches/tags; Unity Catalog can interoperate with Iceberg via the Iceberg REST catalog. If Bauplan is your landing catalog, remember that it is built on Nessie. Prefer writers that can target Bauplan’s Nessie endpoint (ideally a staging branch); if your tool cannot, land ordered CDC files in S3 and let Bauplan import and merge on a branch, then publish atomically.
- Guarantees: aim for exactly-once or idempotent upserts, ordered events per key, and schema evolution. The Iceberg Kafka Connect sink provides exactly-once with centralized commit coordination; Iceberg handles schema evolution and time-travel, and you should define identifier fields plus an ordering token (e.g., LSN) and set compaction to avoid small files.
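The "identifier fields plus an ordering token" idea is easiest to see with a toy example. Real sinks (the Iceberg Kafka Connect sink, Flink, RisingWave) implement this inside the engine; the plain-Python sketch below only illustrates the semantics of idempotent, LSN-ordered upserts.

```python
# Toy illustration: upserts keyed on an identifier field (id), ordered by an
# ordering token (the Postgres LSN), applied idempotently so duplicate
# deliveries are no-ops and deletes win over earlier updates.
events = [
    {"id": 1, "lsn": 100, "op": "insert", "amount": 10},
    {"id": 1, "lsn": 140, "op": "update", "amount": 25},
    {"id": 2, "lsn": 120, "op": "insert", "amount": 7},
    {"id": 2, "lsn": 150, "op": "delete", "amount": None},
    {"id": 1, "lsn": 140, "op": "update", "amount": 25},  # duplicate delivery
]

state: dict[int, dict] = {}
for ev in sorted(events, key=lambda e: e["lsn"]):
    current = state.get(ev["id"])
    # Skip anything not newer than what we already applied: idempotency.
    if current is not None and ev["lsn"] <= current["lsn"]:
        continue
    if ev["op"] == "delete":
        state.pop(ev["id"], None)
    else:
        state[ev["id"]] = ev

print(state)  # id 1 at lsn 140 with amount 25; id 2 was deleted
```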
If you want the simplest path to a Lakehouse, start with Bauplan. It is a single platform that combines data versioning, pipeline orchestration, and scalable compute over object storage in pure Python and SQL.
Bauplan provides a few clear APIs — branch, query, run, commit, merge — and a declarative way to define pipelines in Python and SQL, then execute directly on S3 with managed infrastructure and no Spark clusters.
For CDC there are three main options:
- Land ordered CDC event files in S3 as Parquet, then let Bauplan import and upsert into Iceberg on a branch and publish atomically. Bauplan supports the Iceberg REST catalog interface and branch-aware operations, so once a writer can speak REST it can land into a Bauplan branch cleanly.
- Register files managed by an external ingestion catalog (e.g. RisingWave, TableFlow) in Bauplan as read-only tables. In multi-catalog setups this is sometimes unavoidable, but it still lets Bauplan reference the data consistently.
- Use Bauplan’s Iceberg catalog as the landing catalog. To avoid duplicating catalogs for this use case, writers that can point directly to Bauplan’s Iceberg REST endpoint are preferable. Bauplan’s Iceberg catalog is based on Nessie.
Some options
1) RisingWave (stream SQL + low ops)
When you want streaming transforms with SQL before data lands.
- Set up Postgres CDC into RisingWave and define the materialized views you need in SQL. When you are ready to land results, create an Iceberg sink that writes to an Iceberg catalog managed by RisingWave.
- Expose those tables to Bauplan by registering the RisingWave‑managed Iceberg tables as external tables.
In this integration, you run two Iceberg catalogs: one owned by RisingWave (writer) and one owned by Bauplan (reader/curation). RisingWave produces Iceberg tables in its own catalog, and Bauplan registers those tables as external and uses them downstream. This keeps commit and compaction responsibility with RisingWave; you can configure commit intervals and snapshot expiration. If you choose Amazon S3 Tables as the catalog, you also get catalog-side automatic compaction.
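A sketch of what the RisingWave side could look like, issued over its Postgres-compatible port with psycopg2. The SQL follows RisingWave's postgres-cdc source and Iceberg sink, but the parameter names and values are assumptions to adapt from the RisingWave docs; hosts, credentials, and table names are placeholders.

```python
import psycopg2

# Connect to RisingWave (Postgres wire protocol); host/port/user are placeholders.
conn = psycopg2.connect(host="risingwave-host", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# 1) CDC source from Postgres (parameter names are illustrative; check the docs).
cur.execute("""
    CREATE SOURCE pg_cdc WITH (
        connector = 'postgres-cdc',
        hostname = 'postgres-host',
        port = '5432',
        username = 'replication_user',
        password = 'secret',
        database.name = 'appdb',
        schema.name = 'public'
    );
""")

# 2) Materialize a CDC table and a streaming transform in SQL.
cur.execute("""
    CREATE TABLE orders (order_id INT PRIMARY KEY, amount DECIMAL, created_at TIMESTAMP)
    FROM pg_cdc TABLE 'public.orders';
""")
cur.execute("""
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT created_at::DATE AS day, SUM(amount) AS revenue
    FROM orders GROUP BY 1;
""")

# 3) Land the result as an Iceberg table in a RisingWave-managed catalog,
#    which Bauplan can then register as an external table.
cur.execute("""
    CREATE SINK daily_revenue_iceberg FROM daily_revenue WITH (
        connector = 'iceberg',
        type = 'upsert',
        primary_key = 'day',
        warehouse.path = 's3://my-bucket/iceberg',
        database.name = 'analytics',
        table.name = 'daily_revenue'
        -- plus the catalog and S3 credential settings for your setup
    );
""")
```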
2) Airbyte (fully open source, fastest setup)
When micro-batch freshness is fine and you want a quick open-source start.
- Use Airbyte’s Postgres source.
- Set the destination to S3 with Parquet output.
- Land each stream into its own S3 prefix.
- In Bauplan, point create_table / import_data at those Parquet paths on a staging branch, run expectations, then merge to main (see the sketch below).
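A minimal sketch of that last step, assuming Bauplan's Python client (method and parameter names are illustrative; branch, table, and S3 paths are placeholders):

```python
import bauplan

client = bauplan.Client()

# 1) Work on a staging branch so nothing touches main until checks pass.
branch = "cdc/orders_staging"                      # hypothetical branch name
client.create_branch(branch, from_ref="main")

# 2) Register and load the Airbyte-produced Parquet files on that branch.
uri = "s3://my-bucket/airbyte/orders/*.parquet"    # placeholder prefix
client.create_table(table="orders", search_uri=uri, branch=branch)
client.import_data(table="orders", search_uri=uri, branch=branch)

# 3) Run a simple expectation before publishing: no NULL primary keys.
result = client.query(
    "SELECT COUNT(*) AS bad_rows FROM orders WHERE order_id IS NULL",
    ref=branch,
)
assert result.to_pylist()[0]["bad_rows"] == 0, "expectation failed, not merging"

# 4) Publish atomically by merging the staging branch into main.
client.merge_branch(source_ref=branch, into_branch="main")
```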
Your first data lakehouse is closer than you think
Moving from Postgres to a lakehouse doesn’t have to be a multi-month infrastructure project. With Bauplan, you can stand up a production-ready Iceberg lakehouse in days, not weeks — using the tools and patterns outlined in this guide. You’ll get a single place to land, version, test, and publish your data, with the safety of Git-style branches and the flexibility of open formats.
Whether you start with RisingWave, Airbyte, or Kafka, Bauplan makes the lakehouse simple, safe, and fast to run. No warehouse lock-in, no fragile glue code, just a clean foundation for analytics, BI, and AI.
Try Bauplan for free or book a call with our team to plan your first Postgres-to-Lakehouse migration.