Data Engineering and Automation in the Era of Agents
“Besides black art, there is only automation and mechanization.” — Federico García Lorca
Modern data platforms are standing at a threshold. On one side, the promise of AI woven into every application, delivered at the pace of software. On the other, the daily work of data engineers dealing with the fragility of data pipelines.
Data engineers have always relied on automation (that is the whole point of pipelines), but data pipelines remain fragile in practice. Data infrastructure is highly fragmented, each new tool introduces another interface to learn, and when something inevitably fails, it takes a fair amount of annoying work to bring things back on track.
AI agents represent a different kind of automation. They can explore alternatives, adjust when conditions change, and attempt several solutions rather than simply crash.
AI agents create the possibility of offloading the repetitive mechanics that eat so much of a data engineering team’s time. But for that to work, our platforms have to change. Current data systems were built for people at keyboards, not for humans and agents working side by side, and not for programmatic trust.

What’s worth automating
As a data engineer, this is my Monday at 9:00 am. Several jobs ran on their schedules at different times, and some of them failed. My job now is to figure out what happened and restore all the red lights to green.
In principle, failures can occur for different reasons. There could be problems with the data, with the transformation code, with the orchestration logic, with the infrastructure, or more likely, a combination of all of the above.
Fixing this requires that I:
- Debug failing jobs
- Re-run data pipelines
- Write new data quality checks
- Re-import and validate tables
- Chase down schema inconsistencies
Some parts of the job demand imagination: modeling business concepts, setting standards, deciding what counts as truth. Some parts do not; they demand time for repetitive, mechanical work. The latter should be automated by AI agents.
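To make that concrete, here is a minimal sketch of the kind of mechanical check an agent could draft and maintain on its own. The `orders` table and its `order_id` column are made up for illustration; a real check would follow the team’s own naming and tooling.

```python
# Hedged sketch of a repetitive, agent-draftable data quality check.
# Table and column names ("orders", "order_id") are illustrative only.
import sqlite3

def check_orders(conn: sqlite3.Connection) -> list[str]:
    """Return human-readable failures; an empty list means the table is healthy."""
    failures = []

    # No NULL primary keys.
    null_keys = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE order_id IS NULL"
    ).fetchone()[0]
    if null_keys:
        failures.append(f"{null_keys} rows with NULL order_id")

    # No duplicate primary keys.
    dupes = conn.execute(
        "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
    ).fetchone()[0]
    if dupes:
        failures.append(f"{dupes} duplicate order_id values")

    return failures
```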

Rethinking your data platform
If agents are to become first-class actors in the stack, the design center of the platform must shift. Notebooks and GUI-driven workflows are ill-equipped for autonomous systems executing complex operations across our data infrastructure. Agents cannot context switch between SQL consoles, orchestration dashboards, and infrastructure scripts. They need unified and programmable entry points into the platform.
They also cannot operate safely in the fuzzy edges of staging or local environments. If we want agents to help with real work, they must work with production data and manipulate our infrastructure. But agents make mistakes and cannot be trusted blindly, so we need environments that are fully isolated, where every change can be audited and every action reversed.
In other words, we need to rethink our data platforms in order to adopt semantics that look a lot more like software version control: branching, atomic commits, rollbacks, and controlled merges.
Two requirements rise to the surface:
1. Unified, API-first control of data and infrastructure
Most “AI for data” tools today stop at the query layer. You can chat with your tables, explore metrics, maybe even generate SQL. That’s useful, but it doesn’t cover the real work of a data engineer. Exploration is necessary, but not sufficient.
Consider the examples above: when pipelines fail, I don’t know in advance if the problem lies in the data, the transformation code, or the underlying infrastructure. Often it’s a mix of all three. Fixing it means not just asking questions of the data, but also modifying schemas, adjusting pipelines, or provisioning resources.
Agents can’t do this if each layer lives behind a different interface. They need a unified, programmable API that spans data, pipelines, environments, and infrastructure — a single language for the whole system.
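As a sketch of what that could look like (the `PlatformClient` class and its methods are hypothetical, not any particular vendor’s SDK), the point is simply that queries, schema changes, pipeline runs, and provisioning all go through one programmable, auditable surface:

```python
# Hypothetical sketch of a unified, programmable entry point into the platform.
# The class and method names are illustrative, not a real SDK.
from dataclasses import dataclass, field

@dataclass
class PlatformClient:
    actions: list[str] = field(default_factory=list)  # audit trail of every call

    def query(self, sql: str) -> None:
        self.actions.append(f"query: {sql}")

    def alter_schema(self, table: str, ddl: str) -> None:
        self.actions.append(f"alter {table}: {ddl}")

    def rerun_pipeline(self, name: str) -> None:
        self.actions.append(f"rerun pipeline: {name}")

    def provision(self, resource: str, **spec) -> None:
        self.actions.append(f"provision {resource}: {spec}")

# An agent (or a human) drives all the layers through the same interface.
client = PlatformClient()
client.query("SELECT COUNT(*) FROM orders WHERE order_id IS NULL")
client.alter_schema("orders", "ADD COLUMN ingested_at TIMESTAMP")
client.rerun_pipeline("daily_orders")
client.provision("warehouse", size="small")
```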

2. Isolated execution with real production data
The second problem is safety. Agents are only valuable if they can work with production data, but they must do so without ever putting production at risk. That’s where Git-for-Data comes in.
- Every operation happens inside an isolated branch — a zero-copy environment that mirrors production but is safe to experiment with.
- Each change is tracked as a commit, atomic across multiple tables, and every branch maintains a complete history of the transformations applied.
- Integration happens through controlled merges, and if something goes wrong, rollbacks return the system instantly to a previous state.
- Even compute is isolated: functions run with declarative packages, fully reproducible and independent from one another.
These primitives — branches, commits, history, merges, rollbacks, and isolated compute — are what make it possible for agents to iterate freely while keeping production intact.
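Here is a deliberately simplified, in-memory sketch of those primitives, just to show the shape of the workflow an agent would follow. Every class and method name is invented; a real Git-for-Data system would implement branching as a zero-copy metadata operation over table formats and object storage, not Python dictionaries.

```python
# Hedged, in-memory illustration of branch / commit / merge / rollback.
# All names are hypothetical stand-ins for real Git-for-Data primitives.
from copy import deepcopy

class DataRepo:
    def __init__(self, tables: dict):
        self.refs = {"main": deepcopy(tables)}      # "main" stands in for production
        self.history = {"main": []}

    def branch(self, name: str, from_ref: str = "main") -> str:
        self.refs[name] = deepcopy(self.refs[from_ref])   # zero-copy in real systems
        self.history[name] = list(self.history[from_ref])
        return name

    def commit(self, ref: str, message: str) -> None:
        self.history[ref].append((message, deepcopy(self.refs[ref])))  # atomic snapshot

    def rollback(self, ref: str) -> None:
        _, snapshot = self.history[ref][-1]          # return to the last committed state
        self.refs[ref] = deepcopy(snapshot)

    def merge(self, ref: str, into: str = "main") -> None:
        self.refs[into] = deepcopy(self.refs[ref])   # controlled integration into production
        self.history[into].extend(self.history[ref])

# Agent workflow: branch, experiment, validate, then merge or walk away.
repo = DataRepo({"orders": [{"order_id": 1}, {"order_id": None}]})
fix = repo.branch("fix/null-order-ids")
repo.refs[fix]["orders"] = [r for r in repo.refs[fix]["orders"] if r["order_id"] is not None]
repo.commit(fix, "Drop rows with NULL order_id")
if all(r["order_id"] is not None for r in repo.refs[fix]["orders"]):
    repo.merge(fix)  # production only changes after the check passes on the branch
```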

Towards an agentic architecture
Building a data platform that supports both human operators and AI agents requires rethinking traditional lakehouse architecture. Instead of optimizing for notebook-driven exploration and GUI-based management, you need infrastructure that treats data operations as programmatic, versioned, and automatable from the ground up.
This means:
- Everything-as-Code: Data pipelines, infrastructure provisioning, data quality rules, and access controls should all be declarative and version-controlled. Agents need to understand and modify the complete data system state, not just individual components.
- Branching-Native Architecture: The platform should support isolated data environments as a first-class concept, with zero-copy branching, atomic commits across multiple data assets, and controlled merging processes.
- Unified API Surface: Rather than separate interfaces for data access, pipeline management, and infrastructure control, agents need consistent programmatic access to all platform capabilities.
- Built-in Observability: Agents need comprehensive instrumentation to understand system state, debug issues, and make informed decisions about infrastructure changes.
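As one hedged illustration of the Everything-as-Code point above (the class and field names are invented for this sketch), a pipeline, its quality rules, and its access policy can all live as plain, version-controlled declarations that an agent can read, diff, and propose changes to:

```python
# Hedged sketch of Everything-as-Code: the pipeline definition, its quality
# rules, and its access policy are ordinary declarations under version control.
# All names here are illustrative, not a specific framework's API.
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityRule:
    table: str
    predicate: str                     # SQL predicate that must hold for every row

@dataclass(frozen=True)
class PipelineSpec:
    name: str
    schedule: str                      # cron expression
    sql: str                           # the transformation itself, as code
    quality: tuple[QualityRule, ...] = ()
    readers: tuple[str, ...] = ()      # access control, also as code

daily_orders = PipelineSpec(
    name="daily_orders",
    schedule="0 6 * * *",
    sql="INSERT INTO orders_clean SELECT * FROM orders_raw WHERE order_id IS NOT NULL",
    quality=(QualityRule("orders_clean", "order_id IS NOT NULL"),),
    readers=("analytics", "finance"),
)
```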
Organizations that build data platforms designed for agentic workflows will gain significant operational advantages. While competitors struggle with the growing complexity of manual data operations, teams with agent-capable platforms will iterate at software development speeds.
The question for data platform teams isn't whether agentic workflows are coming; for early adopters, they're already here. The question is whether your platform architecture will support this evolution or require a fundamental rebuild as agent capabilities advance.
--
The future of data platforms lies in architectures that treat automation as a first-class design principle, not an afterthought. Build for the agents, and you'll unlock capabilities that benefit both human operators and autonomous systems.