Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents
- URL: http://arxiv.org/abs/2602.02335v2
- Date: Tue, 10 Feb 2026 15:46:24 GMT
- Title: Building a Correct-by-Design Lakehouse. Data Contracts, Versioning, and Transactional Pipelines for Humans and Agents
- Authors: Weiming Sheng, Jinlang Wang, Manuel Barros, Aldrin Montana, Jacopo Tagliabue, Luca Bigon,
- Abstract summary: Bauplan is a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity.
- Score: 1.9161188920101901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lakehouses are the default cloud platform for analytics and AI, but they become unsafe when untrusted actors concurrently operate on production data: upstream-downstream mismatches surface only at runtime, and multi-table pipelines can leak partial effects. Inspired by software engineering, we design Bauplan, a code-first lakehouse that aims to make (most) illegal states unrepresentable using familiar abstractions. Bauplan acts along three axes: typed table contracts to make pipeline boundaries checkable, Git-like data versioning for review and reproducibility, and transactional runs that guarantee pipeline-level atomicity. We report early results from a lightweight formal transaction model and discuss future work motivated by counterexamples.
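The three axes named in the abstract can be illustrated with a toy in-memory model. This is a minimal sketch under stated assumptions, not Bauplan's API: `Catalog`, `check_contract`, `transactional_run`, and the `run-tmp` branch name are all hypothetical names introduced here. Contracts are plain column-to-type mappings, a "branch" is a deep copy of the table map, and a run publishes its outputs to `main` only if every step passes its contract.

```python
from copy import deepcopy


class ContractError(Exception):
    """Raised when a table's rows do not match its declared schema."""


class Catalog:
    """Hypothetical stand-in for a versioned catalog: branch -> {table -> rows}."""

    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        # A branch is modeled as a full copy; a real system would share data.
        self.branches[name] = deepcopy(self.branches[source])

    def merge(self, name, into="main"):
        # Simplification: the branch replaces the target wholesale, so
        # concurrent writers to the target are not handled in this toy model.
        self.branches[into] = self.branches.pop(name)


def check_contract(rows, schema):
    """Check that every row has exactly the schema's columns and types."""
    for row in rows:
        if set(row) != set(schema):
            raise ContractError(f"columns {sorted(row)} != {sorted(schema)}")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise ContractError(f"{column!r} is not {expected.__name__}")


def transactional_run(catalog, steps, tmp="run-tmp"):
    """Run (table, schema, fn) steps on a temporary branch, then merge.

    If any step fails, the branch is dropped and main is left untouched,
    so no partial set of tables ever becomes visible.
    """
    catalog.branch(tmp)
    try:
        for table, schema, fn in steps:
            rows = fn(catalog.branches[tmp])
            check_contract(rows, schema)
            catalog.branches[tmp][table] = rows
    except Exception:
        del catalog.branches[tmp]
        raise
    catalog.merge(tmp)
```

In this sketch a failing second step discards the first step's table as well, which is the pipeline-level atomicity the abstract describes; the formal transaction model in the paper is not reproduced here.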
Related papers
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL [64.8155693023222]
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
This gap stems from a shortage of high-quality, action-aligned reasoning data.
We present GUI-Libra, a tailored training recipe that addresses these challenges.
arXiv Detail & Related papers (2026-02-25T18:34:57Z)
- BlackCATT: Black-box Collusion Aware Traitor Tracing in Federated Learning [51.251962154210474]
We present a general collusion-resistant embedding method for black-box traitor tracing in Federated Learning: BlackCATT.
Experimental results confirm the efficacy of the proposed scheme across different architectures and datasets.
For models that would otherwise suffer from update incompatibility on the main task, our proposed BlackCATT+FR incorporates functional regularization.
arXiv Detail & Related papers (2026-02-12T16:26:57Z)
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable.
We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z)
- TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents [70.68963723787424]
TermiGen is an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories.
Our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench.
arXiv Detail & Related papers (2026-02-06T23:56:50Z)
- Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance [5.3013727160110085]
We argue that the path to trustworthy agentic begins with solving the infrastructure problem first.
We propose an agent-first design, Bauplan, that re-implements data and compute isolation in the lakehouse.
We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan.
arXiv Detail & Related papers (2025-11-20T14:21:34Z)
- Safe, Untrusted, "Proof-Carrying" AI Agents: toward the agentic lakehouse [3.6729718095918393]
API-first, programmable lakehouses provide the right abstractions for safe-by-design, agentic lakehouses.
We present a proof-of-concept in which agents repair data pipelines using correctness checks inspired by proof-carrying code.
arXiv Detail & Related papers (2025-10-10T17:18:36Z)
- Lang-PINN: From Language to Physics-Informed Neural Networks via a Multi-Agent Framework [54.447408954009454]
Physics-informed neural networks (PINNs) provide a powerful approach for solving partial differential equations (PDEs).
We present Lang-PINN, an LLM-driven multi-agent system that builds trainable PINNs directly from natural language task descriptions.
Experiments show that Lang-PINN achieves substantially lower errors and greater robustness than competitive baselines.
arXiv Detail & Related papers (2025-10-03T08:20:02Z)
- DyFlow: Dynamic Workflow Framework for Agentic Reasoning [79.19799197382478]
DyFlow is a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback.
We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation.
Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains.
arXiv Detail & Related papers (2025-09-30T10:36:23Z)
- Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines [18.75611679837171]
We introduce Text-to-Pipeline, a new task that translates natural language (NL) data preparation instructions into data preparation (DP) pipelines.
ParROT is a large-scale benchmark to support systematic evaluation.
ParROT is built by mining transformation patterns from production pipelines and instantiating them on 23,009 real-world tables.
arXiv Detail & Related papers (2025-05-21T15:40:53Z)
- Bauplan: zero-copy, scale-up FaaS for data pipelines [4.6797109107617105]
bauplan is a novel FaaS programming model and serverless runtime designed for data practitioners.
bauplan enables users to declaratively define functional Directed Acyclic Graphs (DAGs) along with their runtime environments.
We show that bauplan achieves both better performance and a superior developer experience for data workloads by trading generality for data-awareness.
arXiv Detail & Related papers (2024-10-22T22:49:01Z)
- Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie [5.259526087073711]
We introduce a system designed to decouple compute from data management by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to offer full pipeline reproducibility with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.28944613907541]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)