Related papers: ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

URL: http://arxiv.org/abs/2504.04808v2
Date: Mon, 14 Apr 2025 19:46:56 GMT
Title: ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines
Authors: Tengjun Jin, Yuxuan Zhu, Daniel Kang
Abstract summary: We introduce ELT-Bench, an end-to-end benchmark to assess the capabilities of AI agents to build Extract-Load-Transform pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench.
Score: 4.556817293680431
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents' abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ELT-Bench.

Related papers

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI [42.191938707504406]
DataFlow is a unified and LLM-driven data preparation framework. System-level abstractions enable modular, reusable, and composable data transformations. DataFlow consistently improves downstream Large Language Model performance.
arXiv Detail & Related papers (2025-12-18T15:46:15Z)
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle [41.576417987200074]
Real-world enterprise data intelligence encompasses data engineering that turns raw sources into analytical-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex capabilities.
arXiv Detail & Related papers (2025-12-03T23:21:28Z)
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents [85.02904078131682]
We introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets. ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
arXiv Detail & Related papers (2025-10-28T17:53:13Z)
FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering [1.3599496385950987]
FlowETL is an example-based autonomous pipeline architecture designed to automatically standardise and prepare input datasets. A Planning Engine uses a paired input-output dataset sample to construct a transformation plan, which is then applied by a worker to the source. The results show promising generalisation capabilities across 14 datasets of various domains, file structures, and file sizes.
arXiv Detail & Related papers (2025-07-30T21:46:22Z)
KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes [20.75018548918123]
We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines. We show that these pipelines test the end-to-end capabilities of AI systems on data processing. Our results show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, existing out-of-the-box models fall short.
arXiv Detail & Related papers (2025-06-06T21:18:45Z)
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
Text embedding models can be great data engineers [0.0]
We propose ADEPT, an automated data engineering pipeline via text embeddings. We show that ADEPT outperforms the best existing benchmarks in a diverse set of datasets.
arXiv Detail & Related papers (2025-05-20T18:12:19Z)
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines [0.8148009849453334]
Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. In this paper, we introduce PROMPTEVALS, a dataset of 2087 pipeline prompts with 12623 corresponding assertion criteria.
arXiv Detail & Related papers (2025-04-20T21:04:23Z)
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage [75.76940471949366]
We propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data. To preserve the data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories. Evaluations show that the T3-Agent consistently achieves improvements on two popular VLMs.
arXiv Detail & Related papers (2024-12-20T07:00:46Z)
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We present WorkflowLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration. It first constructs a large-scale fine-tuning dataset, WorkflowBench, with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. WorkflowLlama demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z)
ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering. Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans [3.2362171533623054]
We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their Machine Learning pipelines. We extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code.
arXiv Detail & Related papers (2024-07-10T11:35:02Z)
AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning [98.26836657967162]
AgentOhana aggregates agent trajectories from distinct environments, spanning a wide array of scenarios. xLAM-v0.1, a large action model tailored for AI agents, demonstrates exceptional performance across various benchmarks.
arXiv Detail & Related papers (2024-02-23T18:56:26Z)
Accelerated Cloud for Artificial Intelligence (ACAI) [24.40451195277244]
We propose an end-to-end cloud-based machine learning platform, Accelerated Cloud for AI (ACAI). ACAI enables cloud-based storage of indexed, labeled, and searchable data, as well as automatic resource provisioning, job scheduling, and experiment tracking. We show that our auto-provisioner produces a 1.7x speed-up and 39% cost reduction, and our system reduces experiment time for ML scientists by 20% on typical ML use cases.
arXiv Detail & Related papers (2024-01-30T07:09:48Z)
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks [84.7788065721689]
In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files. Building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench.
arXiv Detail & Related papers (2024-01-10T19:04:00Z)
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines [41.39496264168388]
Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time. We propose Auto-Validate-by-History (AVH), which can automatically detect DQ issues in recurring pipelines.
arXiv Detail & Related papers (2023-06-04T17:53:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.