KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
- URL: http://arxiv.org/abs/2506.06541v1
- Date: Fri, 06 Jun 2025 21:18:45 GMT
- Title: KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
- Authors: Eugenie Lai, Gerardo Vitagliano, Ziyu Zhang, Sivaprasad Sudhir, Om Chabra, Anna Zeng, Anton A. Zabreyko, Chenning Li, Ferdi Kossmann, Jialin Ding, Jun Chen, Markos Markakis, Matthew Russo, Weiyang Wang, Ziniu Wu, Michael J. Cafarella, Lei Cao, Samuel Madden, Tim Kraska
- Abstract summary: We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines. We show that these pipelines test the end-to-end capabilities of AI systems on data processing. Our results show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, existing out-of-the-box models fall short.
- Score: 20.75018548918123
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestrating data processing steps given a high-level task. Our evaluation tests 5 general models and 3 code generation models using our reference framework, DS-GURU, which instructs the AI model to decompose a question into a sequence of subtasks, reason through each step, and synthesize Python code that implements the proposed design. Our results on KRAMABENCH show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, when extensive data processing and domain knowledge are required to construct real-world data science pipelines, existing out-of-the-box models fall short. Progress on KramaBench represents a crucial step towards developing autonomous data science agents for real-world applications. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.
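The abstract describes DS-GURU only at a high level: decompose the question into subtasks, reason through each step, then synthesize Python code implementing the proposed design. The sketch below illustrates that three-stage flow under stated assumptions; the `llm` callable, the prompt wording, and the `ds_guru_style_pipeline` helper are hypothetical placeholders for illustration, not the paper's actual implementation (which is available in the linked repository).

```python
# Minimal sketch of a DS-GURU-style decompose -> reason -> synthesize loop,
# based only on the abstract's description. The `llm` callable and prompts are
# illustrative assumptions, not the authors' implementation.
from typing import Callable, List


def ds_guru_style_pipeline(question: str, file_listing: List[str],
                           llm: Callable[[str], str]) -> str:
    # Step 1: decompose the high-level question into ordered subtasks.
    plan = llm(
        "Decompose this data science question into a numbered list of subtasks "
        "(data discovery, cleaning, processing, analysis):\n"
        f"{question}\nAvailable files:\n" + "\n".join(file_listing)
    )

    # Step 2: reason through each subtask before writing any code.
    reasoning = llm(
        "For each subtask below, explain which files, columns, and operations "
        f"are needed and why:\n{plan}"
    )

    # Step 3: synthesize a single Python script implementing the proposed design.
    return llm(
        "Write a self-contained Python script that implements this plan and "
        f"prints the final answer to stdout.\nPlan:\n{plan}\nReasoning:\n{reasoning}"
    )
```

In KramaBench's setting, the synthesized script would presumably then be executed over the benchmark's data files and its output compared against the curated ground truth.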
Related papers
- AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs [24.511434195711267]
AQuilt is a framework for constructing instruction-tuning data for any specialized domain from corresponding unlabeled data. We construct a dataset of 703k examples to train a powerful data synthesis model.
arXiv Detail & Related papers (2025-07-24T17:03:27Z) - Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems [8.816332263275305]
Traditional Data+AI systems rely heavily on human experts to orchestrate system pipelines. Existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. We propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems.
arXiv Detail & Related papers (2025-07-02T11:04:49Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie [5.259526087073711]
We introduce a system designed to decouple compute from data management by leveraging a cloud runtime alongside Nessie.
We demonstrate its ability to offer time-travel and branching semantics on top of object storage, and to reproduce full pipelines with a few CLI commands.
arXiv Detail & Related papers (2024-04-21T14:53:33Z) - In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes [0.0]
We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon.
In this paper, we expand on the initial concept by introducing a cost model for evaluating these patterns.
We evaluate the performance of Cylon on the ORNL Summit supercomputer.
arXiv Detail & Related papers (2023-07-03T23:11:03Z) - KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science [4.120803087965204]
This paper presents a scalable platform, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections.
Based on this information, KGLiDS enables various downstream applications, such as data discovery and pipeline automation.
arXiv Detail & Related papers (2023-03-03T20:31:04Z) - FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify bias present in the training data, but that by removing these bias-inducing samples, GANs focus more on real, informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.