Related papers: Benchmarking AI Performance on End-to-End Data Science Projects

Benchmarking AI Performance on End-to-End Data Science Projects

URL: http://arxiv.org/abs/2602.14284v1
Date: Sun, 15 Feb 2026 19:16:04 GMT
Title: Benchmarking AI Performance on End-to-End Data Science Projects
Authors: Evelyn Hughes, Rohan Alexander,
Abstract summary: We create a benchmark of 40 end-to-end data science projects with associated rubric evaluations.<n>We use these to build an automated grading pipeline that systematically evaluates the data science projects produced by generative AI models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data science is an integrated workflow of technical, analytical, communication, and ethical skills, but current AI benchmarks focus mostly on constituent parts. We test whether AI models can generate end-to-end data science projects. To do this we create a benchmark of 40 end-to-end data science projects with associated rubric evaluations. We use these to build an automated grading pipeline that systematically evaluates the data science projects produced by generative AI models. We find the extent to which generative AI models can complete end-to-end data science projects varies considerably by model. Most recent models did well on structured tasks, but there were considerable differences on tasks that needed judgment. These findings suggest that while AI models could approximate entry-level data scientists on routine tasks, they require verification.

Related papers

AI, Humans, and Data Science: Optimizing Roles Across Workflows and the Workforce [0.0]
We consider the potential and limitation of analytic, generative, and agentic AI to augment data scientists or take on tasks traditionally done by human analysts and researchers.<n>Just as earlier eras of survey analysis created issues when the increased ease of using statistical software allowed researchers to conduct analyses they did not fully understand, the new AI tools may create similar but larger risks.
arXiv Detail & Related papers (2025-07-15T17:59:06Z)
Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.<n>This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.<n>Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG)
arXiv Detail & Related papers (2024-09-12T02:08:00Z)
Survey and Taxonomy: The Role of Data-Centric AI in Transformer-Based Time Series Forecasting [36.31269406067809]
We argue that data-centric AI is essential for training AI models, particularly for transformer-based TSF models efficiently. We review the previous research works from a data-centric AI perspective and we intend to lay the foundation work for the future development of transformer-based architecture and data-centric AI.
arXiv Detail & Related papers (2024-07-29T08:27:21Z)
What About the Data? A Mapping Study on Data Engineering for AI Systems [0.0]
There is a growing need for data engineers that know how to prepare data for AI systems. We found 25 relevant papers between January 2019 and June 2023, explaining AI data engineering activities. This paper creates an overview of the body of knowledge on data engineering for AI.
arXiv Detail & Related papers (2024-02-07T16:31:58Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
AI-Generated Images as Data Source: The Dawn of Synthetic Era [61.879821573066216]
generative AI has unlocked the potential to create synthetic images that closely resemble real-world photographs. This paper explores the innovative concept of harnessing these AI-generated images as new data sources. In contrast to real data, AI-generated data exhibit remarkable advantages, including unmatched abundance and scalability.
arXiv Detail & Related papers (2023-10-03T06:55:19Z)
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs [71.7495056818522]
We introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. We will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.
arXiv Detail & Related papers (2023-03-29T03:30:38Z)
Data-centric Artificial Intelligence: A Survey [47.24049907785989]
Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle.
arXiv Detail & Related papers (2023-03-17T17:44:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.