Related papers: DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

URL: http://arxiv.org/abs/2410.07331v2
Date: Fri, 11 Oct 2024 00:53:05 GMT
Title: DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Authors: Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu,
Abstract summary: This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, to perform intricate data processing and derive the answers.
Score: 36.266383541354294
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages, to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.

Related papers

DataSciBench: An LLM Agent Benchmark for Data Science [33.3811507234528]
DataSciBench is a benchmark for evaluating Large Language Model (LLM) capabilities in data science. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. We propose an innovative Task - Function - Code framework to assess each code execution outcome.
arXiv Detail & Related papers (2025-02-19T17:31:51Z)
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
Evaluation of Code LLMs on Geospatial Code Generation [1.6834474847800562]
Large Language Models (LLMs) can generate Python code for data science and machine learning applications. Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks. Our dataset will hopefully contribute to the development new models capable of solving geospatial coding tasks with high accuracy.
arXiv Detail & Related papers (2024-10-06T20:34:03Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG)
arXiv Detail & Related papers (2024-09-12T02:08:00Z)
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data [26.836532205017104]
We find that many datasets suffer from severe data leakage. This discovery reveals a new challenge: identifying which dataset genuinely qualify as high-quality code instruction data. We present XCoder, a family of models finetuned from LLaMA3.
arXiv Detail & Related papers (2024-09-05T17:46:30Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
Empirical Study on Transformer-based Techniques for Software Engineering [12.973997150227198]
We review the existing literature, examine the suitability of model architectures for different tasks, and look at the generalization ability of models on different datasets. We conduct experiments on the top-4 most targeted software engineering tasks that we found in our literature survey: Code Summarization, Bug Fixing, Bug Detection, and Code Search.
arXiv Detail & Related papers (2023-09-30T14:45:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.