Execution-based Evaluation for Data Science Code Generation Models
- URL: http://arxiv.org/abs/2211.09374v1
- Date: Thu, 17 Nov 2022 07:04:11 GMT
- Title: Execution-based Evaluation for Data Science Code Generation Models
- Authors: Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui,
Jeevana Priya Inala, Colin Clement, Nan Duan, Jianfeng Gao
- Abstract summary: We introduce ExeDS, a dataset for execution-based evaluation of data science code generation models.
ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and desired execution output.
We evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores.
- Score: 97.96608263010913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation models can benefit data scientists' productivity by
automatically generating code from context and text descriptions. An important
measure of the modeling progress is whether a model can generate code that can
correctly execute to solve the task. However, due to the lack of an evaluation
dataset that directly supports execution-based model evaluation, existing work
relies on code surface form similarity metrics (e.g., BLEU, CodeBLEU) for model
selection, which can be inaccurate.
To remedy this, we introduce ExeDS, an evaluation dataset for execution
evaluation for data science code generation tasks. ExeDS contains a set of 534
problems from Jupyter Notebooks, each consisting of code context, task
description, reference program, and the desired execution output. With ExeDS,
we evaluate the execution performance of five state-of-the-art code generation
models that have achieved high surface-form evaluation scores. Our experiments
show that models with high surface-form scores do not necessarily perform well
on execution metrics, and execution-based metrics can better capture model code
generation errors. Source code and data can be found at
https://github.com/Jun-jie-Huang/ExeDS
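The contrast the abstract draws between surface-form and execution-based scoring can be illustrated with a minimal sketch. It is not the ExeDS harness: the helper names (`run_and_capture`, `execution_match`, `surface_similarity`) and the toy problem are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for a surface metric such as BLEU or CodeBLEU. It also assumes that the desired output can be compared as printed text, and that candidates are trusted enough to run with `exec`; real evaluation of model-generated code would need a sandbox.

```python
import contextlib
import difflib
import io


def run_and_capture(context: str, program: str) -> str:
    """Run the code context, then the program, and return what was printed."""
    namespace = {}
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(context, namespace)   # assumption: code is trusted (no sandbox here)
        exec(program, namespace)
    return buffer.getvalue().strip()


def execution_match(context: str, candidate: str, desired_output: str) -> bool:
    """Execution-based check: the candidate must reproduce the desired output."""
    try:
        return run_and_capture(context, candidate) == desired_output.strip()
    except Exception:
        # A candidate that crashes counts as a failure.
        return False


def surface_similarity(candidate: str, reference: str) -> float:
    """Character-level similarity; a stand-in for BLEU-style surface metrics."""
    return difflib.SequenceMatcher(None, candidate, reference).ratio()


# Hypothetical toy problem: context, reference program, desired output.
context = "values = [3, 1, 2]"
reference = "print(sorted(values)[-1])"
desired_output = "3"

near_copy = "print(sorted(values)[1])"  # looks like the reference, wrong result
rewrite = "print(max(values))"          # looks different, correct result

for name, candidate in [("near_copy", near_copy), ("rewrite", rewrite)]:
    print(
        name,
        round(surface_similarity(candidate, reference), 2),
        execution_match(context, candidate, desired_output),
    )
```

In this toy case the near copy scores higher on surface similarity but fails the execution check, while the rewrite scores lower yet produces the desired output, which is the kind of disagreement the abstract describes.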
Related papers
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- GenX: Mastering Code and Test Generation with Execution Feedback [7.225594526057816]
We propose a novel approach that concurrently trains a code generation model and a test generation model.
We introduce two strategies for test and code data augmentation and a new scoring function for code and test ranking.
The results demonstrate that our models, when iteratively trained with an increasing number of test cases and code solutions, outperform those trained on the original dataset.
arXiv Detail & Related papers (2024-12-18T03:18:21Z)
- Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists [41.94295877935867]
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science.
We demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs.
arXiv Detail & Related papers (2024-10-30T17:59:01Z)
- RepoMasterEval: Evaluating Code Completion via Real-World Repositories [12.176098357240095]
RepoMasterEval is a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories.
To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases.
Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark.
arXiv Detail & Related papers (2024-08-07T03:06:57Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.