Execution-based Evaluation for Data Science Code Generation Models
- URL: http://arxiv.org/abs/2211.09374v1
- Date: Thu, 17 Nov 2022 07:04:11 GMT
- Title: Execution-based Evaluation for Data Science Code Generation Models
- Authors: Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui,
Jeevana Priya Inala, Colin Clement, Nan Duan, Jianfeng Gao
- Abstract summary: We introduce ExeDS, an evaluation dataset for execution-based evaluation of data science code generation tasks.
ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and desired execution output.
We evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores.
- Score: 97.96608263010913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation models can benefit data scientists' productivity by
automatically generating code from context and text descriptions. An important
measure of the modeling progress is whether a model can generate code that can
correctly execute to solve the task. However, due to the lack of an evaluation
dataset that directly supports execution-based model evaluation, existing work
relies on code surface form similarity metrics (e.g., BLEU, CodeBLEU) for model
selection, which can be inaccurate.
To remedy this, we introduce ExeDS, an evaluation dataset for execution-based
evaluation of data science code generation tasks. ExeDS contains a set of 534
problems from Jupyter Notebooks, each consisting of code context, task
description, reference program, and the desired execution output. With ExeDS,
we evaluate the execution performance of five state-of-the-art code generation
models that have achieved high surface-form evaluation scores. Our experiments
show that models with high surface-form scores do not necessarily perform well
on execution metrics, and execution-based metrics can better capture model code
generation errors. Source code and data can be found at
https://github.com/Jun-jie-Huang/ExeDS
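To make the contrast concrete, execution-based evaluation can be sketched in a few lines of Python: run the generated code after its code context, capture the output, and compare it with the desired execution output instead of comparing token overlap. The snippet below is a minimal illustration only; the helper names and the field names (code_context, desired_output) are assumptions for this sketch, not the ExeDS evaluation harness.

```python
import contextlib
import io

def execute(program: str) -> str:
    """Run a candidate program and capture its stdout.
    Minimal sketch: no sandboxing, timeouts, or notebook state handling."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {"__name__": "__main__"})  # fresh, isolated globals
    except Exception as error:
        return f"<error: {error}>"
    return buffer.getvalue().strip()

def execution_accuracy(problems: list[dict], generations: list[str]) -> float:
    """Fraction of problems whose generated code, run after its code context,
    produces the desired execution output (field names are assumed here)."""
    correct = 0
    for problem, generated_code in zip(problems, generations):
        output = execute(problem["code_context"] + "\n" + generated_code)
        correct += output == problem["desired_output"].strip()
    return correct / len(problems)

# Toy example: the surface form of the generation may differ from a reference
# solution, but the executed output matches the desired output.
problem = {"code_context": "xs = [3, 1, 2]", "desired_output": "[1, 2, 3]"}
print(execution_accuracy([problem], ["print(sorted(xs))"]))  # 1.0
```

In the toy example, print(sorted(xs)) counts as correct even though its surface form differs from a reference such as xs.sort(); print(xs), which is exactly the kind of case where surface-form metrics like BLEU can score a correct generation poorly.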
Related papers
- Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists [41.94295877935867]
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science.
We demonstrate that the proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs.
arXiv Detail & Related papers (2024-10-30T17:59:01Z)
- DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models [36.266383541354294]
This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks.
Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks.
Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers.
arXiv Detail & Related papers (2024-10-09T18:00:05Z)
- RepoMasterEval: Evaluating Code Completion via Real-World Repositories [12.176098357240095]
RepoMasterEval is a novel benchmark, constructed from real-world Python and TypeScript repositories, for evaluating code completion models.
To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases.
Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark.
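As a rough illustration of the mutation-testing idea mentioned above (not RepoMasterEval's implementation), a test suite can be scored by how many small, deliberate code mutations it detects; the function and test names below are hypothetical.

```python
# Illustrative mutation-testing sketch: score a test suite by how many
# deliberately mutated variants of a function it detects ("kills").

def add(a, b):
    return a + b

def test_suite(fn) -> bool:
    """Return True if fn passes all test cases."""
    return fn(2, 3) == 5 and fn(-1, 1) == 0

assert test_suite(add)  # the original implementation passes

mutants = [
    lambda a, b: a - b,      # mutated operator
    lambda a, b: a + b + 1,  # off-by-one mutation
]

killed = sum(not test_suite(mutant) for mutant in mutants)
print(f"mutation score: {killed}/{len(mutants)}")  # higher = more effective tests
```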
arXiv Detail & Related papers (2024-08-07T03:06:57Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees.
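As a hedged illustration of the general idea rather than that paper's framework, Python's standard ast module can already surface some static errors, such as syntax errors or names that are never defined, without executing the completion; the helper below is an assumption-laden sketch.

```python
import ast
import builtins

def static_errors(completion: str) -> list[str]:
    """Flag simple static errors in a code completion without running it.
    Illustrative sketch only: a real framework would also track scopes,
    imports, attributes, and names defined by the surrounding code context."""
    try:
        tree = ast.parse(completion)
    except SyntaxError as error:
        return [f"syntax error: {error.msg} (line {error.lineno})"]
    stored = {n.id for n in ast.walk(tree)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    loaded = {n.id for n in ast.walk(tree)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    undefined = loaded - stored - set(dir(builtins))
    return [f"possibly undefined name: {name}" for name in sorted(undefined)]

print(static_errors("result = datd.mean("))   # unclosed call -> syntax error
print(static_errors("result = datd.mean()"))  # 'datd' is never defined
```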
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
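For intuition, here is a small sketch in the spirit of such semantics-preserving perturbations (not ReCode's actual transformation set): renaming a variable changes the surface form of a prompt while leaving its behavior unchanged. The rename_variable helper and the example snippet are hypothetical.

```python
import ast

def rename_variable(source: str, old: str, new: str) -> str:
    """Rename one variable throughout a snippet; the semantics are unchanged.
    A sketch in the spirit of code-perturbation benchmarks, not ReCode itself."""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node: ast.Name) -> ast.Name:
            if node.id == old:
                node.id = new
            return node
    return ast.unparse(Renamer().visit(ast.parse(source)))

original = "total = price * qty\nprint(total)"
print(rename_variable(original, "total", "order_total"))
# order_total = price * qty
# print(order_total)
```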
arXiv Detail & Related papers (2022-12-20T14:11:31Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate, from the relation matrix, a symbolic and ordered sequence that is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- Incorporating Domain Knowledge through Task Augmentation for Front-End JavaScript Code Generation [10.75138604869187]
In some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available paired data.
We propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model.
Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset.
arXiv Detail & Related papers (2022-08-22T06:57:51Z)