FixEval: Execution-based Evaluation of Program Fixes for Programming
Problems
- URL: http://arxiv.org/abs/2206.07796v4
- Date: Thu, 30 Mar 2023 14:30:46 GMT
- Title: FixEval: Execution-based Evaluation of Program Fixes for Programming
Problems
- Authors: Md Mahim Anjum Haque and Wasi Uddin Ahmad and Ismini Lourentzou and
Chris Brown
- Abstract summary: We introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes.
FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes.
Our experiments show that match-based metrics do not accurately reflect the correctness of model-generated program fixes.
- Score: 23.987104440395576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The complexity of modern software has led to a drastic increase in the time
and cost associated with detecting and rectifying software bugs. In response,
researchers have explored various methods to automatically generate fixes for
buggy code. However, due to the large combinatorial space of possible fixes for
any given bug, few tools and datasets are available to evaluate model-generated
fixes effectively. To address this issue, we introduce FixEval, a benchmark
comprising buggy code submissions to competitive programming problems and
their corresponding fixes. FixEval offers an extensive collection of unit
tests to evaluate the correctness of model-generated program fixes, along
with additional information on time limits, memory constraints, and
acceptance verdicts. We consider two Transformer language models pretrained
on programming languages as our baselines and compare them using match-based
and execution-based evaluation metrics. Our experiments show that match-based
metrics do not accurately reflect the correctness of model-generated program
fixes, whereas execution-based methods evaluate each program against the full
set of test cases and scenarios designed for that problem. We therefore
believe FixEval provides a step towards real-world automatic bug fixing and
model-generated code evaluation. The dataset and models are open-sourced at
https://github.com/mahimanzum/FixEval.
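As a rough illustration of the execution-based evaluation described above, the sketch below runs a candidate fix against a problem's unit tests under a per-test time limit and maps the outcome to a competitive-programming style verdict. The `judge` helper, file layout, and limits are assumptions made for this example, not FixEval's actual harness; memory limits are omitted for brevity.

```python
import subprocess

# Hypothetical layout: each unit test is a (stdin, expected stdout) pair of
# strings. FixEval's real harness differs; this only illustrates how an
# execution-based verdict can be produced for a candidate fix.
def judge(candidate_file: str, tests: list[tuple[str, str]],
          time_limit: float = 2.0) -> str:
    """Run a candidate fix on every unit test and return a single verdict."""
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                ["python", candidate_file],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,  # per-test time limit (memory limit omitted)
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded"
        if proc.returncode != 0:
            return "Runtime Error"
        if proc.stdout.strip() != expected.strip():
            return "Wrong Answer"
    return "Accepted"  # every unit test passed

if __name__ == "__main__":
    # Toy "add two numbers" problem with two unit tests.
    tests = [("1 2\n", "3"), ("10 -4\n", "6")]
    print(judge("candidate_fix.py", tests))
```

An execution-based score, such as the fraction of problems whose generated fix is judged Accepted, can then be aggregated over the benchmark.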
Related papers
- RepoMasterEval: Evaluating Code Completion via Real-World Repositories [12.176098357240095]
RepoMasterEval is a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories.
To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases.
Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark.
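A toy sketch of the mutation-testing idea mentioned in the summary: perturb the code under test and check whether the test suite notices; mutants that survive indicate weak tests. The mutation operator and functions below are invented for illustration and are not RepoMasterEval's implementation.

```python
# Toy illustration of mutation testing: mutate the code under test and check
# whether the test suite notices. Surviving mutants signal weak test cases.
# (All names and the mutation rule here are invented for the example.)

ORIGINAL_SRC = "def is_even(n):\n    return n % 2 == 0\n"

def mutate(src: str) -> str:
    """A single toy mutation operator: relax '== 0' into '>= 0'."""
    return src.replace("== 0", ">= 0")

def weak_test_suite(fn) -> bool:
    """A deliberately weak suite: it never exercises an odd input."""
    return fn(2) is True

def run_mutation_test() -> None:
    namespace: dict = {}
    exec(mutate(ORIGINAL_SRC), namespace)  # build the mutant function
    mutant = namespace["is_even"]
    survived = weak_test_suite(mutant)     # True -> the suite missed the bug
    print("mutant survived:", survived)    # prints True: the tests are too weak

if __name__ == "__main__":
    run_mutation_test()
```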
arXiv Detail & Related papers (2024-08-07T03:06:57Z)
- Benchmarking Educational Program Repair [4.981275578987307]
Large language models (LLMs) can be used to generate learning resources, improve error messages, and provide feedback on code.
There is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches.
In this article, we propose a novel educational program repair benchmark.
arXiv Detail & Related papers (2024-05-08T18:23:59Z)
- A Deep Dive into Large Language Models for Automated Bug Localization and Repair [12.756202755547024]
Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR).
In this study, we take a deep dive into automated bug fixing utilizing LLMs.
The proposed approach, Toggle, separates bug localization and fixing across different LLMs, which enables effective integration of diverse contextual information.
Toggle achieves new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark.
arXiv Detail & Related papers (2024-04-17T17:48:18Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework that explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
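A schematic sketch of retrieval-augmented patch generation as summarized above: retrieve the most similar previous bug-fix pair and prepend it to the generator's prompt. The token-overlap retriever and prompt format below are placeholder assumptions, not RAP-Gen's hybrid retriever or its CodeT5 setup.

```python
# Sketch of retrieval-augmented patch generation: fetch the most similar past
# bug-fix pair and condition the generator on it. The token-overlap retriever
# and prompt format are stand-ins, not RAP-Gen's hybrid retriever or CodeT5.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens, as a stand-in retriever score."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve(buggy_code: str, codebank: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (bug, fix) pair whose bug is most similar to the query."""
    return max(codebank, key=lambda pair: similarity(buggy_code, pair[0]))

def build_prompt(buggy_code: str, codebank: list[tuple[str, str]]) -> str:
    bug, fix = retrieve(buggy_code, codebank)
    return (
        "### Similar past bug:\n" + bug +
        "\n### Its fix:\n" + fix +
        "\n### Bug to repair:\n" + buggy_code +
        "\n### Fix:\n"
    )

if __name__ == "__main__":
    codebank = [
        ("if x = 0: return 1", "if x == 0: return 1"),
        ("for i in range(len(a): print(a[i])", "for i in range(len(a)): print(a[i])"),
    ]
    # The resulting prompt would be fed to a seq2seq patch generator.
    print(build_prompt("if y = 1: y += 1", codebank))
```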
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
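The AST-based idea can be approximated with Python's standard ast module: parse each generated completion and record whether it raises a syntax error, without executing anything. The paper's framework additionally categorizes finer-grained error types; the sketch below covers only the parse step.

```python
import ast

def parse_error_rate(completions: list[str]) -> float:
    """Fraction of completions that fail to parse into an AST (no execution)."""
    failures = 0
    for code in completions:
        try:
            ast.parse(code)  # static check only; the code never runs
        except SyntaxError:
            failures += 1
    return failures / max(len(completions), 1)

if __name__ == "__main__":
    samples = [
        "def add(a, b):\n    return a + b\n",   # parses fine
        "def add(a, b)\n    return a + b\n",    # missing colon -> SyntaxError
    ]
    print(f"parse-error rate: {parse_error_rate(samples):.2f}")  # 0.50
```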
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Fully Autonomous Programming with Large Language Models [0.9558392439655015]
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": generated programs closely resemble a correct solution but differ in small imperfections.
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
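A minimal skeleton of the synthesize-test-repair loop this line of work describes: generate a program, run it on the problem's tests, and on failure ask the model to repair it with the failure message as feedback. The `llm` and `run_tests` callables are placeholders, not the paper's actual framework.

```python
from typing import Callable, Tuple

# Skeleton of a generate -> test -> repair loop. `llm` stands in for any code
# model (e.g., Codex behind an API); `run_tests` should return a pair
# (passed, failure_message). Both are placeholders, not the paper's framework.

def synthesize_with_repair(
    problem: str,
    llm: Callable[[str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_repairs: int = 3,
) -> str:
    program = llm(f"Solve this programming problem:\n{problem}")
    for _ in range(max_repairs):
        passed, failure = run_tests(program)
        if passed:
            return program  # all tests pass: done
        # Feed the failing test output back to the model as repair context.
        program = llm(
            f"Problem:\n{problem}\n\nProgram:\n{program}\n\n"
            f"It fails with:\n{failure}\nReturn a corrected program."
        )
    return program  # best effort after the repair budget is spent
```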
arXiv Detail & Related papers (2023-04-20T16:12:05Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
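A hedged sketch of the self-debugging idea: prompt the model with a few worked debugging demonstrations, have it explain its own prediction, and then emit a revision. The demonstration text and prompt format below are invented for illustration and are not taken from the paper.

```python
from typing import Callable

# Sketch of self-debugging via few-shot prompting: the model explains its own
# prediction and then revises it. FEW_SHOT is a made-up demonstration, and
# `llm` is a placeholder model call; neither is taken from the paper.

FEW_SHOT = (
    "Task: sum a list\n"
    "Code: s = 0\nfor x in xs: s -= x\n"
    "Explanation: the loop subtracts instead of adding, so the result is wrong.\n"
    "Fixed code: s = 0\nfor x in xs: s += x\n\n"
)

def self_debug(task: str, prediction: str, llm: Callable[[str], str]) -> str:
    """One self-debugging round: explain the prediction, then revise it."""
    prompt = FEW_SHOT + f"Task: {task}\nCode: {prediction}\nExplanation:"
    critique = llm(prompt)                    # the model critiques its own code
    revision = llm(prompt + critique + "\nFixed code:")
    return revision
```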
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)