Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of
Large Language Models for Code Generation
- URL: http://arxiv.org/abs/2305.01210v3
- Date: Mon, 30 Oct 2023 19:37:09 GMT
- Title: Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of
Large Language Models for Code Generation
- Authors: Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang
- Abstract summary: We propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.
EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator.
We show that HumanEval+ is able to catch significant amounts of previously undetected wrong code.
- Score: 20.45045253933097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Program synthesis has been long studied with recent approaches focused on
directly using the power of Large Language Models (LLMs) to generate code.
Programming benchmarks, with curated synthesis problems and test-cases, are
used to measure the performance of various LLMs on code synthesis. However,
these test-cases can be limited in both quantity and quality for fully
assessing the functional correctness of the generated code. Such limitation in
the existing benchmarks begs the following question: In the era of LLMs, is the
code generated really correct? To answer this, we propose EvalPlus -- a code
synthesis evaluation framework to rigorously benchmark the functional
correctness of LLM-synthesized code. EvalPlus augments a given evaluation
dataset with large amounts of test-cases newly produced by an automatic test
input generator, powered by both LLM- and mutation-based strategies. While
EvalPlus is general, we extend the test-cases of the popular HumanEval
benchmark by 80x to build HumanEval+. Our extensive evaluation across 26
popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to
catch significant amounts of previously undetected wrong code synthesized by
LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that
test insufficiency can lead to mis-ranking. For example, both
WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+,
while none of them could on HumanEval. Our work not only indicates that prior
popular code synthesis evaluation results do not accurately reflect the true
performance of LLMs for code synthesis, but also opens up a new direction to
improve such programming benchmarks through automated testing. We have
open-sourced our tools, enhanced datasets as well as all LLM-generated code at
https://github.com/evalplus/evalplus to facilitate and accelerate future
LLM-for-code research.
Related papers
- AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to $62%$ higher error detection rate and up to $16%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [39.13045037676502]
Development of large language models (LLMs) has significantly pushed the frontiers of program synthesis.
Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations.
We develop ENAMEL, a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code.
arXiv Detail & Related papers (2024-06-10T04:19:20Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - Mercury: A Code Efficiency Benchmark for Code Large Language Models [41.51235610016959]
We present Mercury, the first code efficiency benchmark for Large Language Models for Code (Code LLMs)
It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines.
We introduce a new metric Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously.
arXiv Detail & Related papers (2024-02-12T17:53:22Z) - Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Coarse-Tuning Models of Code with Reinforcement Learning Feedback [0.0]
Large Language Models (LLMs) pre-trained on code have emerged as the dominant approach to program synthesis.
We propose RLCF, that further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of the code.
arXiv Detail & Related papers (2023-05-25T22:09:08Z) - ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle
Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems.
We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness.
Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT.
arXiv Detail & Related papers (2023-05-24T00:10:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.