Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of
Large Language Models for Code Generation
- URL: http://arxiv.org/abs/2305.01210v3
- Date: Mon, 30 Oct 2023 19:37:09 GMT
- Title: Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of
Large Language Models for Code Generation
- Authors: Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang
- Abstract summary: We propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.
EvalPlus augments a given evaluation dataset with a large number of test-cases newly produced by an automatic test input generator.
We show that HumanEval+ is able to catch significant amounts of previously undetected wrong code.
- Score: 20.45045253933097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Program synthesis has been long studied with recent approaches focused on
directly using the power of Large Language Models (LLMs) to generate code.
Programming benchmarks, with curated synthesis problems and test-cases, are
used to measure the performance of various LLMs on code synthesis. However,
these test-cases can be limited in both quantity and quality for fully
assessing the functional correctness of the generated code. Such limitation in
the existing benchmarks begs the following question: In the era of LLMs, is the
code generated really correct? To answer this, we propose EvalPlus -- a code
synthesis evaluation framework to rigorously benchmark the functional
correctness of LLM-synthesized code. EvalPlus augments a given evaluation
dataset with a large number of test-cases newly produced by an automatic test
input generator, powered by both LLM- and mutation-based strategies. While
EvalPlus is general, we extend the test-cases of the popular HumanEval
benchmark by 80x to build HumanEval+. Our extensive evaluation across 26
popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to
catch significant amounts of previously undetected wrong code synthesized by
LLMs, reducing pass@k by up to 19.3-28.9%. We also find, surprisingly, that
test insufficiency can lead to mis-ranking. For example, both
WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+,
while none of them could on HumanEval. Our work not only indicates that prior
popular code synthesis evaluation results do not accurately reflect the true
performance of LLMs for code synthesis, but also opens up a new direction to
improve such programming benchmarks through automated testing. We have
open-sourced our tools, enhanced datasets as well as all LLM-generated code at
https://github.com/evalplus/evalplus to facilitate and accelerate future
LLM-for-code research.
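The abstract's core idea (automatically generated inputs plus differential checking against the ground-truth solution) can be made concrete with a small sketch. The following Python is illustrative only and is not the EvalPlus implementation; helper names such as `mutate` and `augment_and_check` are hypothetical, and the mutation rules are simplistic stand-ins for the LLM- and mutation-based strategies the abstract mentions.
```python
# Illustrative sketch: augment seed inputs by mutation, then differentially
# check an LLM-written candidate against the ground-truth solution.
import copy
import random

def mutate(args):
    """Type-aware mutation of one argument tuple: tweak ints, shuffle/truncate lists, reverse strings."""
    args = copy.deepcopy(list(args))
    i = random.randrange(len(args))
    v = args[i]
    if isinstance(v, int):
        args[i] = v + random.choice([-1, 1, random.randint(-100, 100)])
    elif isinstance(v, list) and v:
        random.shuffle(args[i])
        args[i] = args[i][: random.randint(0, len(args[i]))]
    elif isinstance(v, str):
        args[i] = v[::-1]
    return tuple(args)

def augment_and_check(ground_truth, candidate, seed_inputs, n_new=80):
    """Differentially test `candidate` against `ground_truth` on seed and mutated inputs."""
    pool = list(seed_inputs)
    for _ in range(n_new):
        pool.append(mutate(random.choice(pool)))
    for args in pool:
        try:
            expected = ground_truth(*args)
        except Exception:
            continue                      # mutated input is invalid for the problem; skip it
        try:
            if candidate(*args) != expected:
                return False, args        # counterexample: outputs diverge
        except Exception:
            return False, args            # candidate crashes on a valid input
    return True, None

if __name__ == "__main__":
    gt = lambda a, b: a // b              # reference solution: floor division
    cand = lambda a, b: int(a / b)        # plausible LLM bug: truncates toward zero, wrong for negatives
    ok, cex = augment_and_check(gt, cand, seed_inputs=[(7, 2), (10, 3)])
    print("passes augmented tests:", ok, "counterexample:", cex)
```
In this toy example the seed tests (positive operands) pass for both implementations; mutated inputs with negative operands are what expose the divergence, mirroring how HumanEval+ catches code that the original HumanEval tests miss.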
Related papers
- Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering [18.766132076075365]
Large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation.
The pass@k metric requires extensive unit tests and configured environments, and is not well suited to evaluating LLM-generated text (a minimal pass@k estimator is sketched after this list).
Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny.
arXiv Detail & Related papers (2025-02-10T06:49:29Z) - Evaluating and Aligning CodeLLMs on Human Preference [42.26173776584043]
We present a rigorous human-curated benchmark CodeArena to emulate the complexity and diversity of real-world coding tasks.
We also propose a diverse synthetic instruction corpus SynCode-Instruct to verify the effectiveness of the large-scale synthetic instruction fine-tuning.
The results reveal performance differences between execution-based benchmarks and CodeArena.
arXiv Detail & Related papers (2024-12-06T17:40:38Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol in which multiple LLMs each independently generate an evaluation on separate criteria, and the evaluations are then combined via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single-LLM evaluation protocol on the LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [39.13045037676502]
Development of large language models (LLMs) has significantly pushed the frontiers of program synthesis.
Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations.
We develop ENAMEL, a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code.
arXiv Detail & Related papers (2024-06-10T04:19:20Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants.
Our results demonstrate a significant improvement over existing SOTA synthetic content detectors.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking long-sequence code generation into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only by masking the unexecuted code segments, providing Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle
Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems.
We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness.
Experiments show that, when equipped with ALGO, we achieve an 8x better one-submission pass rate than the Codex model and a 2.6x better one-submission pass rate than CodeT.
arXiv Detail & Related papers (2023-05-24T00:10:15Z)
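Several entries above, as well as the abstract, report results in terms of pass@k. For reference, below is a minimal sketch of the standard unbiased pass@k estimator used in HumanEval-style evaluation; this is general background from the Codex evaluation methodology, not code from any of the papers listed.
```python
# Standard unbiased pass@k estimator: with n samples per problem and c of them
# passing all tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass all tests.
print(pass_at_k(200, 37, 1))    # approx. 0.185
print(pass_at_k(200, 37, 10))   # roughly 0.88
```
Because the estimator counts a sample as correct only if it passes all tests, adding tests (as HumanEval+ does) can only lower a model's measured pass@k, which is why stronger test suites expose previously undetected wrong code.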