CodeJudge: Evaluating Code Generation with Large Language Models
- URL: http://arxiv.org/abs/2410.02184v1
- Date: Thu, 3 Oct 2024 03:58:03 GMT
- Title: CodeJudge: Evaluating Code Generation with Large Language Models
- Authors: Weixi Tong, Tianyi Zhang,
- Abstract summary: Large Language Models (LLMs) have shown promising performance in code generation.
How to reliably evaluate code generated by LLMs remains an unresolved problem.
This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases.
- Score: 6.867043179943195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing "slow thinking" to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub https://github.com/VichyTong/CodeJudge.
Related papers
- Steering Large Language Models between Code Execution and Textual Reasoning [22.279107036500083]
Textual reasoning has inherent limitations in solving tasks with challenges in math, logics, optimization, and searching.
The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency of integrating code generation and execution.
We propose three methods to better steer LLM code/text generation and achieve a notable improvement.
arXiv Detail & Related papers (2024-10-04T15:44:47Z) - Beyond Code Generation: Assessing Code LLM Maturity with Postconditions [9.521621889147362]
We propose a code Large Language Model maturity model based on the postcondition generation problem.
We augment the EvalPlus dataset to a postcondition testing benchmark, and evaluate several open-sourced models.
arXiv Detail & Related papers (2024-07-19T08:34:30Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Exploring the Capabilities of LLMs for Code Change Related Tasks [14.261870410238643]
Large language models (LLMs) have shown their effectiveness in code-related tasks.
LLMs focus on general code syntax and semantics rather than the differences between two code versions.
We conduct an empirical study using textgreater 1B parameters LLMs on three code-change-related tasks.
arXiv Detail & Related papers (2024-07-03T05:49:18Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM)
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z) - StarCoder 2 and The Stack v2: The Next Generation [105.93298676368798]
We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens.
We thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.
Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size.
arXiv Detail & Related papers (2024-02-29T13:53:35Z) - Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting exhibits a high-performance boost for multiple LLMs.
Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - Coarse-Tuning Models of Code with Reinforcement Learning Feedback [0.0]
Large Language Models (LLMs) pre-trained on code have emerged as the dominant approach to program synthesis.
We propose RLCF, that further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of the code.
arXiv Detail & Related papers (2023-05-25T22:09:08Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.