The Program Testing Ability of Large Language Models for Code
- URL: http://arxiv.org/abs/2310.05727v1
- Date: Mon, 9 Oct 2023 13:55:45 GMT
- Title: The Program Testing Ability of Large Language Models for Code
- Authors: Weimin Xiong, Yiwen Guo, Hao Chen
- Abstract summary: Large language models (LLMs) for code like CodeX and CodeT5+ demonstrate tremendous promise in achieving code intelligence.
We show a series of intriguing properties of these models and demonstrate how the program testing ability of LLMs can be improved.
- Score: 27.590499335039972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent development of large language models (LLMs) for code, such as CodeX and
CodeT5+, demonstrates tremendous promise in achieving code intelligence. Their
ability to synthesize code that completes a program for a pre-defined task has
been intensively tested and verified on benchmark datasets including HumanEval
and MBPP. Yet, evaluating these LLMs from perspectives beyond program synthesis
is also desirable, given their broad range of applications in software
engineering. In this paper, we explore the ability of LLMs to test
programs/code. Through thorough analyses of recent code LLMs in program
testing, we show a series of intriguing properties of these models and
demonstrate how their program testing ability can be improved. Following recent
work that uses generated test cases to enhance program synthesis, we further
leverage our findings to improve the quality of the synthesized programs, showing
+11.77% and +4.22% higher code pass rates on HumanEval+ compared with the
GPT-3.5-turbo baseline and the recent state of the art, respectively.
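The reranking step mentioned above, selecting among synthesized programs with model-generated test cases, can be illustrated by a minimal sketch. The code below is an assumption-laden illustration of that general idea, not the paper's implementation; `passes` executes tests in-process for brevity, whereas a real pipeline would sandbox execution, and the function names are hypothetical.

```python
# Minimal sketch (not the paper's implementation): rerank candidate programs
# by how many model-generated test cases they pass.
from typing import List


def passes(program: str, test: str) -> bool:
    """Run one generated assert-style test against a candidate program."""
    scope: dict = {}
    try:
        exec(program, scope)   # define the candidate function(s)
        exec(test, scope)      # e.g. "assert add(1, 2) == 3"
        return True
    except Exception:
        return False


def rerank_by_generated_tests(programs: List[str], tests: List[str]) -> List[str]:
    """Order candidate programs by the number of generated tests they pass."""
    scored = sorted(
        ((sum(passes(p, t) for t in tests), p) for p in programs),
        key=lambda item: item[0],
        reverse=True,
    )
    return [p for _, p in scored]
```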
Related papers
- Precision or Peril: Evaluating Code Quality from Quantized Large Language Models [0.5249805590164902]
Quantization has emerged as a way to mitigate the memory overhead of Large Language Models.
This study aims to evaluate the current code generation capabilities of smaller LLMs using various metrics.
arXiv Detail & Related papers (2024-11-16T01:31:29Z) - DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code [34.03774442237902]
Large Language Models applied to code-related applications have emerged as a prominent field.
Existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities.
We propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code.
arXiv Detail & Related papers (2024-03-12T17:58:04Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large
Language Models for Program Testing [27.45301385265713]
We present UniTSyn, a large-scale dataset capable of enhancing the prowess of LLMs for unit test synthesis.
By leveraging the Language Server Protocol, UniTSyn achieves the challenging goal of collecting focal-test pairs without per-project execution setups or per-language setups.
Experiments demonstrate that, by building an autoregressive model based on UniTSyn, we can achieve significant benefits in learning and understanding unit test representations.
arXiv Detail & Related papers (2024-02-04T22:48:05Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on code that actually ran, masking unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
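As a rough illustration of the fine-grained optimization idea described above (masking unexecuted code segments when updating the model), here is a hedged sketch. The tensor shapes, the per-line execution mask, and the simple policy-gradient-style loss are illustrative assumptions, not StepCoder's actual training code.

```python
# Hypothetical sketch: restrict the training signal to code that executed.
import torch


def masked_code_loss(
    token_logprobs: torch.Tensor,   # (seq_len,) log-probs of sampled code tokens
    advantages: torch.Tensor,       # (seq_len,) per-token RL advantages
    token_line_ids: torch.Tensor,   # (seq_len,) source-line index of each token
    executed_lines: torch.Tensor,   # (num_lines,) bool, True if the line executed
) -> torch.Tensor:
    """Policy-gradient-style loss computed only on executed code segments."""
    token_mask = executed_lines[token_line_ids].float()       # 1.0 for executed tokens
    per_token = -(advantages * token_logprobs) * token_mask   # zero out unexecuted code
    return per_token.sum() / token_mask.sum().clamp(min=1.0)
```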
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - LLM4TDD: Best Practices for Test Driven Development Using Large Language
Models [0.76146285961466]
This paper explores the concept of LLM4TDD, where we guide Large Language Models to generate code iteratively using a test-driven development methodology.
We conduct an empirical evaluation using ChatGPT and coding problems from LeetCode to investigate the impact of different test, prompt and problem attributes on the efficacy of LLM4TDD.
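A minimal sketch of the test-driven generation loop described above: prompt the model with the problem and its tests, run the tests, and feed failures back until they pass. The helper names `ask_llm` and `run_tests` are hypothetical placeholders, not the paper's tooling.

```python
# Hypothetical sketch of an LLM-driven test-driven-development loop.
from typing import Callable, List, Tuple


def tdd_loop(
    problem: str,
    tests: List[str],
    ask_llm: Callable[[str], str],                                  # prompt -> code
    run_tests: Callable[[str, List[str]], List[Tuple[str, str]]],   # -> failing (test, error)
    max_rounds: int = 5,
) -> str:
    """Iteratively regenerate code until the given tests pass or rounds run out."""
    prompt = f"Write a Python solution for:\n{problem}\nIt must pass these tests:\n" + "\n".join(tests)
    code = ask_llm(prompt)
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            break
        feedback = "\n".join(f"{test} failed: {error}" for test, error in failures)
        code = ask_llm(prompt + f"\nPrevious attempt:\n{code}\nFailures:\n{feedback}\nFix the code.")
    return code
```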
arXiv Detail & Related papers (2023-12-07T20:37:54Z) - CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-source pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results across the evaluated benchmarks.
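A hedged sketch in the spirit of this approach: rerank sampled programs by combining the generator's score with a learned verifier's probability computed over the question, the program, and its execution result. The `execute` and `verifier_prob` callables are illustrative stand-ins, not LEVER's actual components.

```python
# Hypothetical sketch of execution-guided verification for reranking.
import math
from typing import Callable, List, Tuple


def rerank_with_verifier(
    question: str,
    programs: List[Tuple[str, float]],               # (program, generator log-prob)
    execute: Callable[[str], str],                   # runs a program, returns its result
    verifier_prob: Callable[[str, str, str], float], # P(correct | question, program, result)
) -> List[str]:
    """Order candidates by generator log-prob plus log verifier probability."""
    scored = []
    for program, gen_logprob in programs:
        result = execute(program)
        p_ok = max(verifier_prob(question, program, result), 1e-9)  # avoid log(0)
        scored.append((gen_logprob + math.log(p_ok), program))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [program for _, program in scored]
```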
arXiv Detail & Related papers (2023-02-16T18:23:22Z)