Fixing Code Generation Errors for Large Language Models
- URL: http://arxiv.org/abs/2409.00676v1
- Date: Sun, 1 Sep 2024 09:40:15 GMT
- Title: Fixing Code Generation Errors for Large Language Models
- Authors: Hao Wen, Yueheng Zhu, Chao Liu, Xiaoxue Ren, Weiwei Du, Meng Yan,
- Abstract summary: Large Language Models (LLMs) are designed to automatically produce source code for software development.
Code generated by LLMs often fails to pass test cases and requires substantial human effort to fix errors.
We propose a fixing method called LlmFix, which addresses three types of errors through a three-step process.
- Score: 6.137340149146578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code generation leverages artificial intelligence technologies, particularly Large Language Models (LLMs), to automatically produce source code, enhancing software development efficiency and reducing repetitive tasks. However, code generated by LLMs often fails to pass test cases and requires substantial human effort to fix. Previous studies focused on better prompts or on improving LLMs' capabilities but ignored why LLMs fail. In this paper, we first reproduced the results of 14 LLMs, including GPT-3.5-turbo and 13 open-source LLMs, on the HumanEval dataset. We extracted 12,837 code generation errors and conducted an in-depth analysis of their causes, identifying 19 distinct error causes. Our empirical analysis indicated that three of these causes can be fixed directly. Consequently, we proposed a fixing method called LlmFix, which addresses these three types of errors through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. Experimental results demonstrate that LlmFix can fix these three types of errors, significantly improving the performance of the 14 LLMs on the HumanEval and MBPP datasets, with average increases of 9.5% and 5.4%, respectively.
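As a concrete illustration of the three repair steps named in the abstract, the Python sketch below shows how such a post-processor might look. It is not the authors' implementation: the parse-based truncation heuristic, the COMMON_MODULES shortlist, and the llmfix entry point are assumptions made purely for illustration.

```python
import ast
import re

# Modules HumanEval-style solutions commonly reference; an assumed shortlist,
# not taken from the paper.
COMMON_MODULES = ["math", "re", "itertools", "collections", "heapq"]


def normalize_indentation(code: str) -> str:
    """Replace tabs with four spaces so indentation is consistent
    (one possible reading of the 'indentation correction' step)."""
    return "\n".join(line.replace("\t", "    ") for line in code.splitlines())


def truncate_redundant(code: str) -> str:
    """Keep the longest prefix of lines that parses as valid Python,
    discarding trailing prose or repeated solutions."""
    lines = code.splitlines()
    for end in range(len(lines), 0, -1):
        candidate = "\n".join(lines[:end])
        try:
            ast.parse(candidate)
            return candidate
        except SyntaxError:
            continue
    return code  # nothing parses; leave unchanged


def add_missing_imports(code: str) -> str:
    """Prepend an import for each common module that is referenced
    in the code but never imported."""
    missing = [
        f"import {mod}"
        for mod in COMMON_MODULES
        if re.search(rf"\b{mod}\.", code)
        and not re.search(rf"^\s*(import|from)\s+{mod}\b", code, re.M)
    ]
    return "\n".join(missing + [code]) if missing else code


def llmfix(code: str) -> str:
    """Apply the three fixes in sequence."""
    return add_missing_imports(truncate_redundant(normalize_indentation(code)))


if __name__ == "__main__":
    raw = "def mean(xs):\n\treturn math.fsum(xs) / len(xs)\nHere is the solution!"
    print(llmfix(raw))
```

In practice the repaired completion would be concatenated with the original HumanEval or MBPP prompt and re-run against the test cases; that evaluation harness is omitted here.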
Related papers
- A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why? [9.246899995643918]
Large Language Models can still generate defective code that deviates from the specification.
Seven categories of non-syntactic mistakes were identified through extensive manual analyses.
Our evaluation demonstrated that GPT-4 with the ReAct prompting technique can achieve an F1 score of up to 0.65 when identifying reasons for LLMs' mistakes.
arXiv Detail & Related papers (2024-11-03T02:47:03Z)
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Where Do Large Language Models Fail When Generating Code? [10.519984835232359]
Large Language Models (LLMs) have shown great potential in code generation.
It is unclear what kinds of code generation errors LLMs can make.
We analyzed incorrect code snippets generated by six popular LLMs on the HumanEval dataset.
arXiv Detail & Related papers (2024-06-13T01:29:52Z)
- Evaluating LLMs at Detecting Errors in LLM Responses [30.645694514606507]
This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs.
We use ReaLMistake to evaluate error detectors based on 12 Large Language Models.
arXiv Detail & Related papers (2024-04-04T17:19:47Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference time optimization method to refine LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
- Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation.
Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
- Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) have recently exhibited remarkable reasoning capabilities in solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs under two constraints: being task-agnostic and minimizing reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.