Fixing Function-Level Code Generation Errors for Foundation Large Language Models
- URL: http://arxiv.org/abs/2409.00676v2
- Date: Sat, 18 Jan 2025 14:34:48 GMT
- Title: Fixing Function-Level Code Generation Errors for Foundation Large Language Models
- Authors: Hao Wen, Yueheng Zhu, Chao Liu, Xiaoxue Ren, Weiwei Du, Meng Yan
- Abstract summary: We conduct an empirical study on code generation errors and analyze their causes, identifying 19 categories of error causes.
Our analysis indicates that three of these causes can be fixed directly.
We propose a fixing method called LlmFix, which addresses these three types of errors through a three-step process.
- Score: 6.137340149146578
- Abstract: Function-level code generation leverages foundation Large Language Models (LLMs) to automatically produce source code with expected functionality. It has been widely investigated and applied in intelligent programming assistants, such as GitHub Copilot, to enhance software development productivity. Despite advancements in foundation LLMs, the generated code still contains many errors. Existing studies leverage static analysis tools (e.g., TBar) or add another fixing LLM (e.g., LDB) to post-process these errors. However, many errors remain unsolved because their root causes have not been investigated, making it challenging to design better fixing tools. In this paper, we first conducted an empirical study on these generation errors. Specifically, we reproduced 14 representative LLMs on the HumanEval dataset and verified the correctness of their outputs. We obtained 12,837 code generation errors and analyzed their causes, leading to 19 categories of error causes. Our empirical analysis indicated that three of these causes can be directly fixed. Based on these findings, we proposed a fixing method called LlmFix, which addresses these three types of errors through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. We evaluate LlmFix from two perspectives: its performance on error-fixing tasks and its impact on function-level code generation tasks. For error-fixing performance, we built an evaluation dataset, LlmErrorEval. Experimental results show that LlmFix achieves a fix rate of 17.1%, outperforming the best baseline, LDB, by 8.9%. For code generation improvements, evaluations of LlmFix on both the HumanEval and MBPP datasets demonstrate its effectiveness, improving code generation accuracy by an average of 7.5% across 14 LLMs.
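The three fixes named in the abstract are mechanical enough to illustrate with code. The sketch below is a hypothetical reading of that three-step process, not the paper's implementation: the function names (fix_indentation, truncate_redundant_code, add_missing_imports), the heuristics inside them, and the final ast.parse guard are all our own assumptions.

```python
# Hypothetical sketch of a LlmFix-style post-processor for a generated Python
# completion. Names and heuristics are illustrative assumptions, not the
# paper's actual implementation.
import ast
import re

# Standard-library modules we are willing to auto-import if referenced.
COMMON_MODULES = {"math", "re", "itertools", "collections", "heapq", "functools"}


def fix_indentation(code: str) -> str:
    """Step 1: re-indent body lines that the model emitted at column 0."""
    fixed, inside_def = [], False
    for line in code.split("\n"):
        stripped = line.strip()
        if stripped.startswith(("def ", "class ", "import ", "from ")):
            inside_def = stripped.startswith("def ")
            fixed.append(line)
        elif inside_def and stripped and not line.startswith((" ", "\t")):
            # Body line emitted without indentation: give it one level.
            fixed.append("    " + stripped)
        else:
            fixed.append(line)
    return "\n".join(fixed)


def truncate_redundant_code(code: str) -> str:
    """Step 2: drop trailing content after the first complete function,
    e.g. example usage, tests, or a second re-generated solution."""
    lines = code.split("\n")
    seen_def = False
    for i, line in enumerate(lines):
        if line.startswith("def "):
            if seen_def:
                return "\n".join(lines[:i]).rstrip() + "\n"
            seen_def = True
        elif seen_def and line and not line.startswith((" ", "\t", "#")):
            # Top-level statement after the function body: treat as redundant.
            return "\n".join(lines[:i]).rstrip() + "\n"
    return code


def add_missing_imports(code: str) -> str:
    """Step 3: prepend imports for common modules that are used but not imported."""
    used = {m for m in COMMON_MODULES if re.search(rf"\b{m}\.", code)}
    imported = {m for m in used
                if re.search(rf"^\s*(import|from)\s+{m}\b", code, re.M)}
    header = "".join(f"import {m}\n" for m in sorted(used - imported))
    return header + code


def llmfix(code: str) -> str:
    """Apply the three fixes in sequence; keep the result only if it parses."""
    fixed = add_missing_imports(truncate_redundant_code(fix_indentation(code)))
    try:
        ast.parse(fixed)
        return fixed
    except SyntaxError:
        return code  # fall back to the original generation
```

Applied to a raw completion, llmfix(generated_code) returns the cleaned code only when it still parses; the syntax-check fallback is a guardrail of our own, not something the paper describes.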
Related papers
- ROCODE: Integrating Backtracking Mechanism and Program Analysis in Large Language Models for Code Generation [31.363781211927947]
Large language models (LLMs) have achieved impressive performance in code generation.
However, LLMs are susceptible to error accumulation during code generation.
We propose ROCODE, which integrates a backtracking mechanism and program analysis into LLMs for code generation.
arXiv Detail & Related papers (2024-11-11T16:39:13Z)
- Rectifier: Code Translation with Corrector via LLMs [11.38401806203093]
We propose a general corrector, Rectifier, a micro and universal model for repairing translation errors.
The experimental results on translation tasks between C++, Java, and Python show that our model has effective repair ability.
arXiv Detail & Related papers (2024-07-10T08:58:41Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complex than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- MEIC: Re-thinking RTL Debug Automation using LLMs [18.964523115622928]
This work introduces a novel framework, Make Each Iteration Count (MEIC).
MEIC is suitable for identifying and correcting both syntax and function errors.
To evaluate our framework, we provide an open-source dataset comprising 178 common RTL programming errors.
arXiv Detail & Related papers (2024-05-10T22:32:39Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method for refining LLM outputs.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L points on ASQA, and 2.2 ROUGE-L points on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
- Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) have recently exhibited remarkable reasoning capabilities in solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes.
We empirically evaluate the performance of several state-of-the-art LLMs on this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z)
- Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas.
We use a critic to check the fixer's output on real bad inputs and add good (fixed) outputs to the training data.
Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data (a schematic sketch of this loop follows after this entry).
BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
arXiv Detail & Related papers (2021-06-11T20:31:04Z)
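The BIFI summary above describes a concrete self-training loop, sketched schematically below. Everything here is a placeholder reading of that summary: fixer, breaker, critic, and train are hypothetical callables, and the data flow is our interpretation rather than the paper's actual pipeline.

```python
# Schematic sketch of one BIFI-style round (all names are hypothetical placeholders).

def bifi_round(fixer, breaker, real_bad, paired_data, train, critic):
    """One round of the Break-It-Fix-It loop summarized above."""
    # 1. Run the current fixer on real broken inputs.
    new_pairs = []
    for bad in real_bad:
        fixed = fixer(bad)
        # 2. Keep only outputs the critic accepts (e.g., code that now compiles).
        if critic(fixed):
            new_pairs.append((bad, fixed))
    paired_data = paired_data + new_pairs

    # 3. Train the breaker on the (good -> bad) direction so it can generate
    #    realistic broken versions of good code, yielding more paired data.
    breaker = train(breaker, [(good, bad) for bad, good in paired_data])
    paired_data = paired_data + [(breaker(good), good) for _, good in new_pairs]

    # 4. Re-train the fixer on the enlarged paired dataset.
    fixer = train(fixer, paired_data)
    return fixer, breaker, paired_data
```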