Teaching Large Language Models to Self-Debug
- URL: http://arxiv.org/abs/2304.05128v2
- Date: Thu, 5 Oct 2023 09:12:07 GMT
- Title: Teaching Large Language Models to Self-Debug
- Authors: Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou
- Abstract summary: Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
- Score: 62.424077000154945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved impressive performance on code
generation. However, for complex programming tasks, generating the correct
solution in one go becomes challenging, thus some prior works have designed
program repair approaches to improve code generation performance. In this work,
we propose Self-Debugging, which teaches a large language model to debug its
predicted program via few-shot demonstrations. In particular, we demonstrate
that Self-Debugging can teach the large language model to perform rubber duck
debugging; i.e., without any human feedback on the code correctness or error
messages, the model is able to identify its mistakes by investigating the
execution results and explaining the generated code in natural language.
Self-Debugging achieves state-of-the-art performance on several code
generation benchmarks, including the Spider dataset for text-to-SQL generation,
TransCoder for C++-to-Python translation, and MBPP for text-to-Python
generation. On the Spider benchmark where there are no unit tests to verify the
correctness of predictions, Self-Debugging with code explanation consistently
improves the baseline by 2-3%, and improves the prediction accuracy on problems
of the hardest level by 9%. On TransCoder and MBPP where unit tests are
available, Self-Debugging improves the baseline accuracy by up to 12%.
Meanwhile, by leveraging feedback messages and reusing failed predictions,
Self-Debugging notably improves sample efficiency, and can match or outperform
baseline models that generate more than 10x candidate programs.
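To ground the description, the following is a minimal sketch of the Self-Debugging loop, assuming a hypothetical `generate` completion function standing in for the LLM; the prompts shown are illustrative placeholders, not the paper's few-shot demonstrations.

```python
# Minimal sketch of the Self-Debugging loop described above. `generate`
# is a hypothetical stand-in for an LLM completion call, and the prompts
# are illustrative, not the paper's few-shot demonstrations.
from typing import Callable, List, Tuple

def run_tests(code: str, tests: List[str]) -> Tuple[bool, str]:
    """Execute the candidate program against unit tests and return
    (passed, feedback), where feedback carries the error message."""
    env: dict = {}
    try:
        exec(code, env)          # define the candidate function(s)
        for test in tests:
            exec(test, env)      # each test is an assert statement
        return True, "all tests passed"
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def self_debug(generate: Callable[[str], str], task: str,
               tests: List[str], max_turns: int = 3) -> str:
    code = generate(f"Write a Python program for: {task}")
    for _ in range(max_turns):
        passed, feedback = run_tests(code, tests)
        if passed:
            break
        # Rubber-duck step: the model explains its own code, then
        # revises it using the explanation and execution feedback.
        explanation = generate(f"Explain this code line by line:\n{code}")
        code = generate(
            f"Task: {task}\nCode:\n{code}\nExplanation:\n{explanation}\n"
            f"Execution feedback: {feedback}\nReturn the fixed program.")
    return code
```

On benchmarks without unit tests, such as Spider, the same loop would drop `run_tests` and condition only on the execution result of the predicted query and the model's own explanation, which is the code-explanation variant the abstract highlights.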
Related papers
- MdEval: Massively Multilingual Code Debugging [37.48700033342978]
We propose the first massively multilingual debugging benchmark, which includes 3.6K test samples across 18 programming languages.
We introduce the instruction corpora MDEVAL-INSTRUCT by injecting bugs into the correct multilingual queries and solutions.
Our experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs.
arXiv Detail & Related papers (2024-11-04T17:36:40Z)
- From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging [5.910272203315325]
We introduce Multi-Granularity Debugger (MG Debugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity.
MG Debugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error.
It achieves an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix.
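As a purely illustrative sketch of this bottom-up decomposition (not the authors' implementation), subfunctions can be extracted with Python's `ast` module and repaired innermost-first; the `fix` callable is a hypothetical LLM-backed repairer.

```python
# Illustrative sketch (not the authors' code): decompose a program into
# its subfunctions and repair the smallest (innermost) ones first, so
# leaf functions are fixed before the functions that depend on them.
import ast
from typing import Callable

def repair_bottom_up(code: str, fix: Callable[[str], str]) -> str:
    """Repair subfunctions smallest-first; `fix` is a hypothetical
    LLM-backed repairer taking and returning function source."""
    repaired: set = set()
    while True:
        tree = ast.parse(code)  # re-parse so source offsets stay valid
        defs = sorted((n for n in ast.walk(tree)
                       if isinstance(n, ast.FunctionDef)
                       and n.name not in repaired),
                      key=lambda n: n.end_lineno - n.lineno)
        if not defs:
            return code
        fn = defs[0]  # smallest remaining = finest granularity
        src = ast.get_source_segment(code, fn)
        if src:
            code = code.replace(src, fix(src))
        repaired.add(fn.name)
```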
arXiv Detail & Related papers (2024-10-02T03:57:21Z)
- Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation [0.24578723416255752]
We evaluate five different large language models (LLMs) on their capabilities for text-to-code generation.
ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama.
arXiv Detail & Related papers (2024-09-06T10:03:49Z)
- Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
However, their performance on programming problems involving complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
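A minimal sketch of what such a print-debugging round might look like, assuming a hypothetical `generate` completion function; the prompts are illustrative, not the paper's:

```python
# Sketch of one print-debugging round (illustrative; the prompts and
# the `generate` completion function are assumptions, not the paper's).
import io
import contextlib
from typing import Callable

def run_with_output(code: str) -> str:
    """Execute the program and capture everything it prints, so the
    trace can be handed back to the model as debugging context."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        buf.write(f"\n{type(e).__name__}: {e}")
    return buf.getvalue()

def print_debug_round(generate: Callable[[str], str], code: str) -> str:
    instrumented = generate(
        f"Insert print statements to trace key variables:\n{code}")
    trace = run_with_output(instrumented)
    return generate(f"Code:\n{code}\nDebug trace:\n{trace}\nFix the bug.")
```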
arXiv Detail & Related papers (2024-01-10T18:37:59Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
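A sketch of this verification-based reranking, where `execute` and `verifier_prob` are assumed stand-ins for a sandboxed executor and the trained verifier:

```python
# Sketch of reranking in the spirit of LEVER: execute each sampled
# program and pick the one a learned verifier scores highest.
# `execute` and `verifier_prob` are hypothetical stand-ins.
from typing import Callable, List

def rerank_by_verifier(
    nl_input: str,
    programs: List[str],
    execute: Callable[[str], str],
    verifier_prob: Callable[[str, str, str], float],
) -> str:
    """Return the sampled program with the highest verifier score; the
    verifier sees the natural language input, the program itself, and
    its execution result (the three signals named in the summary)."""
    def score(prog: str) -> float:
        return verifier_prob(nl_input, prog, execute(prog))
    return max(programs, key=score)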
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving automated program repair (APR) performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
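For illustration only, reranking with such an execution-free correctness predictor could look like the following; `ranker_score` is a hypothetical stand-in for the trained ranker.

```python
# Sketch of execution-free reranking: order sampled programs by a
# learned correctness score and return the top one, boosting pass@1.
# `ranker_score` is a hypothetical stand-in for the trained ranker.
from typing import Callable, List

def pick_top_candidate(task: str, samples: List[str],
                       ranker_score: Callable[[str, str], float]) -> str:
    """Choose the sample the fault-aware ranker deems most likely
    correct, without executing any candidate."""
    return max(samples, key=lambda code: ranker_score(task, code))
```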
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories.
We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch.
We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.