Teaching Large Language Models to Self-Debug
- URL: http://arxiv.org/abs/2304.05128v2
- Date: Thu, 5 Oct 2023 09:12:07 GMT
- Title: Teaching Large Language Models to Self-Debug
- Authors: Xinyun Chen, Maxwell Lin, Nathanael Sch\"arli, Denny Zhou
- Abstract summary: Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
- Score: 62.424077000154945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved impressive performance on code
generation. However, for complex programming tasks, generating the correct
solution in one go becomes challenging, thus some prior works have designed
program repair approaches to improve code generation performance. In this work,
we propose Self-Debugging, which teaches a large language model to debug its
predicted program via few-shot demonstrations. In particular, we demonstrate
that Self-Debugging can teach the large language model to perform rubber duck
debugging; i.e., without any human feedback on the code correctness or error
messages, the model is able to identify its mistakes by investigating the
execution results and explaining the generated code in natural language.
Self-Debugging achieves the state-of-the-art performance on several code
generation benchmarks, including the Spider dataset for text-to-SQL generation,
TransCoder for C++-to-Python translation, and MBPP for text-to-Python
generation. On the Spider benchmark where there are no unit tests to verify the
correctness of predictions, Self-Debugging with code explanation consistently
improves the baseline by 2-3%, and improves the prediction accuracy on problems
of the hardest level by 9%. On TransCoder and MBPP where unit tests are
available, Self-Debugging improves the baseline accuracy by up to 12%.
Meanwhile, by leveraging feedback messages and reusing failed predictions,
Self-Debugging notably improves sample efficiency, and can match or outperform
baseline models that generate more than 10x candidate programs.
Related papers
- Leveraging Print Debugging to Improve Code Generation in Large Language
Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
But their performance in tackling programming problems with complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
arXiv Detail & Related papers (2024-01-10T18:37:59Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z) - BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - AstBERT: Enabling Language Model for Code Understanding with Abstract
Syntax Tree [3.1087379479634927]
We propose the AstBERT model, a pre-trained language model aiming to better understand the programming language (PL) using the abstract syntax tree (AST)
Specifically, we collect a colossal amount of source codes (both java and python) from GitHub, in which information of the source codes can be interpreted and integrated.
Experiment results show that our AstBERT model achieves state-of-the-art performance on both downstream tasks.
arXiv Detail & Related papers (2022-01-20T03:27:26Z) - Program Synthesis with Large Language Models [40.41120807053989]
We evaluate large language models for program synthesis in Python.
We find that synthesis performance scales log-linearly with model size.
We find that even our best models are generally unable to predict the output of a program given a specific input.
arXiv Detail & Related papers (2021-08-16T03:57:30Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z) - Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub.
We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch.
We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.