DebugBench: Evaluating Debugging Capability of Large Language Models
- URL: http://arxiv.org/abs/2401.04621v3
- Date: Thu, 6 Jun 2024 06:10:00 GMT
- Title: DebugBench: Evaluating Debugging Capability of Large Language Models
- Authors: Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun
- Abstract summary: DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
- Score: 80.73121177868357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce 'DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models achieve relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance, but it is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.
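The construction and evaluation protocol described in the abstract (implant a bug into a correct LeetCode-style solution, ask a model zero-shot to repair it, then check the fix against tests) can be pictured with a minimal sketch. Everything below is an assumption for illustration: the prompt wording, the `query_llm` stand-in, and the example bug are not taken from the benchmark itself.

```python
# Minimal sketch of a DebugBench-style zero-shot repair check (illustrative only).
# The prompt format, the implanted bug, and query_llm are hypothetical stand-ins.

BUGGY_SNIPPET = '''\
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = x   # implanted bug: should be seen[x] = i
    return []
'''

ZERO_SHOT_PROMPT = (
    "The following Python function contains a bug. "
    "Return only the corrected function.\n\n" + BUGGY_SNIPPET
)


def query_llm(prompt: str) -> str:
    """Stand-in for a call to a commercial or open-source model.

    Returns a hard-coded fix here so the sketch runs end to end.
    """
    return BUGGY_SNIPPET.replace("seen[x] = x", "seen[x] = i")


def passes_tests(candidate: str) -> bool:
    """Execute the candidate fix and check it against a reference test."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        return namespace["two_sum"]([2, 7, 11, 15], 9) == [0, 1]
    except Exception:
        return False


if __name__ == "__main__":
    fixed = query_llm(ZERO_SHOT_PROMPT)
    print("pass" if passes_tests(fixed) else "fail")
```

In the actual benchmark the implanted bugs span four major categories and 18 minor types across C++, Java, and Python, and pass-rate-style metrics are aggregated over the 4,253 instances rather than a single hand-picked example.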
Related papers
- MdEval: Massively Multilingual Code Debugging [37.48700033342978]
We propose the first massively multilingual debug benchmark, which includes 3.6K test samples covering 18 programming languages.
We introduce the instruction corpora MDEVAL-INSTRUCT by injecting bugs into the correct multilingual queries and solutions.
Our experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs.
arXiv Detail & Related papers (2024-11-04T17:36:40Z)
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
- Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement [29.667170755786508]
This paper first introduces EVAL, a benchmark designed to evaluate the debugging capabilities of Large Language Models (LLMs).
The Master generates refined code data according to the defined tasks for supervised fine-tuning.
Finally, the Code Learner acts as a critic and reserves the generated problems that it cannot solve.
arXiv Detail & Related papers (2024-08-09T11:35:44Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
But their performance in tackling programming problems with complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug by using a "print debug" method.
arXiv Detail & Related papers (2024-01-10T18:37:59Z)
- Language Models are Better Bug Detector Through Code-Pair Classification [0.26107298043931204]
In this paper, we propose a code-pair classification task in which both the buggy and non-buggy versions are given to the model, and the model identifies the buggy one.
Experiments indicate that an LLM can often pick out the buggy version from the non-buggy one, and that the code-pair classification task is much easier than being given a single snippet and deciding whether and where a bug exists (a sketch of this prompt format follows the list below).
arXiv Detail & Related papers (2023-11-14T07:20:57Z)
- Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation.
Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
- Isolating Compiler Bugs by Generating Effective Witness Programs with Large Language Models [10.660543763757518]
Existing compiler bug isolation approaches convert the problem into a test program mutation problem.
We propose a new approach named LLM4CBI to utilize LLMs to generate effective test programs for compiler bug isolation.
Compared with state-of-the-art approaches over 120 real bugs from GCC and LLVM, our evaluation demonstrates the advantages of LLM4CBI.
arXiv Detail & Related papers (2023-07-02T15:20:54Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
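As a rough illustration of the code-pair classification setup summarized above (see "Language Models are Better Bug Detector Through Code-Pair Classification"), the sketch below shows one way such a prompt could be assembled. The prompt wording, the example functions, and the helper name are assumptions for illustration, not the paper's actual protocol.

```python
# Hypothetical sketch of a code-pair classification prompt: the model is shown
# both versions of a function and asked which one is buggy.

CORRECT_VERSION = '''\
def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
'''

BUGGY_VERSION = '''\
def factorial(n):
    result = 1
    for i in range(2, n):   # off-by-one: the final factor n is never multiplied in
        result *= i
    return result
'''


def build_pair_prompt(version_a: str, version_b: str) -> str:
    """Assemble a prompt asking the model to name the buggy version (A or B)."""
    return (
        "Exactly one of the two functions below contains a bug.\n"
        "Answer with 'A' or 'B' only.\n\n"
        f"Version A:\n{version_a}\nVersion B:\n{version_b}"
    )


if __name__ == "__main__":
    # In the paper's experiments this prompt would be sent to an LLM; here it is only printed.
    print(build_pair_prompt(CORRECT_VERSION, BUGGY_VERSION))
```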