DebugBench: Evaluating Debugging Capability of Large Language Models
- URL: http://arxiv.org/abs/2401.04621v3
- Date: Thu, 6 Jun 2024 06:10:00 GMT
- Title: DebugBench: Evaluating Debugging Capability of Large Language Models
- Authors: Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun
- Abstract summary: DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
- Score: 80.73121177868357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into the source data with GPT-4, and ensure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models achieve relatively lower pass rate scores; (2) the difficulty of debugging fluctuates notably depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance but is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.
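Below is a minimal sketch, assuming a hypothetical `query_llm` helper and instance fields (`problem`, `buggy_code`, `entry_point`, `tests`) that the abstract does not specify, of how a zero-shot debugging pass rate could be computed over DebugBench-style instances. It illustrates the evaluation protocol described above; it is not the authors' released harness.

```python
# Sketch of a zero-shot debugging evaluation loop in the spirit of DebugBench:
# prompt a model with a buggy snippet, then check the repair against the
# problem's test cases. `query_llm` is a hypothetical placeholder for the
# model API being evaluated.

def query_llm(prompt: str) -> str:
    """Placeholder: return the model's repaired code for the given prompt."""
    raise NotImplementedError

def build_prompt(problem: str, buggy_code: str) -> str:
    return (
        "The following Python solution is buggy.\n"
        f"Problem: {problem}\n"
        f"Buggy code:\n{buggy_code}\n"
        "Return only the corrected code."
    )

def passes_tests(code: str, entry_point: str, tests: list[tuple[tuple, object]]) -> bool:
    """Run the candidate fix against (args, expected) pairs."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # untrusted code; a real harness would sandbox this
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False

def evaluate(instances: list[dict]) -> float:
    """Zero-shot pass rate over a list of benchmark instances."""
    solved = 0
    for inst in instances:
        fix = query_llm(build_prompt(inst["problem"], inst["buggy_code"]))
        if passes_tests(fix, inst["entry_point"], inst["tests"]):
            solved += 1
    return solved / len(instances)
```

In practice such a harness would sandbox execution and report pass rates per bug category, mirroring the four-category/18-type taxonomy described above.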
Related papers
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback; a minimal sketch of such a loop appears below.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
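As a hedged illustration of the self-critique idea summarized above (not the paper's exact procedure), the sketch below byte-compiles the model's code, feeds any diagnostic back, and asks the model to critique and revise. `query_llm`, the prompt wording, and the use of Python's `py_compile` as the "compiler" are assumptions.

```python
# Training-free self-critique loop sketch: compile the model's code, feed the
# diagnostic back, and request a revision. `query_llm` stands in for an
# arbitrary model API; the loop shape is illustrative only.

import subprocess
import sys
import tempfile

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the model under test

def compiler_feedback(code: str) -> str:
    """Return an empty string if the code byte-compiles, else the diagnostic."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-m", "py_compile", path],
        capture_output=True, text=True,
    )
    return result.stderr.strip()

def self_critique_repair(code: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        diagnostic = compiler_feedback(code)
        if not diagnostic:
            return code  # compiles cleanly; semantic checks would go here
        code = query_llm(
            "Critique and fix the following code.\n"
            f"Code:\n{code}\nCompiler feedback:\n{diagnostic}\n"
            "Return only the corrected code."
        )
    return code
```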
- MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark that demands meta-reasoning skills.
MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Leveraging Print Debugging to Improve Code Generation in Large Language Models [63.63160583432348]
Large language models (LLMs) have made significant progress in code generation tasks.
However, their performance on programming problems involving complex data structures and algorithms remains suboptimal.
We propose an in-context learning approach that guides LLMs to debug with a "print debugging" method; a minimal sketch of this loop appears below.
arXiv Detail & Related papers (2024-01-10T18:37:59Z)
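The following is a minimal sketch of the print-debugging idea summarized above: ask the model to instrument the code with print statements, run it on a failing input to collect a trace, then feed the trace back for a fix. `query_llm`, the prompt wording, and the single-round structure are assumptions, not the paper's exact method.

```python
# Print-debugging round sketch: instrument, execute, capture the trace, repair.
# `query_llm` is a placeholder for the model API.

import io
import contextlib

def query_llm(prompt: str) -> str:
    raise NotImplementedError

def run_and_capture(code: str, test_snippet: str) -> str:
    """Execute instrumented code plus a failing test, capturing printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code + "\n" + test_snippet, {})
        except Exception as exc:  # keep the exception as part of the trace
            print(f"EXCEPTION: {exc!r}")
    return buffer.getvalue()

def print_debug_round(buggy_code: str, test_snippet: str) -> str:
    instrumented = query_llm(
        "Add print statements to this code to expose intermediate values:\n"
        + buggy_code
    )
    trace = run_and_capture(instrumented, test_snippet)
    return query_llm(
        "Given the code, the failing test, and the printed trace, fix the bug.\n"
        f"Code:\n{buggy_code}\nTest:\n{test_snippet}\nTrace:\n{trace}\n"
        "Return only the corrected code."
    )
```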
- Language Models are Better Bug Detector Through Code-Pair Classification [0.26107298043931204]
In this paper, we propose a code-pair classification task in which both the buggy and non-buggy versions are given to the model, and the model identifies the buggy one.
Experiments indicate that an LLM can often pick out the buggy version from the non-buggy one, and that the code-pair classification task is much easier than being given a single snippet and deciding whether and where a bug exists; a minimal sketch of this setup appears below.
arXiv Detail & Related papers (2023-11-14T07:20:57Z)
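A minimal sketch of the code-pair classification setup described above: the model sees a buggy and a correct version in random order and must name the buggy one. `query_llm`, the A/B labeling, and the exact-match scoring are illustrative assumptions.

```python
# Code-pair classification sketch: present both versions, ask which is buggy,
# and score by exact match on the answer letter.

import random

def query_llm(prompt: str) -> str:
    raise NotImplementedError

def classify_pair(buggy: str, correct: str) -> bool:
    """Return True if the model correctly identifies the buggy version."""
    versions = [("A", buggy), ("B", correct)]
    random.shuffle(versions)
    prompt = (
        "One of the following two snippets contains a bug. "
        "Answer with the letter of the buggy snippet only.\n"
        + "\n".join(f"Snippet {label}:\n{code}" for label, code in versions)
    )
    answer = query_llm(prompt).strip().upper()
    buggy_label = next(label for label, code in versions if code == buggy)
    return answer == buggy_label

def pair_accuracy(pairs: list[tuple[str, str]]) -> float:
    return sum(classify_pair(b, c) for b, c in pairs) / len(pairs)
```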
- Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation.
Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
- The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
- Isolating Compiler Bugs by Generating Effective Witness Programs with Large Language Models [10.660543763757518]
Existing compiler bug isolation approaches convert the problem into a test program mutation problem.
We propose a new approach named LLM4CBI to utilize LLMs to generate effective test programs for compiler bug isolation.
Our evaluation on 120 real bugs from GCC and LLVM demonstrates the advantages of LLM4CBI over state-of-the-art approaches; a rough sketch of the witness-program idea appears below.
arXiv Detail & Related papers (2023-07-02T15:20:54Z)
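As a rough, hedged sketch of the witness-program idea behind LLM4CBI (the real pipeline is considerably more involved), the code below asks a model to mutate a seed C program and keeps mutants whose behavior differs between a suspect and a reference compiler configuration. `query_llm`, the compiler commands, and the optimization flags are assumptions.

```python
# Witness-program sketch: LLM-generated mutant + differential compilation.
# Compiler paths/flags are assumed for illustration only.

import os
import subprocess
import tempfile

def query_llm(prompt: str) -> str:
    raise NotImplementedError

def run_with(compiler_cmd: list[str], c_source: str) -> str:
    """Compile and run a C program, returning its stdout (or an error marker)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test.c")
        exe = os.path.join(tmp, "test")
        with open(src, "w") as f:
            f.write(c_source)
        if subprocess.run(compiler_cmd + [src, "-o", exe]).returncode != 0:
            return "<compile error>"
        result = subprocess.run([exe], capture_output=True, text=True, timeout=5)
        return result.stdout

def generate_witness(seed_program: str) -> str | None:
    """Return a mutant that still exposes the suspected miscompilation, if any."""
    mutant = query_llm(
        "Mutate this C program while keeping it free of undefined behavior:\n"
        + seed_program
    )
    suspect = run_with(["gcc", "-O2"], mutant)    # suspect configuration (assumed)
    reference = run_with(["gcc", "-O0"], mutant)  # reference configuration (assumed)
    return mutant if suspect != reference else None
```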
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations; a minimal sketch of such a loop appears below.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
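A minimal sketch, under assumed names (`query_llm`, placeholder few-shot demonstrations, simple unit-test feedback), of a Self-Debugging-style loop in which the model revises its own prediction from execution feedback; it illustrates the idea rather than reproducing the paper's prompts.

```python
# Self-Debugging-style loop sketch: generate code, execute it, feed the
# execution feedback (plus few-shot debugging demonstrations) back to the model.

def query_llm(prompt: str) -> str:
    raise NotImplementedError

FEW_SHOT_DEMOS = "<few-shot debugging demonstrations would go here>"

def execute(code: str, tests: list[tuple[tuple, object]], entry_point: str) -> str:
    """Return 'PASS' or a short description of the first failure."""
    env: dict = {}
    try:
        exec(code, env)
        for args, expected in tests:
            got = env[entry_point](*args)
            if got != expected:
                return f"FAIL: {entry_point}{args} returned {got!r}, expected {expected!r}"
        return "PASS"
    except Exception as exc:
        return f"ERROR: {exc!r}"

def self_debug(problem: str, tests, entry_point: str, max_turns: int = 3) -> str:
    code = query_llm(f"Write a Python function `{entry_point}` for: {problem}")
    for _ in range(max_turns):
        feedback = execute(code, tests, entry_point)
        if feedback == "PASS":
            break
        code = query_llm(
            f"{FEW_SHOT_DEMOS}\n"
            f"Problem: {problem}\nCurrent code:\n{code}\n"
            f"Execution feedback: {feedback}\n"
            "Explain the bug, then return the corrected code."
        )
    return code
```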
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.