Related papers: LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

URL: http://arxiv.org/abs/2402.16906v5
Date: Tue, 4 Jun 2024 06:55:27 GMT
Title: LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step
Authors: Lily Zhong, Zilong Wang, Jingbo Shang
Abstract summary: Large language models (LLMs) are leading significant progress in code generation. In this study, we introduce Large Language Model Debugger (LDB). LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution.
Score: 35.76881887942524
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM selections.

Related papers

On LLM-Assisted Generation of Smart Contracts from Business Processes [0.08192907805418582]
Large language models (LLMs) have changed the reality of how software is produced. We present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions. Our results show that LLM performance falls short of the perfect reliability required for smart contract development.
arXiv Detail & Related papers (2025-07-30T20:39:45Z)
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors [5.247363735860479]
Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks. Given LLMs' ability to understand and process diverse programs, they present a promising direction for building general-purpose surrogate models. We introduce SURGE, a benchmark with 1160 problems covering 8 key aspects. Through empirical analysis of 21 open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy.
arXiv Detail & Related papers (2025-02-16T15:38:19Z)
RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance [0.6062751776009752]
Large Language Models (LLMs) have shown incredible potential in code generation tasks. LLMs can generate code based on task descriptions, but accuracy remains limited. We introduce a novel architecture of LLM-based agents for code generation and automatic debugging: Refinement and Guidance Debugger (RGD). RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self-reflection and feedback.
arXiv Detail & Related papers (2024-10-02T05:07:02Z)
Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement [29.667170755786508]
This paper first introduces EVAL, a benchmark designed to evaluate the debug capabilities of Large Language Models (LLMs). Master generates refined code data according to the defined tasks for supervised finetuning. Finally, the Code Learner acts as a critic and reserves the generated problems that it cannot solve.
arXiv Detail & Related papers (2024-08-09T11:35:44Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
MEIC: Re-thinking RTL Debug Automation using LLMs [18.964523115622928]
This work introduces a novel framework, Make Each Iteration Count (MEIC). MEIC is suitable for identifying and correcting both syntax and function errors. To evaluate our framework, we provide an open-source dataset comprising 178 common RTL programming errors.
arXiv Detail & Related papers (2024-05-10T22:32:39Z)
Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. Code reasoning is one of the most essential abilities of code LLMs. We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks. FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code). Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references. It is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the reference that is available in many real world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.