Related papers: A Multi-Agent Approach to Fault Localization via Graph-Based Retrieval and Reflexion

A Multi-Agent Approach to Fault Localization via Graph-Based Retrieval and Reflexion

URL: http://arxiv.org/abs/2409.13642v2
Date: Wed, 19 Mar 2025 04:22:52 GMT
Title: A Multi-Agent Approach to Fault Localization via Graph-Based Retrieval and Reflexion
Authors: Md Nakhla Rafi, Dong Jae Kim, Tse-Hsun Chen, Shaowei Wang,
Abstract summary: Traditional fault localization techniques require extensive training datasets and high computational resources.<n>Recent advances in Large Language Models (LLMs) offer new opportunities by enhancing code understanding and reasoning.<n>We propose LLM4FL, a multi-agent fault localization framework that utilizes three specialized LLM agents.<n> evaluated on the Defects4J benchmark, which includes 675 faults from 14 Java projects, LLM4FL achieves an 18.55% improvement in Top-1 accuracy over AutoFL and 4.82% over SoapFL.
Score: 8.22737389683156
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Identifying and resolving software faults remains a challenging and resource-intensive process. Traditional fault localization techniques, such as Spectrum-Based Fault Localization (SBFL), leverage statistical analysis of test coverage but often suffer from limited accuracy. While learning-based approaches improve fault localization, they demand extensive training datasets and high computational resources. Recent advances in Large Language Models (LLMs) offer new opportunities by enhancing code understanding and reasoning. However, existing LLM-based fault localization techniques face significant challenges, including token limitations, performance degradation with long inputs, and scalability issues in complex software systems. To overcome these obstacles, we propose LLM4FL, a multi-agent fault localization framework that utilizes three specialized LLM agents. First, the Context Extraction Agent applies an order-sensitive segmentation strategy to partition large coverage data within the LLM's token limit, analyze failure context, and prioritize failure-related methods. The Debugger Agent then processes the extracted data, which employs graph-based retrieval-augmented code navigation to reason about failure causes and rank suspicious methods. Finally, the Reviewer Agent re-evaluates the identified faulty methods using verbal reinforcement learning, engaging in self-criticism and iterative refinement. Evaluated on the Defects4J (V2.0.0) benchmark, which includes 675 faults from 14 Java projects, LLM4FL achieves an 18.55\% improvement in Top-1 accuracy over AutoFL and 4.82\% over SoapFL. It outperforms supervised techniques such as DeepFL and Grace, all without requiring task-specific training. Furthermore, its coverage segmentation and prompt chaining strategies enhance performance, increasing Top-1 accuracy by up to 22\%.

Related papers

Context-Enhanced Vulnerability Detection Based on Large Language Model [17.922081397554155]
We propose a context-enhanced vulnerability detection approach that combines program analysis with large language models. Specifically, we use program analysis to extract contextual information at various levels of abstraction, thereby filtering out irrelevant noise. Our goal is to strike a balance between providing sufficient detail to accurately capture vulnerabilities and minimizing unnecessary complexity.
arXiv Detail & Related papers (2025-04-23T16:54:16Z)
Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework [11.613261852608062]
WebShell attacks, in which malicious scripts are injected into web servers, are a major cybersecurity threat. This work is the first to explore the feasibility and limitations of Large Language Models for WebShell detection.
arXiv Detail & Related papers (2025-04-14T21:09:37Z)
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Program Slicing in the Era of Large Language Models [7.990456190723922]
Program slicing is a critical technique in software engineering, enabling developers to isolate relevant portions of code. This study investigates the application of large language models (LLMs) to both static and dynamic program slicing.
arXiv Detail & Related papers (2024-09-19T00:07:56Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Over-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
Evaluation and Improvement of Fault Detection for Large Language Models [30.760472387136954]
This paper investigates the effectiveness of existing fault detection methods for large language models (LLMs) We propose textbfMuCS, a prompt textbfMutation-based prediction textbfConfidence textbfSmoothing framework to boost the fault detection capability of existing methods.
arXiv Detail & Related papers (2024-04-14T07:06:12Z)
Aligning LLMs for FL-free Program Repair [14.935596175148586]
This paper investigates a new approach to adapt large language models (LLMs) to program repair. Our core insight is that LLM's APR capability can be greatly improved by simply aligning the output to their training objective. Based on this insight, we designed D4C, a straightforward prompting framework for APR.
arXiv Detail & Related papers (2024-04-13T02:36:40Z)
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation. To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
AgentFL: Scaling LLM-based Fault Localization to Project-Level Context [11.147750199280813]
This paper presents AgentFL, a multi-agent system based on ChatGPT for automated fault localization. By simulating the behavior of a human developer, AgentFL models the FL task as a three-step process, which involves comprehension, navigation, and confirmation. The evaluation on the widely used Defects4J-V1.2.0 benchmark shows that AgentFL can localize 157 out of 395 bugs within Top-1.
arXiv Detail & Related papers (2024-03-25T01:58:19Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR) In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
Large Language Models for Test-Free Fault Localization [11.080712737595174]
We propose a language model based fault localization approach that locates buggy lines of code without any test coverage information. We fine-tune language models with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs. Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%-54.4%, and Top-5 results by 14.4%-35.6%.
arXiv Detail & Related papers (2023-10-03T01:26:39Z)
Large Language Models in Fault Localisation [32.87044163543427]
This paper investigates the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop.
arXiv Detail & Related papers (2023-08-29T13:07:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.