Related papers: Where's the Bug? Attention Probing for Scalable Fault Localization

URL: http://arxiv.org/abs/2502.13966v2
Date: Thu, 20 Feb 2025 02:29:19 GMT
Title: Where's the Bug? Attention Probing for Scalable Fault Localization
Authors: Adam Stein, Arthur Wayne, Aaditya Naik, Mayur Naik, Eric Wong
Abstract summary: We present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels. BAP is significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.
Score: 18.699014321422023
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ensuring code correctness remains a challenging problem even as large language models (LLMs) become increasingly capable at code-related tasks. While LLM-based program repair systems can propose bug fixes using only a user's bug report, their effectiveness is fundamentally limited by their ability to perform fault localization (FL), a challenging problem for both humans and LLMs. Existing FL approaches rely on executable test cases, require training on costly and often noisy line-level annotations, or demand resource-intensive LLMs. In this paper, we present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels, outperforming traditional FL baselines and prompting of large-scale LLMs. We evaluate our approach across a variety of code settings, including real-world Java bugs from the standard Defects4J dataset as well as seven other datasets which span a diverse set of bug types and languages. Averaged across all eight datasets, BAP improves by 34.6% top-1 accuracy compared to the strongest baseline and 93.4% over zero-shot prompting GPT-4o. BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.

Related papers

Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
Enhancing Fault Localization Through Ordered Code Analysis with LLM Agents and Self-Reflection [8.22737389683156]
Large Language Models (LLMs) offer promising improvements in fault localization by enhancing code comprehension and reasoning. We introduce LLM4FL, a novel LLM-agent-based fault localization approach that integrates SBFL rankings with a divide-and-conquer strategy. Our results demonstrate that LLM4FL outperforms AutoFL by 19.27% in Top-1 accuracy and surpasses state-of-the-art supervised techniques such as DeepFL and Grace.
arXiv Detail & Related papers (2024-09-20T16:47:34Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts. We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries. Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests [44.13331329339185]
We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization. Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
arXiv Detail & Related papers (2024-05-01T15:15:52Z)
A Deep Dive into Large Language Models for Automated Bug Localization and Repair [12.756202755547024]
Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR). In this study, we take a deep dive into automated bug fixing utilizing LLMs. This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information. Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark.
arXiv Detail & Related papers (2024-04-17T17:48:18Z)
DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models [3.1690235522182104]
Large language models (LLMs) are increasingly used to solve various programming tasks. We show that the task is difficult as it requires the model to learn long-range code relationships. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs.
arXiv Detail & Related papers (2024-02-19T18:35:40Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education. We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
Large Language Models for Test-Free Fault Localization [11.080712737595174]
We propose a language model based fault localization approach that locates buggy lines of code without any test coverage information. We fine-tune language models with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs. Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%-54.4%, and Top-5 results by 14.4%-35.6%.
arXiv Detail & Related papers (2023-10-03T01:26:39Z)
Large Language Models in Fault Localisation [32.87044163543427]
This paper investigates the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop.
arXiv Detail & Related papers (2023-08-29T13:07:27Z)
Communication-Efficient Robust Federated Learning with Noisy Labels [144.31995882209932]
Federated learning (FL) is a promising privacy-preserving machine learning paradigm over distributed data. We propose a learning-based reweighting approach to mitigate the effect of noisy labels in FL. Our approach has shown superior performance on several real-world datasets compared to various baselines.
arXiv Detail & Related papers (2022-06-11T16:21:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.