SBEST: Spectrum-Based Fault Localization Without Fault-Triggering Tests
- URL: http://arxiv.org/abs/2405.00565v2
- Date: Mon, 27 Oct 2025 16:01:49 GMT
- Title: SBEST: Spectrum-Based Fault Localization Without Fault-Triggering Tests
- Authors: Md Nakhla Rafi, Lorena Barreto Simedo Pacheco, An Ran Chen, Jinqiu Yang, Tse-Hsun Chen
- Abstract summary: This study investigates the feasibility of using stack traces from crash reports as proxies for fault-triggering tests in Spectrum-Based Fault Localization. We propose SBEST, a novel approach that integrates stack trace information with test coverage data to perform fault localization when fault-triggering tests are missing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fault localization is a critical step in software maintenance. Yet, many existing techniques, such as Spectrum-Based Fault Localization (SBFL), rely heavily on the availability of fault-triggering tests to be effective. In practice, especially for crash-related bugs, such tests are frequently unavailable. Meanwhile, bug reports containing stack traces often serve as the only available evidence of runtime failures and provide valuable context for debugging. This study investigates the feasibility of using stack traces from crash reports as proxies for fault-triggering tests in SBFL. Our empirical analysis of 60 crash-report bugs in Defects4J reveals that only 3.33% of these bugs have fault-triggering tests available at the time of bug report creation. However, 98.3% of bug fixes directly address the exception observed in the stack trace, and 78.3% of buggy methods are reachable within an average of 0.34 method calls from the stack trace. These findings underscore the diagnostic value of stack traces in the absence of failing tests. Motivated by these findings, we propose SBEST, a novel approach that integrates stack trace information with test coverage data to perform fault localization when fault-triggering tests are missing. When fault-triggering tests are absent, SBEST improves over baseline approaches by 32.22% in Mean Average Precision (MAP) and 17.43% in Mean Reciprocal Rank (MRR).
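The abstract does not spell out SBEST's scoring formula, but the idea of substituting stack-trace evidence for a missing failing test can be sketched in a few lines of Python. Below, a standard Ochiai SBFL metric is applied while the crash stack trace stands in as a single virtual failing execution; the function names, data shapes, and the virtual-failing-run reading are our assumptions, not the paper's implementation.

```python
from math import sqrt

def ochiai(ef, ep, total_failed):
    """Standard Ochiai suspiciousness given failing/passing coverage counts."""
    return ef / sqrt(total_failed * (ef + ep)) if ef else 0.0

def rank_without_failing_tests(passing_coverage, stack_trace_methods):
    """Treat the crash stack trace as one virtual failing execution: a method
    is 'covered' by the failure iff it appears on the stack trace, while
    passing coverage comes from the existing (all-passing) test suite."""
    elements = set().union(*passing_coverage.values()) | set(stack_trace_methods)
    scores = {}
    for m in elements:
        ef = 1 if m in stack_trace_methods else 0   # one virtual failing run
        ep = sum(1 for cov in passing_coverage.values() if m in cov)
        scores[m] = ochiai(ef, ep, total_failed=1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: 'tokenize' is on the trace but in no passing test, so it ranks first.
cov = {"test_eval": {"eval", "parse"}, "test_parse": {"parse"}}
print(rank_without_failing_tests(cov, {"parse", "tokenize"}))
```

The paper's finding that buggy methods sit an average of 0.34 calls from the trace suggests the real approach also scores methods near, not only on, the trace.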
Related papers
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs.
kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback.
Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z)
- BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills [59.003563837981886]
High-quality bugs are key to training the next generation of language-model-based software engineering (SWE) agents.
We introduce a novel method for the synthetic generation of difficult and diverse bugs.
arXiv Detail & Related papers (2025-10-22T17:58:56Z)
- AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests [0.7564784873669823]
We introduce AssertFlip, a technique for automatically generating Bug Reproducible Tests (BRTs) using large language models (LLMs).
AssertFlip first generates tests that pass on the buggy behaviour and then inverts them so that they fail while the bug is present.
Our results show that AssertFlip outperforms all known techniques on the leaderboard of SWT-Bench, a benchmark curated for BRTs.
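The abstract names the inversion step but not its mechanics; a minimal, purely illustrative sketch of the idea is to negate the assertion calls of a passing test syntactically. The negation table and example below are our assumptions, not AssertFlip's implementation.

```python
import ast  # Python 3.9+ for ast.unparse

# Hypothetical assertion negation table; the paper's actual inversion
# logic is not described in the abstract.
FLIP = {"assertEqual": "assertNotEqual",
        "assertTrue": "assertFalse",
        "assertIn": "assertNotIn"}

class AssertionFlipper(ast.NodeTransformer):
    """Rewrite self.assertX(...) calls into their negated counterparts, so a
    test that passes on the buggy code fails while the bug is present."""
    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Attribute) and node.func.attr in FLIP:
            node.func.attr = FLIP[node.func.attr]
        return node

src = "self.assertEqual(parse('1+1'), 3)  # passes on the buggy parser"
print(ast.unparse(AssertionFlipper().visit(ast.parse(src))))
# -> self.assertNotEqual(parse('1+1'), 3)
```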
arXiv Detail & Related papers (2025-07-23T14:19:55Z)
- Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [84.30534714651093]
We present an innovative automated program repair (APR) tool for Dafny, a verification-aware programming language.
We localize faults through a series of steps, including the use of Hoare logic to determine the state of each statement within the program.
We evaluate our approach on DafnyBench, a benchmark of real-world Dafny programs.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
- Black-Box Test Code Fault Localization Driven by Large Language Models and Execution Estimation [7.040370156228408]
We introduce a fully static, LLM-driven approach for system test code fault localization.
Our method uses a single failure execution log to estimate the test's execution trace.
We evaluate our technique at function, block, and line levels using an industrial dataset of faulty test cases.
arXiv Detail & Related papers (2025-06-23T19:04:51Z)
- A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions [9.141981611891715]
DISTINCT is a Description-guided, branch-consistency analysis framework.
It transforms Large Language Model (LLM)-based generators into fault-aware test generators.
It achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR).
arXiv Detail & Related papers (2025-06-09T07:05:48Z)
- Studying the Impact of Early Test Termination Due to Assertion Failure on Code Coverage and Spectrum-based Fault Localization [48.22524837906857]
This is the first empirical study of early test termination due to assertion failure.
We investigated 207 versions of 6 open-source projects.
Our findings indicate that early test termination harms both code coverage and the effectiveness of spectrum-based fault localization.
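The mechanism behind that harm is easy to see: once an assertion raises, the rest of the test body never executes, so the coverage recorded for the failing test, and hence the spectrum fed to SBFL, is truncated. A minimal Python illustration (our own, not the study's tooling):

```python
import sys

def run_with_coverage(test_fn):
    """Run one test while tracing executed lines.  If an assertion fails
    mid-test, every line after it never runs, so the recorded spectrum
    for this failing test is truncated."""
    covered = set()
    def tracer(frame, event, arg):
        if event == "line":
            covered.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        test_fn()
        outcome = "pass"
    except AssertionError:
        outcome = "fail"          # execution (and tracing) stopped early
    finally:
        sys.settrace(None)
    return outcome, covered
```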
arXiv Detail & Related papers (2025-04-06T17:14:09Z)
- Where's the Bug? Attention Probing for Scalable Fault Localization [18.699014321422023]
We present the Bug Attention Probe (BAP), a method that learns state-of-the-art fault localization without any direct localization labels.
BAP is significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.
arXiv Detail & Related papers (2025-02-19T18:59:32Z)
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper addresses the problem of performing both sample recognition and outlier rejection during inference when outliers exist.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Back to the Future! Studying Data Cleanness in Defects4J and its Impact on Fault Localization [3.8040257966829802]
We examine Defects4J's fault-triggering tests, emphasizing the implications of developer knowledge of SBFL techniques.
We found that 55% of the fault-triggering tests were newly added to replicate the bug or to test for regression.
We also found that 22% of the fault-triggering tests were modified after the bug reports were created, containing developer knowledge of the bug.
arXiv Detail & Related papers (2023-10-29T20:19:06Z)
- SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models [12.21559364043576]
SkipAnalyzer is a large language model (LLM)-powered tool for static code analysis.
As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks.
arXiv Detail & Related papers (2023-10-27T23:17:42Z)
- Improving Spectrum-Based Localization of Multiple Faults by Iterative Test Suite Reduction [0.30458514384586394]
We present FLITSR, a novel SBFL extension that improves the localization of a given base metric in the presence of multiple faults.
For all three spectrum types, we consistently see substantial reductions in average wasted effort at different fault levels, of 30%-90% over the best base metric.
For the method-level real faults, FLITSR also substantially outperforms GRACE, a state-of-the-art learning-based fault localizer.
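The summary gives only the outline of the iterative reduction; a rough sketch of that intuition (not the published algorithm, and with our own names) is to commit the top-ranked element, drop the failing tests it explains, and re-rank on the reduced suite:

```python
from math import sqrt

def iterative_rank(coverage, failing):
    """FLITSR-style intuition: rank elements with a base metric (Ochiai
    here), commit the top element as a fault candidate, remove the failing
    tests it covers, then re-rank the remainder on the reduced suite."""
    def ochiai(ef, ep, tf):
        return ef / sqrt(tf * (ef + ep)) if ef else 0.0
    failing = set(failing)
    elements = set().union(*coverage.values())
    ranking = []
    while failing and elements:
        tf = len(failing)
        def susp(m):
            ef = sum(1 for t in failing if m in coverage[t])
            ep = sum(1 for t in coverage if t not in failing and m in coverage[t])
            return ochiai(ef, ep, tf)
        best = max(elements, key=susp)
        ranking.append(best)
        elements.discard(best)
        failing -= {t for t in failing if best in coverage[t]}
    ranking.extend(sorted(elements))   # remaining elements, unprioritized
    return ranking
```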
arXiv Detail & Related papers (2023-06-16T15:00:40Z)
- All Points Matter: Entropy-Regularized Distribution Alignment for Weakly-supervised 3D Segmentation [67.30502812804271]
Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning.
We propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions.
arXiv Detail & Related papers (2023-05-25T08:19:31Z)
- Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction [14.444294152595429]
The number of tests added to open-source repositories due to reported issues was about 28% of the corresponding project test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
arXiv Detail & Related papers (2022-09-23T10:50:47Z)
- Infrared: A Meta Bug Detector [10.541969253100815]
We propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors.
Our evaluation shows our meta bug detector (MBD) is effective in catching a variety of bugs including null pointer dereference, array index out-of-bound, file handle leak, and even data races in concurrent programs.
arXiv Detail & Related papers (2022-09-18T09:08:51Z)
- An Empirical Study on Bug Severity Estimation using Source Code Metrics and Static Analysis [0.8621608193534838]
We study 3,358 buggy methods with different severity labels from 19 Java open-source projects.
Results show that code metrics are useful in predicting buggy code, but they cannot estimate the severity level of the bugs.
Our categorization shows that Security bugs have high severity in most cases while Edge/Boundary faults have low severity.
arXiv Detail & Related papers (2022-06-26T17:07:23Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
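A compact sketch of that thresholding idea as the summary states it, with our own function and variable names:

```python
import numpy as np

def atc_predict_accuracy(src_conf, src_correct, tgt_conf):
    """Fit a confidence threshold on labeled source data so that the
    fraction of source points above it matches source accuracy, then
    report the fraction of unlabeled target points above that threshold."""
    src_acc = np.mean(src_correct)
    t = np.quantile(src_conf, 1.0 - src_acc)   # mean(src_conf > t) ~= src_acc
    return float(np.mean(tgt_conf > t))

# Toy usage with synthetic confidences.
rng = np.random.default_rng(0)
src_conf = rng.uniform(size=1000)
src_correct = src_conf > 0.3                   # toy: confident => correct
print(atc_predict_accuracy(src_conf, src_correct, rng.uniform(size=1000)))
```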
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It combines a biLSTM encoder with a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
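A minimal PyTorch sketch consistent with the abstract's description; the embedding size, hidden width, input tokenization, and pairing head are our assumptions:

```python
import torch
import torch.nn as nn

class S3MLike(nn.Module):
    """Siamese similarity over stack traces: a shared biLSTM encodes each
    trace (a sequence of frame-token ids) and a fully-connected head
    scores the pair."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.head = nn.Sequential(nn.Linear(4 * hidden, hidden),
                                  nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def encode(self, ids):                       # ids: (batch, seq_len)
        _, (h, _) = self.encoder(self.embed(ids))
        return torch.cat([h[0], h[1]], dim=-1)   # both directions' final states

    def forward(self, trace_a, trace_b):
        z = torch.cat([self.encode(trace_a), self.encode(trace_b)], dim=-1)
        return self.head(z)                      # similarity logit per pair
```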
arXiv Detail & Related papers (2021-03-18T21:10:41Z)