From Illusion to Insight: Change-Aware File-Level Software Defect Prediction Using Agentic AI
- URL: http://arxiv.org/abs/2512.23875v1
- Date: Mon, 29 Dec 2025 21:32:29 GMT
- Title: From Illusion to Insight: Change-Aware File-Level Software Defect Prediction Using Agentic AI
- Authors: Mohsen Hesamolhokama, Behnam Rohani, Amirahmad Shafiee, MohammadAmin Fazli, Jafar Habibi,
- Abstract summary: Much of the reported progress in file-level software defect prediction (SDP) is, in reality, nothing but an illusion of accuracy. We reformulate SDP as a change-aware prediction task, in which models reason over code changes of a file within successive project versions. Our experiments on multiple PROMISE projects show that traditional models achieve inflated F1, while failing on rare but critical defect-transition cases.
- Score: 2.8583947164719348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much of the reported progress in file-level software defect prediction (SDP) is, in reality, nothing but an illusion of accuracy. Over the last decades, machine learning and deep learning models have reported increasing performance across software versions. However, since most files persist across releases and retain their defect labels, standard evaluation rewards label-persistence bias rather than reasoning about code changes. To address this issue, we reformulate SDP as a change-aware prediction task, in which models reason over code changes of a file within successive project versions, rather than relying on static file snapshots. Building on this formulation, we propose an LLM-driven, change-aware, multi-agent debate framework. Our experiments on multiple PROMISE projects show that traditional models achieve inflated F1, while failing on rare but critical defect-transition cases. In contrast, our change-aware reasoning and multi-agent debate framework yields more balanced performance across evolution subsets and significantly improves sensitivity to defect introductions. These results highlight fundamental flaws in current SDP evaluation practices and emphasize the need for change-aware reasoning in practical defect prediction. The source code is publicly available.
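The abstract's core claim can be made concrete with a small, self-contained illustration (not the paper's code, and with invented file names and labels): a trivial baseline that simply copies each file's defect label from the previous release scores a respectable F1 overall, yet fails completely on the files whose label actually changed between releases.
```python
from sklearn.metrics import f1_score

# (file, label in release N-1, label in release N); 1 = defective
history = [
    ("core/Parser.java", 1, 1),
    ("core/Lexer.java",  0, 0),
    ("util/Cache.java",  0, 0),
    ("util/Pool.java",   1, 1),
    ("net/Client.java",  0, 1),  # defect introduced: the rare, hard case
    ("net/Server.java",  1, 0),  # defect fixed
]

prev  = [h[1] for h in history]   # labels in the previous release
truth = [h[2] for h in history]   # labels in the current release
pred  = prev                      # "persistence" baseline: repeat the old label

transition = [i for i, h in enumerate(history) if h[1] != h[2]]

print("F1 on all files:        %.2f" % f1_score(truth, pred, zero_division=0))
print("F1 on transition files: %.2f" % f1_score(
    [truth[i] for i in transition],
    [pred[i] for i in transition], zero_division=0))
```
The gap between those two numbers is exactly the label-persistence effect the paper argues standard evaluation rewards.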
Related papers
- PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise [60.63315470285562]
MiniTruePrefixes is a novel specialized model that better detects factual inconsistencies over text prefixes. We show that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization.
arXiv Detail & Related papers (2025-11-03T09:07:44Z)
- Adapting Language Balance in Code-Switching Speech [60.296574524609575]
Large foundational models still struggle against code-switching test cases. We use differentiable surrogates to mitigate context bias during generation. Experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more accurately.
arXiv Detail & Related papers (2025-10-21T15:23:55Z)
- Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset [0.0]
We present ReDef, a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than existing resources.
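A rough sketch of the revert-anchoring idea (an illustration only, not the ReDef pipeline; the repository path and the reliance on Git's default revert message are assumptions): commits later undone by a `git revert` are treated as defect-introducing candidates, while the remaining commits would still need the post-hoc history checks the abstract mentions.
```python
import re
import subprocess

def reverted_commits(repo: str) -> set[str]:
    """Hashes named in Git's default 'This reverts commit <sha>.' messages."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--grep=This reverts commit", "--pretty=%B"],
        capture_output=True, text=True, check=True).stdout
    return set(re.findall(r"This reverts commit ([0-9a-f]{40})", out))

def label_commits(repo: str) -> dict[str, str]:
    bad = reverted_commits(repo)
    shas = subprocess.run(["git", "-C", repo, "log", "--pretty=%H"],
                          capture_output=True, text=True, check=True).stdout.split()
    # Non-reverted commits are only *candidate* clean cases; ReDef additionally
    # validates them with post-hoc history checks, which are omitted here.
    return {sha: ("defective" if sha in bad else "clean-candidate") for sha in shas}
```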
arXiv Detail & Related papers (2025-09-11T07:07:11Z)
- Bug Destiny Prediction in Large Open-Source Software Repositories through Sentiment Analysis and BERT Topic Modeling [3.481985817302898]
We leverage features available before a bug is resolved to enhance predictive accuracy. Our methodology incorporates sentiment analysis to derive both an emotionality score and a sentiment classification. Results demonstrate that sentiment analysis serves as a valuable predictor of a bug's eventual outcome.
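As an illustration of this kind of feature extraction (NLTK's VADER analyzer is assumed here as a stand-in for the paper's sentiment model, and "emotionality" is defined ad hoc as one minus the neutral mass):
```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()

def sentiment_features(comment_text: str) -> dict:
    scores = analyzer.polarity_scores(comment_text)
    return {
        # "emotionality" here is 1 - neutral mass; the paper's exact definition may differ.
        "emotionality": 1.0 - scores["neu"],
        "sentiment": ("positive" if scores["compound"] > 0.05
                      else "negative" if scores["compound"] < -0.05
                      else "neutral"),
    }

print(sentiment_features("This crash is infuriating and keeps corrupting user data."))
```
Features like these can then be fed to any downstream classifier alongside other pre-resolution attributes of the bug report.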
arXiv Detail & Related papers (2025-04-22T15:18:14Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
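The general shape of such a pipeline can be sketched as follows; this is not EA's system, and `call_llm` plus the change-record fields are placeholders for whatever LLM client and issue data are actually available:
```python
def build_prompt(error_message: str, changes: list) -> str:
    listed = "\n\n".join(
        f"[{i}] {c['commit']} by {c['author']}\n{c['diff']}"
        for i, c in enumerate(changes))
    return (
        "A test started failing with this error:\n"
        f"{error_message}\n\n"
        "These code changes landed since the last passing run:\n"
        f"{listed}\n\n"
        "Reply with the index of the change most likely responsible, "
        "followed by a one-sentence justification.")

def find_culprit(error_message, changes, call_llm):
    # call_llm: any function that sends a prompt to a chat model and returns text.
    return call_llm(build_prompt(error_message, changes))
```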
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- Code Revert Prediction with Graph Neural Networks: A Case Study at J.P. Morgan Chase [10.961209762486684]
Code revert prediction aims to forecast or predict the likelihood of code changes being reverted or rolled back in software development.
Previous methods for code defect detection relied on independent features but ignored relationships between code scripts.
This paper presents a systematic empirical study for code revert prediction that integrates the code import graph with code features.
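A toy sketch of the underlying idea (not the paper's model): augment each file's own features with features aggregated from the files it imports, so the classifier sees import-graph neighborhoods rather than independent rows. The edges, feature values, and labels below are invented, and a single mean-aggregation hop stands in for a proper graph neural network layer.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

files = ["a.py", "b.py", "c.py", "d.py"]
imports = [("a.py", "b.py"), ("a.py", "c.py"), ("d.py", "c.py")]  # importer -> imported

X = np.array([[120, 3],   # per-file features, e.g. [lines changed, past reverts]
              [ 15, 0],
              [ 60, 1],
              [200, 4]], dtype=float)
y = np.array([1, 0, 0, 1])  # 1 = this file's change was reverted

idx = {f: i for i, f in enumerate(files)}
A = np.zeros((len(files), len(files)))
for src, dst in imports:
    A[idx[src], idx[dst]] = 1.0

# One hop of mean neighbor aggregation: the simplest stand-in for a GNN layer.
deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
X_aug = np.hstack([X, (A @ X) / deg])

clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
print(clf.predict_proba(X_aug)[:, 1])  # revert probability per file
```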
arXiv Detail & Related papers (2024-03-14T15:54:29Z)
- IRJIT: A Simple, Online, Information Retrieval Approach for Just-In-Time Software Defect Prediction [10.084626547964389]
Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time.
Current software defect prediction approaches rely on manually crafted features such as change metrics and involve expensive-to-train machine learning or deep learning models.
We propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits.
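A compact sketch in the spirit of that idea (not the authors' implementation): index the text of past commits with TF-IDF and label a new commit by majority vote over its most similar past commits. The example commits are invented and the choice of `k` is arbitrary.
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_commits = [
    ("if (ptr != NULL) free(ptr); free(ptr);", "buggy"),
    ("add null check before dereference",      "clean"),
    ("off by one in loop bound i <= n",        "buggy"),
    ("rename variable for readability",        "clean"),
]
texts, labels = zip(*past_commits)

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
index = vectorizer.fit_transform(texts)

def classify(new_commit_text: str, k: int = 3) -> str:
    sims = cosine_similarity(vectorizer.transform([new_commit_text]), index).ravel()
    votes = [labels[i] for i in np.argsort(sims)[::-1][:k]]
    return max(set(votes), key=votes.count)  # majority vote over the k nearest

print(classify("loop bound uses i <= length instead of i < length"))
```
Because the "index" is just stored past commits, new labeled commits can be appended online without retraining a model, which is the appeal of the retrieval framing.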
arXiv Detail & Related papers (2022-10-05T17:54:53Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
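The control flow behind this kind of early exit can be sketched in a few lines (random toy weights, a naive max-probability confidence measure, and no calibration, so this is only the skeleton of the idea):
```python
import numpy as np

rng = np.random.default_rng(0)
LAYERS = [np.eye(16) + 0.1 * rng.standard_normal((16, 16)) for _ in range(12)]
READOUT = rng.standard_normal((16, 100))  # hidden state -> vocabulary logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(hidden, threshold=0.9):
    """Return (token, number of layers actually used) for one timestep."""
    for depth, w in enumerate(LAYERS, start=1):
        hidden = np.tanh(hidden @ w)
        probs = softmax(hidden @ READOUT)
        if probs.max() >= threshold:   # confident enough: exit early
            break
    return int(probs.argmax()), depth

token, used = decode_step(rng.standard_normal(16))
print(f"emitted token {token} after {used}/{len(LAYERS)} layers")
```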
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
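How such a ranker is used at evaluation time can be shown with a mocked scorer (the real ranker is a learned model that never executes the candidates; the oracle below runs code only to check the final answer):
```python
def pass_at_1(samples, score, is_correct):
    """Rank by the execution-free score and check only the top sample."""
    best = max(samples, key=score)
    return 1.0 if is_correct(best) else 0.0

samples = ["def add(a, b): return a - b",
           "def add(a, b): return a + b"]

def mock_score(src):   # stand-in for the learned fault-aware ranker
    return 0.9 if "+" in src else 0.2

def check(src):        # evaluation-time oracle; the ranker itself never runs code
    ns = {}
    exec(src, ns)
    return ns["add"](2, 3) == 5

print(pass_at_1(samples, mock_score, check))  # 1.0
```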
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Graph-Based Machine Learning Improves Just-in-Time Defect Prediction [0.38073142980732994]
We use graph-based machine learning to improve Just-In-Time (JIT) defect prediction.
We show that our best model can predict whether or not a code change will lead to a defect with an F1 score as high as 77.55%.
This represents a 152% higher F1 score and a 3% higher MCC over the state-of-the-art JIT defect prediction.
arXiv Detail & Related papers (2021-10-11T16:00:02Z)
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, a major source of system information for troubleshooting.
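One common way to realize such a framework, sketched here under assumptions (an off-the-shelf sentence encoder and a simple centroid-distance score, not necessarily the paper's method): embed log lines with a pre-trained model and flag those far from the centroid of known-normal logs.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, not the paper's

normal_logs = [
    "INFO request served in 12ms",
    "INFO connection opened to db-primary",
    "INFO cache hit for key user:42",
]
centroid = model.encode(normal_logs, normalize_embeddings=True).mean(axis=0)
centroid /= np.linalg.norm(centroid)

def anomaly_score(line: str) -> float:
    emb = model.encode([line], normalize_embeddings=True)[0]
    return 1.0 - float(emb @ centroid)  # cosine distance to the "normal" centroid

print(anomaly_score("ERROR segmentation fault in worker-3, restarting"))
```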
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.