Human-in-the-loop online just-in-time software defect prediction
- URL: http://arxiv.org/abs/2308.13707v1
- Date: Fri, 25 Aug 2023 23:40:08 GMT
- Title: Human-in-the-loop online just-in-time software defect prediction
- Authors: Xutong Liu, Yufei Zhou, Yutian Tang, Junyan Qian, Yuming Zhou
- Abstract summary: We propose Human-In-The-Loop (HITL) O-JIT-SDP that integrates feedback from SQA staff to enhance the prediction process.
We also introduce a performance evaluation framework that utilizes a k-fold distributed bootstrap method along with the Wilcoxon signed-rank test.
These advancements hold the potential to significantly enhance the value of O-JIT-SDP for industrial applications.
- Score: 6.35776510153759
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Online Just-In-Time Software Defect Prediction (O-JIT-SDP) uses an online
model to predict whether a new software change will introduce a bug or not.
However, existing studies neglect the interaction between Software Quality Assurance
(SQA) staff and the model, thereby missing the opportunity to improve prediction
accuracy through feedback from SQA staff. To tackle this
problem, we propose Human-In-The-Loop (HITL) O-JIT-SDP that integrates feedback
from SQA staff to enhance the prediction process. Furthermore, we introduce a
performance evaluation framework that utilizes a k-fold distributed bootstrap
method along with the Wilcoxon signed-rank test. This framework facilitates
thorough pairwise comparisons of alternative classification algorithms using a
prequential evaluation approach. Our proposal enables continuous statistical
testing throughout the prequential process, empowering developers to make
real-time decisions based on robust statistical evidence. Through
experimentation across 10 GitHub projects, we demonstrate that our evaluation
framework enhances the credibility of model evaluation, and the incorporation
of HITL feedback elevates the prediction performance of online JIT-SDP models.
These advancements hold the potential to significantly enhance the value of
O-JIT-SDP for industrial applications.
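The abstract outlines two mechanisms: an online JIT-SDP model updated with labels confirmed by SQA staff, and a prequential (test-then-train) evaluation framework that combines a k-fold distributed bootstrap with a Wilcoxon signed-rank test for continuous pairwise comparison of classifiers. The following Python sketch illustrates how such a loop could be wired together; the streaming classifiers, the Poisson(1) bootstrap weights, the simulated SQA feedback, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): prequential evaluation
# of two online JIT-SDP classifiers with simulated SQA feedback, k-fold
# distributed-bootstrap accuracy estimates, and a Wilcoxon signed-rank test.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def prequential_compare(stream, model_a, model_b, k=10, warmup=50, test_every=100):
    """Test-then-train over a stream of (features, label) software changes.

    Returns per-fold accuracy estimates for both models and the latest
    Wilcoxon signed-rank p-value from the continuous pairwise comparison.
    """
    classes = np.array([0, 1])            # clean / defect-inducing
    hits_a = np.zeros(k)                  # bootstrap-weighted correct predictions
    hits_b = np.zeros(k)
    weights_sum = np.zeros(k)
    p_value = None

    for t, (x, y) in enumerate(stream):
        x = x.reshape(1, -1)

        if t >= warmup:
            # 1) Test: predict the label of the incoming change before training.
            pred_a = model_a.predict(x)[0]
            pred_b = model_b.predict(x)[0]
            # 2) k-fold distributed bootstrap: each fold resamples this example
            #    with a Poisson(1) weight (online-bagging style assumption).
            w = rng.poisson(1.0, size=k)
            hits_a += w * (pred_a == y)
            hits_b += w * (pred_b == y)
            weights_sum += w

        # 3) Human-in-the-loop step: SQA staff inspect the change and return a
        #    confirmed label; the ground-truth label stands in for real feedback.
        confirmed = y

        # 4) Train: update both online models with the confirmed label.
        model_a.partial_fit(x, [confirmed], classes=classes)
        model_b.partial_fit(x, [confirmed], classes=classes)

        # 5) Continuous statistical testing across the prequential run.
        if t >= warmup and t % test_every == 0:
            seen = weights_sum > 0
            acc_a = hits_a[seen] / weights_sum[seen]
            acc_b = hits_b[seen] / weights_sum[seen]
            if seen.sum() >= 2 and not np.allclose(acc_a, acc_b):
                p_value = wilcoxon(acc_a, acc_b).pvalue

    return (hits_a / np.maximum(weights_sum, 1),
            hits_b / np.maximum(weights_sum, 1),
            p_value)


# Toy usage on synthetic change metrics (real studies use commit-level features).
X = rng.normal(size=(1000, 14))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
acc_a, acc_b, p = prequential_compare(
    zip(X, y),
    SGDClassifier(loss="log_loss", random_state=1),
    SGDClassifier(loss="hinge", random_state=2),
)
print("fold accuracies A:", np.round(acc_a, 3))
print("fold accuracies B:", np.round(acc_b, 3))
print("latest Wilcoxon p-value:", p)
```

The Poisson(1) resampling follows the online-bagging convention commonly used for bootstrap estimation on data streams; the paper's exact k-fold distributed bootstrap and SQA feedback protocol may differ.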
Related papers
- A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs).
We derive novel metrics with high-probability guarantees concerning the output distribution of a model.
Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
- Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model (LLM) evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Deep Incremental Learning of Imbalanced Data for Just-In-Time Software Defect Prediction [3.2022080692044352]
This work stems from three observations on prior Just-In-Time Software Defect Prediction (JIT-SDP) models.
First, prior studies treat the JIT-SDP problem solely as a classification problem.
Second, prior JIT-SDP studies do not consider that class-balancing preprocessing may change the underlying characteristics of software changeset data.
arXiv Detail & Related papers (2023-10-18T19:42:34Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- IRJIT: A Simple, Online, Information Retrieval Approach for Just-In-Time Software Defect Prediction [10.084626547964389]
Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time.
Current software defect prediction approaches rely on manually crafted features such as change metrics and involve machine learning or deep learning models that are expensive to train.
We propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits.
arXiv Detail & Related papers (2022-10-05T17:54:53Z)
- A Reinforcement Learning Framework for PQoS in a Teleoperated Driving Scenario [18.54699818319184]
We propose the design of a new entity, implemented at the RAN level, that provides PQoS functionalities.
Specifically, we focus on the design of the learning agent's reward function, which converts estimates into appropriate countermeasures when requirements are not satisfied.
We demonstrate via ns-3 simulations that our approach achieves the best trade-off in terms of Quality of Experience (QoE) performance of end users in a teledriving-like scenario.
arXiv Detail & Related papers (2022-02-04T02:59:16Z)
- Test-time Collective Prediction [73.74982509510961]
Multiple parties in machine learning want to jointly make predictions on future test points.
Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters.
We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)