Deep Incremental Learning of Imbalanced Data for Just-In-Time Software
Defect Prediction
- URL: http://arxiv.org/abs/2310.12289v1
- Date: Wed, 18 Oct 2023 19:42:34 GMT
- Title: Deep Incremental Learning of Imbalanced Data for Just-In-Time Software
Defect Prediction
- Authors: Yunhua Zhao, Hui Chen
- Abstract summary: This work stems from three observations on prior Just-In-Time Software Defect Prediction (JIT-SDP) models.
First, prior studies treat the JIT-SDP problem solely as a classification problem.
Second, prior JIT-SDP studies do not consider that class balancing processing may change the underlying characteristics of software changeset data.
- Score: 3.2022080692044352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work stems from three observations on prior Just-In-Time Software Defect
Prediction (JIT-SDP) models. First, prior studies treat the JIT-SDP problem
solely as a classification problem. Second, prior JIT-SDP studies do not
consider that class balancing processing may change the underlying
characteristics of software changeset data. Third, only a single source of
concept drift, the evolution of class imbalance, is addressed in prior JIT-SDP
incremental learning models.
We propose an incremental learning framework called CPI-JIT for JIT-SDP.
First, in addition to a classification modeling component, the framework
includes a time-series forecast modeling component in order to learn temporally
interdependent relationships among the changesets. Second, the framework features a
purposefully designed over-sampling balancing technique based on SMOTE and
Principal Curves called SMOTE-PC. SMOTE-PC preserves the underlying
distribution of software changeset data.
In this framework, we propose an incremental deep neural network model called
DeepICP. Via an evaluation using \numprojs software projects, we show that: 1)
SMOTE-PC improves the model's predictive performance; 2) for some software
projects, it can be beneficial for defect prediction to harness the temporally
interdependent relationships of software changesets; and 3) principal curves
summarize the underlying distribution of changeset data and reveal a new
source of concept drift that the DeepICP model is proposed to adapt to.
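The abstract says SMOTE-PC builds on SMOTE but does not describe the principal-curve step, so the following is only a minimal sketch of plain SMOTE-style over-sampling, not the authors' SMOTE-PC. The function name `smote`, its parameters, and the toy feature rows are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by SMOTE-style interpolation.

    Plain SMOTE only: the principal-curve constraint of SMOTE-PC is NOT
    implemented here, because the abstract does not specify it.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base row.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position on the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical defect-inducing changesets: [lines added, files touched].
defective = [[120.0, 4.0], [200.0, 6.0], [150.0, 5.0], [90.0, 3.0]]
new_rows = smote(defective, n_new=6, k=2)
print(len(new_rows))
```

Each synthetic row lies on a line segment between two existing minority rows, so it stays inside their bounding box. According to the abstract, SMOTE-PC additionally aims to preserve the underlying distribution of the changeset data, which this sketch does not attempt.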
Related papers
- Feature Importance in the Context of Traditional and Just-In-Time Software Defect Prediction Models [5.1868909177638125]
This study developed defect prediction models incorporating the traditional and the Just-In-Time approaches from the publicly available dataset of the Apache Camel project.
A multi-layer deep learning algorithm was applied to these datasets in comparison with machine learning algorithms.
The deep learning algorithm achieved accuracies of 80% and 86%, with area under the receiver operating characteristic curve (AUC) scores of 66% and 78%, for traditional and Just-In-Time defect prediction, respectively.
arXiv Detail & Related papers (2024-11-07T22:49:39Z)
- Online model error correction with neural networks: application to the Integrated Forecasting System [0.27930367518472443]
We develop a model error correction for the European Centre for Medium-Range Weather Forecasts using a neural network.
The network is pre-trained offline using a large dataset of operational analyses and analysis increments.
It is then integrated into the IFS within the Object-Oriented Prediction System (OOPS) so as to be used in data assimilation and forecast experiments.
arXiv Detail & Related papers (2024-03-06T13:36:31Z)
- Class-Imbalanced Semi-Supervised Learning for Large-Scale Point Cloud Semantic Segmentation via Decoupling Optimization [64.36097398869774]
Semi-supervised learning (SSL) has been an active research topic for large-scale 3D scene understanding.
The existing SSL-based methods suffer from severe training bias due to class imbalance and long-tail distributions of the point cloud data.
We introduce a new decoupling optimization framework, which disentangles feature representation learning and classifier in an alternative optimization manner to shift the bias decision boundary effectively.
arXiv Detail & Related papers (2024-01-13T04:16:40Z)
- A study on the impact of pre-trained model on Just-In-Time defect prediction [10.205110163570502]
We build six models: RoBERTaJIT, CodeBERTJIT, BARTJIT, PLBARTJIT, GPT2JIT, and CodeGPTJIT, each with a distinct pre-trained model as its backbone.
We investigate the performance of the models when using Commit code and Commit message as inputs, as well as the relationship between training efficiency and model distribution.
arXiv Detail & Related papers (2023-09-05T15:34:22Z)
- Human-in-the-loop online just-in-time software defect prediction [6.35776510153759]
We propose Human-In-The-Loop (HITL) O-JIT-SDP that integrates feedback from SQA staff to enhance the prediction process.
We also introduce a performance evaluation framework that utilizes a k-fold distributed bootstrap method along with the Wilcoxon signed-rank test.
These advancements hold the potential to significantly enhance the value of O-JIT-SDP for industrial applications.
arXiv Detail & Related papers (2023-08-25T23:40:08Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z)
- Online learning techniques for prediction of temporal tabular datasets with regime changes [0.0]
We propose a modular machine learning pipeline for ranking predictions on temporal panel datasets.
The modularity of the pipeline allows the use of different models, including Gradient Boosting Decision Trees (GBDTs) and Neural Networks.
Online learning techniques, which require no retraining of models, can be used post-prediction to enhance the results.
arXiv Detail & Related papers (2022-12-30T17:19:00Z)
- Parallel Successive Learning for Dynamic Distributed Model Training over Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)