Deep Incremental Learning of Imbalanced Data for Just-In-Time Software
Defect Prediction
- URL: http://arxiv.org/abs/2310.12289v1
- Date: Wed, 18 Oct 2023 19:42:34 GMT
- Title: Deep Incremental Learning of Imbalanced Data for Just-In-Time Software
Defect Prediction
- Authors: Yunhua Zhao, Hui Chen
- Abstract summary: This work stems from three observations on prior Just-In-Time Software Defect Prediction (JIT-SDP) models.
First, prior studies treat the JIT-SDP problem solely as a classification problem.
Second, prior JIT-SDP studies do not consider that class-balancing preprocessing may change the underlying characteristics of software changeset data.
- Score: 3.2022080692044352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work stems from three observations on prior Just-In-Time Software Defect
Prediction (JIT-SDP) models. First, prior studies treat the JIT-SDP problem solely as a
classification problem. Second, prior JIT-SDP studies do not consider that class-balancing
preprocessing may change the underlying characteristics of software changeset data. Third,
prior JIT-SDP incremental learning models address only a single source of concept drift,
namely the evolution of class imbalance.
We propose an incremental learning framework called CPI-JIT for JIT-SDP. First, in addition
to a classification modeling component, the framework includes a time-series forecast
modeling component in order to learn the temporal interdependence among changesets. Second,
the framework features a purposefully designed over-sampling technique based on SMOTE and
principal curves, called SMOTE-PC, which preserves the underlying distribution of software
changeset data.
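As an illustration of the class-balancing step that SMOTE-PC builds on, the following is a minimal sketch of SMOTE-style over-sampling of the minority (defect-inducing) class. The principal-curve constraint that makes SMOTE-PC distribution-preserving is the paper's own technique and is not reproduced here; the function name and parameters are illustrative assumptions.
```python
# Minimal sketch, assuming a NumPy/scikit-learn setting: classic SMOTE-style
# interpolation between minority samples. The principal-curve projection that
# distinguishes SMOTE-PC is NOT shown here.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_minority, n_synthetic, k=5, seed=0):
    """Create synthetic samples between each chosen minority point and one of
    its k nearest minority-class neighbors, as classic SMOTE does."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)           # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))        # pick a minority changeset
        j = rng.choice(idx[i, 1:])               # pick one of its minority neighbors
        lam = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.vstack(synthetic)
```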
Within this framework, we propose an incremental deep neural network model called DeepICP.
Via an evaluation using \numprojs software projects, we show that: 1) SMOTE-PC improves the
model's predictive performance; 2) for some software projects, harnessing the temporal
interdependence of software changesets is beneficial for defect prediction; and 3) principal
curves summarize the underlying distribution of changeset data and reveal a new source of
concept drift that the DeepICP model is designed to adapt to.
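For readers unfamiliar with incremental (online) JIT-SDP, the sketch below shows a test-then-train (prequential) loop over a stream of changesets in commit order. It does not reproduce DeepICP or its architecture; scikit-learn's SGDClassifier stands in as a generic model that supports incremental updates, and the feature representation is left abstract.
```python
# Minimal sketch of online JIT-SDP evaluation: predict each incoming changeset first,
# then update the model with its label (test-then-train). DeepICP itself is a deep
# model not reproduced here; SGDClassifier is only a stand-in with partial_fit support.
import numpy as np
from sklearn.linear_model import SGDClassifier

def prequential_accuracy(stream):
    """stream yields (features, label) pairs; label 1 = defect-inducing, 0 = clean."""
    clf = SGDClassifier()                 # linear classifier with incremental updates
    classes = np.array([0, 1])
    correct = evaluated = 0
    for seen, (x, y) in enumerate(stream):
        x = np.asarray(x).reshape(1, -1)
        if seen > 0:                      # test-then-train: predict before updating
            correct += int(clf.predict(x)[0] == y)
            evaluated += 1
        clf.partial_fit(x, [y], classes=classes)
    return correct / max(evaluated, 1)
```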
Related papers
- Online model error correction with neural networks: application to the
Integrated Forecasting System [0.27930367518472443]
We develop a neural-network-based model error correction for the Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts.
The network is pre-trained offline using a large dataset of operational analyses and analysis increments.
It is then integrated into the IFS within the Object-Oriented Prediction System (OOPS) so as to be used in data assimilation and forecast experiments.
arXiv Detail & Related papers (2024-03-06T13:36:31Z) - Class-Imbalanced Semi-Supervised Learning for Large-Scale Point Cloud
Semantic Segmentation via Decoupling Optimization [64.36097398869774]
Semi-supervised learning (SSL) has been an active research topic for large-scale 3D scene understanding.
The existing SSL-based methods suffer from severe training bias due to class imbalance and long-tail distributions of the point cloud data.
We introduce a new decoupling optimization framework, which disentangles feature representation learning and classifier training in an alternating optimization manner to shift the biased decision boundary effectively.
arXiv Detail & Related papers (2024-01-13T04:16:40Z) - A study on the impact of pre-trained model on Just-In-Time defect
prediction [10.205110163570502]
We build six models: RoBERTaJIT, CodeBERTJIT, BARTJIT, PLBARTJIT, GPT2JIT, and CodeGPTJIT, each with a distinct pre-trained model as its backbone.
We investigate the performance of the models when using Commit code and Commit message as inputs, as well as the relationship between training efficiency and model distribution.
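As a rough illustration of the kind of setup this study describes, the sketch below scores a single commit message with a pre-trained CodeBERT backbone and a two-class head, using the HuggingFace transformers library. It is not the cited paper's implementation; the checkpoint, example message, and label order are assumptions, and the classification head is randomly initialised and would need fine-tuning on labelled changesets.
```python
# Rough sketch, assuming HuggingFace transformers: a pre-trained CodeBERT backbone
# with an (untrained) two-class head applied to one commit message.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)     # head needs fine-tuning before use

commit_message = "Fix missing null check in the session handler"   # hypothetical input
inputs = tokenizer(commit_message, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits              # shape (1, 2): e.g. [clean, defect-inducing]
print(logits.softmax(dim=-1))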
arXiv Detail & Related papers (2023-09-05T15:34:22Z) - Human-in-the-loop online just-in-time software defect prediction [6.35776510153759]
We propose Human-In-The-Loop (HITL) O-JIT-SDP that integrates feedback from SQA staff to enhance the prediction process.
We also introduce a performance evaluation framework that utilizes a k-fold distributed bootstrap method along with the Wilcoxon signed-rank test.
These advancements hold the potential to significantly enhance the value of O-JIT-SDP for industrial applications.
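As a small illustration of the statistical comparison involved, the sketch below applies the Wilcoxon signed-rank test to paired per-fold scores of two models using SciPy. The k-fold distributed bootstrap that would produce such scores is specific to the cited paper and is not reproduced; the score arrays are placeholders.
```python
# Small sketch, assuming SciPy: paired comparison of two models' per-fold scores
# with the Wilcoxon signed-rank test. The scores below are placeholder values.
from scipy.stats import wilcoxon

scores_model_a = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.70, 0.71, 0.69]
scores_model_b = [0.66, 0.67, 0.69, 0.65, 0.68, 0.70, 0.66, 0.67, 0.68, 0.66]

stat, p_value = wilcoxon(scores_model_a, scores_model_b)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
```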
arXiv Detail & Related papers (2023-08-25T23:40:08Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z) - Online learning techniques for prediction of temporal tabular datasets
with regime changes [0.0]
We propose a modular machine learning pipeline for ranking predictions on temporal panel datasets.
The modularity of the pipeline allows the use of different models, including Gradient Boosting Decision Trees (GBDTs) and Neural Networks.
Online learning techniques, which require no retraining of models, can be used post-prediction to enhance the results.
arXiv Detail & Related papers (2022-12-30T17:19:00Z) - Parallel Successive Learning for Dynamic Distributed Model Training over
Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z) - Learning Neural Models for Natural Language Processing in the Face of
Distributional Shift [10.990447273771592]
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications.
It builds upon the assumption that the data distribution is stationary, i.e., that the data is sampled from a fixed distribution both at training and test time.
This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information.
It is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime.
arXiv Detail & Related papers (2021-09-03T14:29:20Z) - Federated Learning with Unreliable Clients: Performance Analysis and
Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.