When Less is More: On the Value of "Co-training" for Semi-Supervised
Software Defect Predictors
- URL: http://arxiv.org/abs/2211.05920v2
- Date: Thu, 15 Feb 2024 18:51:53 GMT
- Title: When Less is More: On the Value of "Co-training" for Semi-Supervised
Software Defect Predictors
- Authors: Suvodeep Majumder, Joymallya Chakraborty and Tim Menzies
- Abstract summary: This paper applies a wide range of 55 semi-supervised learners to over 714 projects.
We find that semi-supervised "co-training methods" work significantly better than other approaches.
- Score: 15.862838836160634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labeling a module defective or non-defective is an expensive task. Hence,
there are often limits on how much labeled data is available for training.
Semi-supervised classifiers use far fewer labels for training models. However,
there are numerous semi-supervised methods, including self-labeling,
co-training, maximal-margin, and graph-based methods, to name a few. Only a
handful of these methods have been tested in SE for (e.g.) predicting defects
and even there, those methods have been tested on just a handful of projects.
This paper applies a wide range of 55 semi-supervised learners to over 714
projects. We find that semi-supervised "co-training methods" work significantly
better than other approaches. Specifically, after labeling just 2.5% of the
data, these methods make predictions that are competitive with those using 100%
of the data.
That said, co-training needs to be used cautiously since the choice of
co-training method must be carefully matched to a user's specific goals. Also,
we warn that a commonly-used co-training method
("multi-view"-- where different learners get different sets of columns) does
not improve predictions, while greatly increasing runtime (11 hours vs. 1.8
hours).
It is an open question, worthy of future work, to test if these reductions
can be seen in other areas of software analytics. To assist with exploring
other areas, all the code used is available at
https://github.com/ai-se/Semi-Supervised.
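To make the co-training idea concrete, below is a minimal sketch of a generic
two-learner co-training loop in Python with scikit-learn. The learner choices,
confidence threshold, and per-round budget are illustrative assumptions, not
the paper's configuration; the authors' actual implementation lives in the
repository linked above.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

def cotrain(X_lab, y_lab, X_unlab, rounds=10, per_round=20, threshold=0.9):
    """Generic two-learner co-training sketch: each learner pseudo-labels its
    most confident unlabeled modules for the other, starting from a small
    labeled pool (e.g. the ~2.5% of data discussed above)."""
    learners = [RandomForestClassifier(random_state=0), GaussianNB()]
    X_pool = [np.array(X_lab), np.array(X_lab)]   # per-learner labeled pools
    y_pool = [np.array(y_lab), np.array(y_lab)]
    remaining = np.array(X_unlab)                 # still-unlabeled modules

    for _ in range(rounds):
        if len(remaining) == 0:
            break
        for clf, X, y in zip(learners, X_pool, y_pool):
            clf.fit(X, y)
        for i, clf in enumerate(learners):
            if len(remaining) == 0:
                break
            proba = clf.predict_proba(remaining)
            conf = proba.max(axis=1)
            # keep only the most confident predictions above the threshold
            picks = np.argsort(-conf)[:per_round]
            picks = picks[conf[picks] >= threshold]
            if len(picks) == 0:
                continue
            pseudo = clf.classes_[proba[picks].argmax(axis=1)]
            other = 1 - i                         # feed the *other* learner
            X_pool[other] = np.vstack([X_pool[other], remaining[picks]])
            y_pool[other] = np.concatenate([y_pool[other], pseudo])
            remaining = np.delete(remaining, picks, axis=0)

    # The "multi-view" variant would instead give each learner a different
    # subset of columns; the paper reports this adds runtime (11 vs. 1.8 hours)
    # without improving predictions.
    return learners
```
In the paper's setting, X_lab/y_lab would hold the small labeled sample (about
2.5% of modules and their defect labels) and X_unlab the remaining unlabeled
modules.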
Related papers
- One-bit Supervision for Image Classification: Problem, Solution, and
Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that using full-bit, semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z) - Adaptive Self-Training for Object Detection [13.07105239116411]
We introduce our method Self-Training for Object Detection (ASTOD)
ASTOD determines, without cost, a threshold value based directly on the ground value of the score histogram.
We use different views of the unlabeled images during the pseudo-labeling step to reduce the number of missed predictions.
arXiv Detail & Related papers (2022-12-07T15:10:40Z) - An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code by only using off-the-shelf operations.
arXiv Detail & Related papers (2022-09-28T02:11:34Z) - Learning with Proper Partial Labels [87.65718705642819]
Partial-label learning is a kind of weakly-supervised learning with inexact labels.
We show that this proper partial-label learning framework includes many previous partial-label learning settings.
We then derive a unified unbiased estimator of the classification risk.
arXiv Detail & Related papers (2021-12-23T01:37:03Z) - FRUGAL: Unlocking SSL for Software Analytics [17.63040340961143]
Unsupervised learning is a promising direction to learn hidden patterns within unlabelled data.
We present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme.
arXiv Detail & Related papers (2021-08-22T21:15:27Z) - OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set
Unlabeled Data [65.19205979542305]
Unlabeled data may include out-of-class samples in practice.
OpenCoS is a method for handling this realistic semi-supervised learning scenario.
arXiv Detail & Related papers (2021-06-29T06:10:05Z) - Towards optimally abstaining from prediction [22.937799541125607]
A common challenge across all areas of machine learning is that training data is not distributed like test data.
We consider a model where one may abstain from predicting, at a fixed cost.
Our work builds on a recent abstention algorithm of Goldwasser, Kalais, and Montasser (2020) for transductive binary classification.
arXiv Detail & Related papers (2021-05-28T21:44:48Z) - Towards Good Practices for Efficiently Annotating Large-Scale Image
Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z) - Semi-Supervised Learning for Sparsely-Labeled Sequential Data:
Application to Healthcare Video Processing [0.8312466807725921]
We propose a semi-supervised machine learning training strategy to improve event detection performance on sequential data.
Our method uses noisy guesses of the events' end times to train event detection models.
We show that our strategy outperforms conservative estimates by 12 points of mean average precision for MNIST, and 3.5 points for CIFAR.
arXiv Detail & Related papers (2020-11-28T09:54:44Z) - Don't Wait, Just Weight: Improving Unsupervised Representations by
Learning Goal-Driven Instance Weights [92.16372657233394]
Self-supervised learning techniques can boost performance by learning useful representations from unlabelled data.
We show that by learning Bayesian instance weights for the unlabelled data, we can improve the downstream classification accuracy.
Our method, BetaDataWeighter is evaluated using the popular self-supervised rotation prediction task on STL-10 and Visual Decathlon.
arXiv Detail & Related papers (2020-06-22T15:59:32Z)