When Less is More: On the Value of "Co-training" for Semi-Supervised
Software Defect Predictors
- URL: http://arxiv.org/abs/2211.05920v2
- Date: Thu, 15 Feb 2024 18:51:53 GMT
- Title: When Less is More: On the Value of "Co-training" for Semi-Supervised
Software Defect Predictors
- Authors: Suvodeep Majumder, Joymallya Chakraborty and Tim Menzies
- Abstract summary: This paper applies a wide range of 55 semi-supervised learners to over 714 projects.
We find that semi-supervised "co-training methods" work significantly better than other approaches.
- Score: 15.862838836160634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labeling a module defective or non-defective is an expensive task. Hence,
there are often limits on how much labeled data is available for training.
Semi-supervised classifiers use far fewer labels for training models. However,
there are numerous semi-supervised methods, including self-labeling,
co-training, maximal-margin, and graph-based methods, to name a few. Only a
handful of these methods have been tested in SE for (e.g.) predicting defects
and even there, those methods have been tested on just a handful of projects.
This paper applies a wide range of 55 semi-supervised learners to over 714
projects. We find that semi-supervised "co-training methods" work significantly
better than other approaches. Specifically, after labeling just 2.5% of the
data, these methods make predictions that are competitive with those using 100%
of the data.
That said, co-training needs to be used cautiously since the choice of
co-training method must be carefully matched to a user's specific goals. Also,
we warn that a commonly-used co-training method
("multi-view"-- where different learners get different sets of columns) does
not improve predictions, while greatly increasing runtime (11 hours vs. 1.8
hours).
It is an open question, worthy of future work, to test if these reductions
can be seen in other areas of software analytics. To assist with exploring
other areas, all the code used is available at
https://github.com/ai-se/Semi-Supervised.
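To make the co-training idea concrete, below is a minimal sketch of a generic
two-learner co-training loop in Python with scikit-learn. The learner choices,
confidence threshold, and per-round budget are illustrative assumptions, not
the paper's configuration; the authors' actual implementation lives in the
repository linked above.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

def cotrain(X_lab, y_lab, X_unlab, rounds=10, per_round=20, threshold=0.9):
    """Generic two-learner co-training sketch: each learner pseudo-labels its
    most confident unlabeled modules for the other, starting from a small
    labeled pool (e.g. the ~2.5% of data discussed above)."""
    learners = [RandomForestClassifier(random_state=0), GaussianNB()]
    X_pool = [np.array(X_lab), np.array(X_lab)]   # per-learner labeled pools
    y_pool = [np.array(y_lab), np.array(y_lab)]
    remaining = np.array(X_unlab)                 # still-unlabeled modules

    for _ in range(rounds):
        if len(remaining) == 0:
            break
        for clf, X, y in zip(learners, X_pool, y_pool):
            clf.fit(X, y)
        for i, clf in enumerate(learners):
            if len(remaining) == 0:
                break
            proba = clf.predict_proba(remaining)
            conf = proba.max(axis=1)
            # keep only the most confident predictions above the threshold
            picks = np.argsort(-conf)[:per_round]
            picks = picks[conf[picks] >= threshold]
            if len(picks) == 0:
                continue
            pseudo = clf.classes_[proba[picks].argmax(axis=1)]
            other = 1 - i                         # feed the *other* learner
            X_pool[other] = np.vstack([X_pool[other], remaining[picks]])
            y_pool[other] = np.concatenate([y_pool[other], pseudo])
            remaining = np.delete(remaining, picks, axis=0)

    # The "multi-view" variant would instead give each learner a different
    # subset of columns; the paper reports this adds runtime (11 vs. 1.8 hours)
    # without improving predictions.
    return learners
```
In the paper's setting, X_lab/y_lab would hold the small labeled sample (about
2.5% of modules and their defect labels) and X_unlab the remaining unlabeled
modules.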
Related papers
- One-bit Supervision for Image Classification: Problem, Solution, and
Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that using full-bit, semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z) - Adaptive Self-Training for Object Detection [13.07105239116411]
We introduce our method Self-Training for Object Detection (ASTOD)
ASTOD determines, without cost, a threshold value based directly on the ground value of the score histogram.
We use different views of the unlabeled images during the pseudo-labeling step to reduce the number of missed predictions.
arXiv Detail & Related papers (2022-12-07T15:10:40Z) - An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code by only using off-the-shelf operations.
arXiv Detail & Related papers (2022-09-28T02:11:34Z) - Learning with Proper Partial Labels [87.65718705642819]
Partial-label learning is a kind of weakly-supervised learning with inexact labels.
We show that this proper partial-label learning framework includes many previous partial-label learning settings.
We then derive a unified unbiased estimator of the classification risk.
arXiv Detail & Related papers (2021-12-23T01:37:03Z) - FRUGAL: Unlocking SSL for Software Analytics [17.63040340961143]
Unsupervised learning is a promising direction to learn hidden patterns within unlabelled data.
We present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme.
arXiv Detail & Related papers (2021-08-22T21:15:27Z) - OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set
Unlabeled Data [65.19205979542305]
Unlabeled data may include out-of-class samples in practice.
OpenCoS is a method for handling this realistic semi-supervised learning scenario.
arXiv Detail & Related papers (2021-06-29T06:10:05Z) - Towards optimally abstaining from prediction [22.937799541125607]
A common challenge across all areas of machine learning is that training data is not distributed like test data.
We consider a model where one may abstain from predicting, at a fixed cost.
Our work builds on a recent abstention algorithm of Goldwasser, Kalais, and Montasser (2020) for transductive binary classification.
arXiv Detail & Related papers (2021-05-28T21:44:48Z) - Towards Good Practices for Efficiently Annotating Large-Scale Image
Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k-image subset of ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z) - Semi-Supervised Learning for Sparsely-Labeled Sequential Data:
Application to Healthcare Video Processing [0.8312466807725921]
We propose a semi-supervised machine learning training strategy to improve event detection performance on sequential data.
Our method uses noisy guesses of the events' end times to train event detection models.
We show that our strategy outperforms conservative estimates by 12 points of mean average precision for MNIST, and 3.5 points for CIFAR.
arXiv Detail & Related papers (2020-11-28T09:54:44Z) - Don't Wait, Just Weight: Improving Unsupervised Representations by
Learning Goal-Driven Instance Weights [92.16372657233394]
Self-supervised learning techniques can boost performance by learning useful representations from unlabelled data.
We show that by learning Bayesian instance weights for the unlabelled data, we can improve the downstream classification accuracy.
Our method, BetaDataWeighter is evaluated using the popular self-supervised rotation prediction task on STL-10 and Visual Decathlon.
arXiv Detail & Related papers (2020-06-22T15:59:32Z)