FRUGAL: Unlocking SSL for Software Analytics
- URL: http://arxiv.org/abs/2108.09847v1
- Date: Sun, 22 Aug 2021 21:15:27 GMT
- Title: FRUGAL: Unlocking SSL for Software Analytics
- Authors: Huy Tu and Tim Menzies
- Abstract summary: Unsupervised learning is a promising direction to learn hidden patterns within unlabelled data.
We present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme.
- Score: 17.63040340961143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard software analytics often requires a large amount of labelled data
in order to commission models with acceptable performance. However, prior work has
shown that such labelling can be expensive, taking several weeks for thousands of
commits, and that labels are not always available when moving to new research problems
and domains. Unsupervised learning is a promising direction for learning hidden
patterns within unlabelled data, though it has only been extensively studied in defect
prediction. Nevertheless, unsupervised learning can be ineffective by itself and has
not been explored in other domains (e.g., static analysis and issue close time).
Motivated by this literature gap and these technical limitations, we present FRUGAL,
a tuned semi-supervised method built on a simple optimization scheme that requires
neither sophisticated methods (e.g., deep learners) nor expensive ones (e.g., 100%
manually labelled data). FRUGAL optimizes the unsupervised learner's configurations
(via a simple grid search) while validating our design decision of labelling just
2.5% of the data before prediction.
As shown by the experiments in this paper, FRUGAL outperforms the state-of-the-art
adoptable static code warning recognizer and issue close time predictor, while
reducing the cost of labelling by a factor of 40 (from 100% to 2.5%). Hence we assert
that FRUGAL can save considerable effort in data labelling, especially when validating
prior work or researching new problems. Based on this work, we suggest that proponents
of complex and expensive methods should always baseline such methods against simpler
and cheaper alternatives. For instance, a semi-supervised learner like FRUGAL can
serve as a baseline for state-of-the-art software analytics.
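The abstract describes FRUGAL as grid-searching an unsupervised learner's configurations while labelling only 2.5% of the data. The sketch below illustrates that loop under stated assumptions; the KMeans stand-in for the unsupervised learner, the configuration grid, the synthetic data, and the F1 tuning metric are not taken from the paper.

```python
# Minimal sketch of a FRUGAL-style loop: label ~2.5% of the data, then grid-search
# an unsupervised learner's configuration against that small labelled subset.
# Assumptions (not from the paper): KMeans as the unsupervised learner, the
# cluster-count grid, synthetic data, and F1 as the tuning metric.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Step 1: spend the labelling budget on just 2.5% of the data.
budget = int(0.025 * len(X))
labelled_idx = rng.choice(len(X), size=budget, replace=False)

def score_config(n_clusters: int) -> float:
    """Fit the unsupervised learner, then name clusters with the few labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    clusters = km.labels_
    # Each cluster takes the majority label among its labelled members (0 if none).
    mapping = {}
    for c in range(n_clusters):
        members = [i for i in labelled_idx if clusters[i] == c]
        mapping[c] = int(round(y[members].mean())) if members else 0
    preds = np.array([mapping[c] for c in clusters])
    # Tune only against the labelled 2.5%; everything else stays unlabelled.
    return f1_score(y[labelled_idx], preds[labelled_idx])

# Step 2: a simple grid search over the unsupervised learner's configurations.
grid = [2, 4, 8, 16]
best = max(grid, key=score_config)
print(f"best n_clusters={best}, F1 on labelled subset={score_config(best):.2f}")
```

The design choice mirrored here is that only the small labelled subset ever guides tuning; in a real evaluation, held-out test labels would be reserved for the final comparison.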
Related papers
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
- Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics [31.13621632964345]
Semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data.
This paper argues that such "strong" algorithms perform better than standard, weaker SSL algorithms.
arXiv Detail & Related papers (2023-02-03T20:59:09Z)
- When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors [15.862838836160634]
This paper applies a wide range of 55 semi-supervised learners to over 714 projects.
We find that semi-supervised "co-training methods" work significantly better than other approaches.
arXiv Detail & Related papers (2022-11-10T23:39:12Z)
- An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code using only off-the-shelf operations.
arXiv Detail & Related papers (2022-09-28T02:11:34Z)
- MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL).
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
arXiv Detail & Related papers (2022-09-26T12:04:49Z)
- Interpolation-based Contrastive Learning for Few-Label Semi-Supervised Learning [43.51182049644767]
Semi-supervised learning (SSL) has long been proven to be an effective technique for constructing powerful models with limited labels.
Regularization-based methods which force the perturbed samples to have similar predictions with the original ones have attracted much attention.
We propose a novel contrastive loss to guide the embedding of the learned network to change linearly between samples.
arXiv Detail & Related papers (2022-02-24T06:00:05Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z)
- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach, but it performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
- Semi-Supervised Learning with Meta-Gradient [123.26748223837802]
We propose a simple yet effective meta-learning algorithm in semi-supervised learning.
We find that the proposed algorithm performs favorably against state-of-the-art methods.
arXiv Detail & Related papers (2020-07-08T08:48:56Z)
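The ATC entry above describes estimating target accuracy from a confidence threshold learned on labelled source data. Below is a minimal sketch of that thresholding idea, not the authors' implementation; the logistic-regression model, the simulated source/target shift, and maximum predicted probability as the confidence score are all assumptions for illustration.

```python
# Minimal sketch of the ATC-style idea: calibrate a confidence threshold on
# labelled source data, then estimate accuracy on an unlabelled target set as
# the fraction of points whose confidence clears that threshold.
# Assumptions (not from the paper): logistic regression, a synthetic shift,
# and maximum predicted probability as the confidence score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=6000, n_features=20, random_state=1)
X_src, y_src = X[:3000], y[:3000]
# Simulated covariate shift: perturb the features of the "target" half.
X_tgt = X[3000:] + np.random.default_rng(0).normal(scale=0.5, size=(3000, 20))
y_tgt = y[3000:]  # used only to check the estimate at the end

clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)

conf_src = clf.predict_proba(X_src).max(axis=1)   # per-example confidence
acc_src = (clf.predict(X_src) == y_src).mean()    # labelled source accuracy

# Pick the threshold so the fraction of confident source points matches acc_src.
threshold = np.quantile(conf_src, 1.0 - acc_src)

conf_tgt = clf.predict_proba(X_tgt).max(axis=1)
estimated_acc = (conf_tgt >= threshold).mean()    # needs no target labels
true_acc = (clf.predict(X_tgt) == y_tgt).mean()
print(f"estimated target accuracy: {estimated_acc:.2f} (actual: {true_acc:.2f})")
```

Here the threshold is chosen so that the share of confident source predictions matches the source accuracy, which is the intuition the summary above points at.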