FRUGAL: Unlocking SSL for Software Analytics
- URL: http://arxiv.org/abs/2108.09847v1
- Date: Sun, 22 Aug 2021 21:15:27 GMT
- Title: FRUGAL: Unlocking SSL for Software Analytics
- Authors: Huy Tu and Tim Menzies
- Abstract summary: Unsupervised learning is a promising direction to learn hidden patterns within unlabelled data.
We present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme.
- Score: 17.63040340961143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard software analytics often requires a large amount of labelled data
in order to commission models with acceptable performance. However, prior work has
shown that such labelling can be expensive, taking several weeks for thousands of
commits, and that labels are not always available when moving to new research problems
and domains. Unsupervised learning is a promising direction for learning hidden
patterns within unlabelled data, though it has only been extensively studied in defect
prediction. Nevertheless, unsupervised learning can be ineffective by itself and has
not been explored in other domains (e.g., static analysis and issue close time).
Motivated by this literature gap and these technical limitations, we present FRUGAL,
a tuned semi-supervised method built on a simple optimization scheme that requires
neither sophisticated methods (e.g., deep learners) nor expensive ones (e.g., 100%
manually labelled data). FRUGAL optimizes the unsupervised learner's configurations
(via a simple grid search) while validating our design decision of labelling just
2.5% of the data before prediction.
As shown by the experiments in this paper, FRUGAL outperforms the state-of-the-art
adoptable static code warning recognizer and issue close time predictor, while
reducing the cost of labelling by a factor of 40 (from 100% to 2.5%). Hence we assert
that FRUGAL can save considerable effort in data labelling, especially when validating
prior work or researching new problems. Based on this work, we suggest that proponents
of complex and expensive methods should always baseline such methods against simpler
and cheaper alternatives. For instance, a semi-supervised learner like FRUGAL can
serve as a baseline for state-of-the-art software analytics.
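The abstract describes FRUGAL as grid-searching an unsupervised learner's configurations while labelling only 2.5% of the data. The sketch below illustrates that loop under stated assumptions; the KMeans stand-in for the unsupervised learner, the configuration grid, the synthetic data, and the F1 tuning metric are not taken from the paper.

```python
# Minimal sketch of a FRUGAL-style loop: label ~2.5% of the data, then grid-search
# an unsupervised learner's configuration against that small labelled subset.
# Assumptions (not from the paper): KMeans as the unsupervised learner, the
# cluster-count grid, synthetic data, and F1 as the tuning metric.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Step 1: spend the labelling budget on just 2.5% of the data.
budget = int(0.025 * len(X))
labelled_idx = rng.choice(len(X), size=budget, replace=False)

def score_config(n_clusters: int) -> float:
    """Fit the unsupervised learner, then name clusters with the few labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    clusters = km.labels_
    # Each cluster takes the majority label among its labelled members (0 if none).
    mapping = {}
    for c in range(n_clusters):
        members = [i for i in labelled_idx if clusters[i] == c]
        mapping[c] = int(round(y[members].mean())) if members else 0
    preds = np.array([mapping[c] for c in clusters])
    # Tune only against the labelled 2.5%; everything else stays unlabelled.
    return f1_score(y[labelled_idx], preds[labelled_idx])

# Step 2: a simple grid search over the unsupervised learner's configurations.
grid = [2, 4, 8, 16]
best = max(grid, key=score_config)
print(f"best n_clusters={best}, F1 on labelled subset={score_config(best):.2f}")
```

The design choice mirrored here is that only the small labelled subset ever guides tuning; in a real evaluation, held-out test labels would be reserved for the final comparison.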
Related papers
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
- Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics [31.13621632964345]
Semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data.
This paper argues that such "strong" algorithms perform better than standard, weaker SSL algorithms.
arXiv Detail & Related papers (2023-02-03T20:59:09Z)
- When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors [15.862838836160634]
This paper applies a wide range of 55 semi-supervised learners to over 714 projects.
We find that semi-supervised "co-training methods" work significantly better than other approaches.
arXiv Detail & Related papers (2022-11-10T23:39:12Z)
- An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code using only off-the-shelf operations.
arXiv Detail & Related papers (2022-09-28T02:11:34Z)
- MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL).
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
arXiv Detail & Related papers (2022-09-26T12:04:49Z)
- Interpolation-based Contrastive Learning for Few-Label Semi-Supervised Learning [43.51182049644767]
Semi-supervised learning (SSL) has long been proven to be an effective technique for constructing powerful models with limited labels.
Regularization-based methods which force the perturbed samples to have similar predictions with the original ones have attracted much attention.
We propose a novel contrastive loss to guide the embedding of the learned network to change linearly between samples.
arXiv Detail & Related papers (2022-02-24T06:00:05Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z)
- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach, but it performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
- Semi-Supervised Learning with Meta-Gradient [123.26748223837802]
We propose a simple yet effective meta-learning algorithm in semi-supervised learning.
We find that the proposed algorithm performs favorably against state-of-the-art methods.
arXiv Detail & Related papers (2020-07-08T08:48:56Z)
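The ATC entry above describes estimating target accuracy from a confidence threshold learned on labelled source data. Below is a minimal sketch of that thresholding idea, not the authors' implementation; the logistic-regression model, the simulated source/target shift, and maximum predicted probability as the confidence score are all assumptions for illustration.

```python
# Minimal sketch of the ATC-style idea: calibrate a confidence threshold on
# labelled source data, then estimate accuracy on an unlabelled target set as
# the fraction of points whose confidence clears that threshold.
# Assumptions (not from the paper): logistic regression, a synthetic shift,
# and maximum predicted probability as the confidence score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=6000, n_features=20, random_state=1)
X_src, y_src = X[:3000], y[:3000]
# Simulated covariate shift: perturb the features of the "target" half.
X_tgt = X[3000:] + np.random.default_rng(0).normal(scale=0.5, size=(3000, 20))
y_tgt = y[3000:]  # used only to check the estimate at the end

clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)

conf_src = clf.predict_proba(X_src).max(axis=1)   # per-example confidence
acc_src = (clf.predict(X_src) == y_src).mean()    # labelled source accuracy

# Pick the threshold so the fraction of confident source points matches acc_src.
threshold = np.quantile(conf_src, 1.0 - acc_src)

conf_tgt = clf.predict_proba(X_tgt).max(axis=1)
estimated_acc = (conf_tgt >= threshold).mean()    # needs no target labels
true_acc = (clf.predict(X_tgt) == y_tgt).mean()
print(f"estimated target accuracy: {estimated_acc:.2f} (actual: {true_acc:.2f})")
```

Here the threshold is chosen so that the share of confident source predictions matches the source accuracy, which is the intuition the summary above points at.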