RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning
- URL: http://arxiv.org/abs/2106.07760v1
- Date: Mon, 14 Jun 2021 21:18:47 GMT
- Title: RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning
- Authors: Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, Rishabh Iyer
- Abstract summary: We propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning.
We show that RETRIEVE achieves a speedup of around 3X in the traditional SSL setting and achieves a speedup of 5X compared to state-of-the-art (SOTA) robust SSL algorithms.
- Score: 9.155410614399159
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Semi-supervised learning (SSL) algorithms have had great success in recent
years in limited labeled data regimes. However, the current state-of-the-art
SSL algorithms are computationally expensive and entail significant compute
time and energy requirements. This is a major limitation for many smaller
companies and academic groups. Our main insight is that training on a subset
of the unlabeled data, rather than the entire unlabeled set, enables current
SSL algorithms to converge faster, thereby significantly reducing
computational costs. In this work, we propose RETRIEVE, a coreset selection framework
for efficient and robust semi-supervised learning. RETRIEVE selects the coreset
by solving a mixed discrete-continuous bi-level optimization problem such that
the selected coreset minimizes the labeled set loss. We use a one-step gradient
approximation and show that the discrete optimization problem is approximately
submodular, thereby enabling simple greedy algorithms to obtain the coreset. We
empirically demonstrate on several real-world datasets that existing SSL
algorithms like VAT, Mean-Teacher, and FixMatch, when used with RETRIEVE,
achieve a) faster training times and b) better performance when the unlabeled
data contains out-of-distribution (OOD) samples or class imbalance. More
specifically, we show that, with minimal accuracy degradation, RETRIEVE
achieves a speedup of around 3X in the traditional SSL setting and a speedup
of 5X over state-of-the-art (SOTA) robust SSL algorithms in the presence of
imbalanced and OOD data.
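The selection step the abstract sketches is compact enough to illustrate. Below is a minimal, hypothetical NumPy sketch of the greedy loop (names and the toy usage are ours, not the authors' implementation; methods in this family also typically restrict themselves to last-layer gradients for efficiency): each round adds the unlabeled point whose one-step gradient update most reduces the labeled-set loss, which is the step the approximate-submodularity result justifies.

```python
import numpy as np

def greedy_coreset(theta, labeled_loss, unlabeled_grads, budget, lr=0.1):
    """Hypothetical sketch of RETRIEVE-style greedy coreset selection.

    One SGD step using the (unsupervised) gradients of a candidate set S
    moves the parameters by -lr * sum_{i in S} g_i; we greedily grow S
    with the point whose hypothetical one-step update most reduces the
    labeled-set loss. Approximate submodularity of the discrete problem
    is what makes this simple greedy loop near-optimal.
    """
    step = np.zeros_like(theta)          # accumulated update from chosen points
    selected, remaining = [], set(range(len(unlabeled_grads)))
    for _ in range(budget):
        # marginal gain: labeled loss at the parameters after a one-step
        # update including candidate i on top of the points chosen so far
        best = min(remaining,
                   key=lambda i: labeled_loss(theta - lr * (step + unlabeled_grads[i])))
        step = step + unlabeled_grads[best]
        selected.append(best)
        remaining.remove(best)
    return selected

# toy usage: quadratic labeled loss, random per-sample gradients
rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grads = rng.normal(size=(100, 5))
coreset = greedy_coreset(theta, lambda t: float(t @ t), grads, budget=10)
```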
Related papers
- Continual Learning on a Diet: Learning from Sparsely Labeled Streams Under Constrained Computation [123.4883806344334]
We study a realistic Continual Learning setting where learning algorithms are granted a restricted computational budget per time step while training.
We apply this setting to large-scale semi-supervised Continual Learning scenarios with sparse label rates.
Our extensive analysis and ablations show that our method, DietCL, is stable across a full spectrum of label sparsity levels, computational budgets, and other ablation settings.
arXiv Detail & Related papers (2024-04-19T10:10:39Z)
- Can semi-supervised learning use all the data effectively? A lower bound perspective [58.71657561857055]
We show that semi-supervised learning algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning algorithms.
Our work suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.
arXiv Detail & Related papers (2023-11-30T13:48:50Z)
- MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL).
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
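To make "largest inconsistency" concrete, here is a hedged PyTorch sketch of a worst-case consistency loss; it is our reading of the summary, not MaxMatch's code, and `model` and `augment` are assumed callables. Where averaged-consistency methods take the mean divergence over augmented views, this takes the per-sample maximum.

```python
import torch
import torch.nn.functional as F

def worst_case_consistency(model, x_unlabeled, augment, num_augs=4):
    """Per-sample WORST-case KL divergence between the prediction on the
    clean view and the predictions on several augmented views."""
    with torch.no_grad():                         # clean view is the target
        p_clean = F.softmax(model(x_unlabeled), dim=-1)
    divs = []
    for _ in range(num_augs):
        logp_aug = F.log_softmax(model(augment(x_unlabeled)), dim=-1)
        divs.append(F.kl_div(logp_aug, p_clean, reduction="none").sum(dim=-1))
    # max over views per sample, then mean over the batch
    return torch.stack(divs).max(dim=0).values.mean()
```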
arXiv Detail & Related papers (2022-09-26T12:04:49Z)
- Interpolation-based Contrastive Learning for Few-Label Semi-Supervised Learning [43.51182049644767]
Semi-supervised learning (SSL) has long been proven to be an effective technique for constructing powerful models with limited labels.
Regularization-based methods, which force perturbed samples to have predictions similar to those of the original samples, have attracted much attention.
We propose a novel contrastive loss to guide the embedding of the learned network to change linearly between samples.
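One way to realize "change linearly between samples" as a contrastive loss, sketched below under our own assumptions rather than as the paper's exact objective: the embedding of a mixed input is pulled toward the matching linear mix of the endpoint embeddings (its positive), while other samples' mixes in the batch act as negatives.

```python
import torch
import torch.nn.functional as F

def interpolation_contrastive(encoder, x1, x2, lam=0.5, tau=0.1):
    """InfoNCE-style illustration: row i of the similarity matrix should
    match column i, i.e. each mixed embedding should be closest to the
    same interpolation of its own endpoints' embeddings."""
    z_mix = F.normalize(encoder(lam * x1 + (1 - lam) * x2), dim=-1)
    z_lin = F.normalize(lam * encoder(x1) + (1 - lam) * encoder(x2), dim=-1)
    logits = z_mix @ z_lin.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```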
arXiv Detail & Related papers (2022-02-24T06:00:05Z)
- Unlabeled Data Help: Minimax Analysis and Adversarial Robustness [21.79888306754263]
Self-supervised learning (SSL) approaches successfully demonstrate the great potential of supplementing learning algorithms with additional unlabeled data.
It is still unclear whether existing SSL algorithms can fully utilize the information in both labeled and unlabeled data.
This paper gives an affirmative answer for the reconstruction-based SSL algorithm of Lee et al. (2020) under several statistical models.
arXiv Detail & Related papers (2022-02-14T19:24:43Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose Dash, a semi-supervised learning (SSL) approach that trains models on unlabeled examples admitted by a dynamic threshold.
Dash is adaptive in how it selects unlabeled data as training progresses.
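As the summary is terse, here is a hedged sketch of what loss-based dynamic thresholding can look like; the schedule and constants are illustrative, not Dash's. Unlabeled examples are used only while their pseudo-label loss stays below a threshold that shrinks as training progresses.

```python
import torch
import torch.nn.functional as F

def dynamic_threshold_mask(model, x_unlabeled, step, rho0=1.0, gamma=1.1):
    """Select unlabeled examples whose pseudo-label loss is below a
    threshold rho0 * gamma**(-step) that decreases over training, so
    fewer unreliable examples are admitted as the model sharpens."""
    with torch.no_grad():
        logits = model(x_unlabeled)
        pseudo = logits.argmax(dim=-1)                    # hard pseudo-labels
        loss = F.cross_entropy(logits, pseudo, reduction="none")
    return loss < rho0 * gamma ** (-step)                 # boolean keep-mask
```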
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
- Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization [85.84019017587477]
Distributionally robust supervised learning is emerging as a key paradigm for building reliable machine learning systems for real-world applications.
Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms.
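The extra-gradient core is small enough to sketch on a generic min-max problem; the snippet below omits the paper's variance reduction and Wasserstein-DRSL specifics and shows only the lookahead update that stabilizes plain gradient descent-ascent.

```python
def extragradient(grad_x, grad_y, x, y, lr=0.1, steps=100):
    """Extra-gradient for min_x max_y f(x, y): first extrapolate to a
    lookahead point, then update using the gradients AT that point."""
    for _ in range(steps):
        x_half = x - lr * grad_x(x, y)        # lookahead (extrapolation)
        y_half = y + lr * grad_y(x, y)
        x = x - lr * grad_x(x_half, y_half)   # real update at the lookahead
        y = y + lr * grad_y(x_half, y_half)
    return x, y

# toy bilinear game f(x, y) = x * y: plain descent-ascent spirals away,
# extra-gradient converges toward the saddle point (0, 0)
x, y = extragradient(lambda x, y: y, lambda x, y: x, x=1.0, y=1.0)
```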
arXiv Detail & Related papers (2021-04-27T16:56:09Z)
- Relieving the Plateau: Active Semi-Supervised Learning for a Better Landscape [2.3046646540823916]
Semi-supervised learning (SSL) leverages unlabeled data that are more accessible than their labeled counterparts.
Active learning (AL) selects unlabeled instances to be annotated by a human-in-the-loop in hopes of better performance with less labeled data.
We propose convergence rate control (CRC), an AL algorithm that selects unlabeled data to improve the problem conditioning upon its inclusion in the labeled set.
arXiv Detail & Related papers (2021-04-08T06:03:59Z)
- GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning [11.220278271829699]
We introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework.
We propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates.
We show that our framework improves upon the state of the art in both efficiency and accuracy in two of the settings studied (cases (a) and (c) in the paper) and is more efficient than other state-of-the-art robust learning algorithms.
arXiv Detail & Related papers (2020-12-19T08:41:34Z)
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)