Self-supervised Learning is More Robust to Dataset Imbalance
- URL: http://arxiv.org/abs/2110.05025v1
- Date: Mon, 11 Oct 2021 06:29:56 GMT
- Title: Self-supervised Learning is More Robust to Dataset Imbalance
- Authors: Hong Liu, Jeff Z. HaoChen, Adrien Gaidon, Tengyu Ma
- Abstract summary: We investigate self-supervised learning under dataset imbalance.
Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations.
We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
- Score: 65.84339596595383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) is a scalable way to learn general visual
representations since it learns without labels. However, large-scale unlabeled
datasets in the wild often have long-tailed label distributions, where we know
little about the behavior of SSL. In this work, we systematically investigate
self-supervised learning under dataset imbalance. First, we find out via
extensive experiments that off-the-shelf self-supervised representations are
already more robust to class imbalance than supervised representations. The
performance gap between balanced and imbalanced pre-training with SSL is
significantly smaller than the gap with supervised learning, across sample
sizes, for both in-domain and, especially, out-of-domain evaluation. Second,
towards understanding the robustness of SSL, we hypothesize that SSL learns
richer features from frequent data: it may learn
label-irrelevant-but-transferable features that help classify the rare classes
and downstream tasks. In contrast, supervised learning has no incentive to
learn features irrelevant to the labels from frequent examples. We validate
this hypothesis with semi-synthetic experiments and theoretical analyses on a
simplified setting. Third, inspired by the theoretical insights, we devise a
re-weighted regularization technique that consistently improves the SSL
representation quality on imbalanced datasets with several evaluation criteria,
closing the small gap between balanced and imbalanced datasets with the same
number of examples.
Related papers
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z)
- What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights [67.72413262980272]
Severe data imbalance naturally exists among web-scale vision-language datasets.
We find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning.
The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
arXiv Detail & Related papers (2024-05-31T17:57:24Z)
- Reinforcement Learning-Guided Semi-Supervised Learning [20.599506122857328]
We propose a novel Reinforcement Learning Guided SSL method, RLGSSL, that formulates SSL as a one-armed bandit problem.
RLGSSL incorporates a carefully designed reward function that balances the use of labeled and unlabeled data to enhance generalization performance.
We demonstrate the effectiveness of RLGSSL through extensive experiments on several benchmark datasets and show that our approach achieves consistently superior performance compared to state-of-the-art SSL methods.
arXiv Detail & Related papers (2024-05-02T21:52:24Z)
- Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling [26.444935219428036]
Self-supervised learning (SSL) has emerged as a powerful technique for learning rich representations from unlabeled data.
In real-world settings, spurious correlations between some attributes (e.g. race, gender and age) and labels for downstream tasks often exist.
We propose a learning-speed aware SSL (LA-SSL) approach, in which we sample each training example with a probability that is inversely related to its learning speed.
arXiv Detail & Related papers (2023-11-27T22:52:45Z)
- Does Decentralized Learning with Non-IID Unlabeled Data Benefit from Self Supervision? [51.00034621304361]
We study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL).
We study the effectiveness of contrastive learning algorithms under decentralized learning settings.
arXiv Detail & Related papers (2022-10-20T01:32:41Z)
- OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning [110.40285771431687]
Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning.
Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data.
This work introduces OpenLDN, which uses a pairwise similarity loss to discover novel classes.
arXiv Detail & Related papers (2022-07-05T18:51:05Z)
- Rethinking Re-Sampling in Imbalanced Semi-Supervised Learning [26.069534478556527]
Semi-Supervised Learning (SSL) has shown its strong ability in utilizing unlabeled data when labeled data is scarce.
Most SSL algorithms work under the assumption that the class distributions are balanced in both training and test sets.
In this work, we consider the problem of SSL on class-imbalanced data, which better reflects real-world situations.
arXiv Detail & Related papers (2021-06-01T03:58:18Z)
- Distribution Aligning Refinery of Pseudo-label for Imbalanced Semi-supervised Learning [126.31716228319902]
We develop the Distribution Aligning Refinery of Pseudo-label (DARP) algorithm.
We show that DARP is provably and efficiently compatible with state-of-the-art SSL schemes.
arXiv Detail & Related papers (2020-07-17T09:16:05Z)
- Class-Imbalanced Semi-Supervised Learning [33.94685366079589]
Semi-Supervised Learning (SSL) has achieved great success in overcoming the difficulties of labeling and making full use of unlabeled data.
We introduce a task of class-imbalanced semi-supervised learning (CISSL), which refers to semi-supervised learning with class-imbalanced data.
Our method shows better performance than the conventional methods in the CISSL environment.
arXiv Detail & Related papers (2020-02-17T07:48:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.