On the Effectiveness of Out-of-Distribution Data in Self-Supervised
Long-Tail Learning
- URL: http://arxiv.org/abs/2306.04934v2
- Date: Wed, 12 Jul 2023 06:20:04 GMT
- Title: On the Effectiveness of Out-of-Distribution Data in Self-Supervised
Long-Tail Learning
- Authors: Jianhong Bai, Zuozhu Liu, Hualiang Wang, Jin Hao, Yang Feng, Huanpeng
Chu, Haoji Hu
- Abstract summary: We propose Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT).
We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning.
Our method improves the performance of SSL on long-tailed datasets by a large margin.
- Score: 15.276356824489431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though self-supervised learning (SSL) has been widely studied as a
promising technique for representation learning, it does not generalize well on
long-tailed datasets because the majority classes dominate the feature space.
Recent work shows that long-tailed learning performance can be boosted by
sampling extra in-domain (ID) data for self-supervised training; however,
large-scale ID data that can rebalance the minority classes are expensive to
collect. In this paper, we propose an alternative, easy-to-use, and effective
solution, Contrastive with Out-of-distribution (OOD) data for Long-Tail
learning (COLT), which exploits OOD data to dynamically re-balance the feature
space. We empirically identify the counter-intuitive usefulness of OOD samples
in SSL long-tailed learning and design a novel SSL method around this
observation. Concretely, we first localize the 'head' and 'tail' samples by
assigning a tailness score to each OOD sample based on its neighborhood in the
feature space. Then, we propose an online OOD sampling strategy to dynamically
re-balance the feature space. Finally, we enforce the model to distinguish ID
and OOD samples via a distribution-level supervised contrastive loss. Extensive
experiments on various datasets and several state-of-the-art SSL frameworks
verify the effectiveness of the proposed method. The results show that our
method improves the performance of SSL on long-tailed datasets by a large
margin and even outperforms previous work that uses external ID data. Our code
is available at https://github.com/JianhongBai/COLT.
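The abstract outlines three concrete steps: scoring each OOD sample by how "tail-like" its feature-space neighborhood is, sampling OOD data online to re-balance the feature space, and a distribution-level supervised contrastive loss that keeps ID and OOD samples distinguishable. Below is a minimal sketch of the first two steps; the k-NN heuristic, the density proxy, and the softmax-over-tailness sampler are illustrative assumptions rather than the paper's exact formulation (see the linked repository for the actual implementation).

```python
# A minimal sketch (not the authors' exact formulation) of two of COLT's steps:
# scoring each OOD sample by how "tail-like" its feature-space neighborhood is,
# then sampling OOD data online to re-balance a training batch.
# The k-NN heuristic, `k`, `temperature`, and `budget` are illustrative choices.
import numpy as np

def tailness_scores(ood_feats, id_feats, id_density, k=10):
    """Score each OOD sample by the sparsity of its ID neighborhood.

    ood_feats:  (m, d) L2-normalized OOD features
    id_feats:   (n, d) L2-normalized in-distribution features
    id_density: (n,) proxy for how densely populated each ID sample's
                feature-space region is (low density ~ tail region).
    """
    sims = ood_feats @ id_feats.T                   # cosine similarities, (m, n)
    knn = np.argsort(-sims, axis=1)[:, :k]          # k nearest ID neighbors
    # OOD samples whose ID neighbors sit in sparse (tail) regions score higher.
    return 1.0 - id_density[knn].mean(axis=1)

def sample_ood_for_batch(scores, budget, temperature=0.1, rng=None):
    """Online sampling: draw `budget` OOD indices, favoring high tailness."""
    rng = rng or np.random.default_rng()
    logits = scores / temperature
    probs = np.exp(logits - logits.max())           # stable softmax over scores
    probs /= probs.sum()
    return rng.choice(len(scores), size=budget, replace=False, p=probs)

# Toy usage with random vectors standing in for an SSL encoder's features.
rng = np.random.default_rng(0)
id_feats = rng.normal(size=(500, 128))
id_feats /= np.linalg.norm(id_feats, axis=1, keepdims=True)
ood_feats = rng.normal(size=(200, 128))
ood_feats /= np.linalg.norm(ood_feats, axis=1, keepdims=True)
id_density = rng.uniform(size=500)                  # placeholder density proxy
scores = tailness_scores(ood_feats, id_feats, id_density)
picked = sample_ood_for_batch(scores, budget=32, rng=rng)
```

In the actual method the features would come from the SSL backbone being trained, so the scores would presumably be recomputed periodically as the feature space evolves; the third step (the distribution-level supervised contrastive loss) is omitted from this sketch.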
Related papers
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- EAT: Towards Long-Tailed Out-of-Distribution Detection [55.380390767978554]
This paper addresses the challenging task of long-tailed OOD detection.
The main difficulty lies in distinguishing OOD data from samples belonging to the tail classes.
We propose two simple ideas: (1) Expanding the in-distribution class space by introducing multiple abstention classes, and (2) Augmenting the context-limited tail classes by overlaying images onto the context-rich OOD data.
arXiv Detail & Related papers (2023-12-14T13:47:13Z)
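A minimal sketch of the second EAT idea above, overlaying a context-limited tail-class image onto a context-rich OOD image; the central-crop patch, its size, and the random placement are illustrative assumptions rather than EAT's exact recipe.

```python
# A minimal sketch, under illustrative assumptions, of augmenting a tail-class
# image by pasting a crop of it onto a context-rich OOD background image.
# Patch size and placement are assumptions, not necessarily EAT's exact recipe.
import numpy as np

def overlay_on_ood(tail_img, ood_img, scale=0.5, rng=None):
    """Paste a central crop of `tail_img` (H, W, C) onto `ood_img` (H, W, C)."""
    rng = rng or np.random.default_rng()
    h, w = tail_img.shape[:2]
    ph, pw = int(h * scale), int(w * scale)
    top, left = (h - ph) // 2, (w - pw) // 2
    patch = tail_img[top:top + ph, left:left + pw]  # foreground: tail object
    out = ood_img.copy()                            # background: OOD context
    y = rng.integers(0, h - ph + 1)                 # random paste location
    x = rng.integers(0, w - pw + 1)
    out[y:y + ph, x:x + pw] = patch
    return out

# Toy usage with random arrays standing in for same-sized images.
rng = np.random.default_rng(0)
tail = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
ood = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
aug = overlay_on_ood(tail, ood, rng=rng)
```

The augmented image would presumably keep the tail class's label, so the tail class sees more diverse contexts without collecting new ID data.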
- Exploration and Exploitation of Unlabeled Data for Open-Set Semi-Supervised Learning [130.56124475528475]
We address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples.
Our proposed method achieves state-of-the-art results on several challenging benchmarks and improves upon existing SSL methods even when ID samples are entirely absent from the unlabeled data.
arXiv Detail & Related papers (2023-06-30T14:25:35Z)
- Semi-supervised Learning with Deterministic Labeling and Large Margin Projection [25.398314796157933]
The centrality and diversity of the labeled data strongly influence the performance of semi-supervised learning (SSL).
This study learns a kernelized large-margin metric for a small set of the most stable and most divergent data, identified based on the OLF structure.
Owing to this novel design, the accuracy and stability of the OLF-based SSL model are significantly improved over its baseline methods.
arXiv Detail & Related papers (2022-08-17T04:09:35Z)
- Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that conducive unlabeled samples are prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
- No Shifted Augmentations (NSA): compact distributions for robust self-supervised Anomaly Detection [4.243926243206826]
Unsupervised Anomaly detection (AD) requires building a notion of normalcy, distinguishing in-distribution (ID) and out-of-distribution (OOD) data.
We investigate how the geometrical compactness of the ID feature distribution makes isolating and detecting outliers easier.
We propose novel architectural modifications to the self-supervised feature learning step that enable such compact distributions for ID data to be learned.
arXiv Detail & Related papers (2022-03-19T15:55:32Z)
- Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that effectively exploits the presence of OOD data for enhanced feature learning.
Our approach substantially improves performance on open-set SSL and outperforms the state of the art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
- Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning [54.85397562961903]
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available.
We address a more complex, novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in the unlabeled data.
Our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples.
arXiv Detail & Related papers (2020-07-22T10:33:55Z)