Data-Efficient Contrastive Self-supervised Learning: Most Beneficial
Examples for Supervised Learning Contribute the Least
- URL: http://arxiv.org/abs/2302.09195v5
- Date: Tue, 12 Mar 2024 19:22:20 GMT
- Title: Data-Efficient Contrastive Self-supervised Learning: Most Beneficial
Examples for Supervised Learning Contribute the Least
- Authors: Siddharth Joshi and Baharan Mirzasoleiman
- Abstract summary: Self-supervised learning (SSL) learns high-quality representations from large pools of unlabeled training data.
As datasets grow larger, it becomes crucial to identify the examples that contribute the most to learning such representations.
We prove that examples that contribute the most to contrastive SSL are those that have the most similar augmentations to other examples.
- Score: 14.516008359896421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) learns high-quality representations from large
pools of unlabeled training data. As datasets grow larger, it becomes crucial
to identify the examples that contribute the most to learning such
representations. This enables efficient SSL by reducing the volume of data
required. Nevertheless, quantifying the value of examples for SSL has remained
an open question. In this work, we address this problem for the first time, by
proving that examples that contribute the most to contrastive SSL are those
that have the most similar augmentations to other examples, in expectation. We
provide rigorous guarantees for the generalization performance of contrastive
learning on such subsets. Through extensive experiments, we show that we can
safely exclude 20% of examples from CIFAR100 and 40% from STL10 and
TinyImageNet, without affecting downstream task performance. In general,
subsets selected by our method outperform random subsets by over 3% across
these datasets. Interestingly, we also discover the subsets that contribute the
most to contrastive learning are those that contribute the least to supervised
learning. Code available at
https://github.com/bigml-cs-ucla/sas-data-efficient-contrastive-learning.
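The selection criterion described in the abstract can be sketched in a few lines. The snippet below is a minimal, illustrative implementation and not the released code (see the repository above for that): the `encoder` and `augment` callables, the number of views, and the `keep_fraction` parameter are all assumptions, and the score is simply each example's average cosine similarity, taken over augmented views, to the other examples.

```python
import torch
import torch.nn.functional as F

def augmentation_similarity_scores(images, encoder, augment, n_views=4):
    """Score each example by the average cosine similarity of its (averaged)
    augmented view to the augmented views of all other examples."""
    views = []
    with torch.no_grad():
        for _ in range(n_views):
            views.append(F.normalize(encoder(augment(images)), dim=1))
    z = torch.stack(views, dim=1)                  # (N, n_views, D)
    mean_view = F.normalize(z.mean(dim=1), dim=1)  # (N, D): average view per example
    sim = mean_view @ mean_view.T                  # (N, N) pairwise cosine similarities
    sim.fill_diagonal_(0.0)                        # ignore self-similarity
    return sim.mean(dim=1)                         # higher = more similar to other examples

def select_subset(images, encoder, augment, keep_fraction=0.8):
    """Keep the examples whose augmentations are, on average, most similar to others."""
    scores = augmentation_similarity_scores(images, encoder, augment)
    k = int(keep_fraction * len(images))
    return torch.topk(scores, k).indices           # indices of the retained examples
```

Setting `keep_fraction=0.8` would mirror the CIFAR100 setting reported in the abstract (excluding 20% of examples); the formal guarantees and the exact subset-selection procedure are given in the paper.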
Related papers
- Views Can Be Deceiving: Improved SSL Through Feature Space Augmentation [27.609748213840138]
In this work, we explore the impact of spurious features on Self-Supervised Learning (SSL) for visual representation learning.
We show that commonly used augmentations in SSL can cause undesired invariances in the image space.
We propose LateTVG to remove spurious information from these representations during pre-training, by regularizing later layers of the encoder via pruning.
arXiv Detail & Related papers (2024-05-28T18:42:13Z)
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- On the Effectiveness of Out-of-Distribution Data in Self-Supervised Long-Tail Learning [15.276356824489431]
We propose Contrastive with Out-of-distribution (OOD) data for Long-Tail learning (COLT).
We empirically identify the counter-intuitive usefulness of OOD samples in SSL long-tailed learning.
Our method significantly improves the performance of SSL on long-tailed datasets by a large margin.
arXiv Detail & Related papers (2023-06-08T04:32:10Z)
- Towards Democratizing Joint-Embedding Self-Supervised Learning [17.59181163979478]
We show that it is possible to train SimCLR to learn useful representations while using a single image patch as a negative example.
In the hope of democratizing JE-SSL, we introduce an optimized PyTorch library for SSL.
arXiv Detail & Related papers (2023-03-03T14:55:44Z)
- Towards Realistic Semi-Supervised Learning [73.59557447798134]
We propose a novel approach to tackle SSL in the open-world setting, where we simultaneously learn to classify known and unknown classes.
Our approach substantially outperforms the existing state-of-the-art on seven diverse datasets.
arXiv Detail & Related papers (2022-07-05T19:04:43Z)
- Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as the labeled data.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data are prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
- Self-supervised Learning is More Robust to Dataset Imbalance [65.84339596595383]
We investigate self-supervised learning under dataset imbalance.
Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations.
We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
arXiv Detail & Related papers (2021-10-11T06:29:56Z)
- Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that effectively exploits the presence of OOD data for enhanced feature learning.
Our approach substantially lifts performance on open-set SSL and outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2021-08-12T09:14:44Z)
- Rethinking Self-Supervised Learning: Small is Beautiful [30.809693803413445]
We propose scaled-down self-supervised learning (S3L), which includes three components: small resolution, small architecture, and small data.
On a diverse set of datasets, S3L consistently achieves higher accuracy at much lower training cost than previous SSL paradigms.
arXiv Detail & Related papers (2021-03-25T01:48:52Z)
- Relation-Guided Representation Learning [53.60351496449232]
We propose a new representation learning method that explicitly models and leverages sample relations.
Our framework well preserves the relations between samples.
By seeking to embed samples into a subspace, we show that our method can address the large-scale and out-of-sample problems.
arXiv Detail & Related papers (2020-07-11T10:57:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.