Addressing Sample Inefficiency in Multi-View Representation Learning
- URL: http://arxiv.org/abs/2312.10725v1
- Date: Sun, 17 Dec 2023 14:14:31 GMT
- Title: Addressing Sample Inefficiency in Multi-View Representation Learning
- Authors: Kumar Krishna Agrawal, Arna Ghosh, Adam Oberman, Blake Richards
- Abstract summary: Non-contrastive self-supervised learning (NC-SSL) methods have shown great promise for label-free representation learning in computer vision.
We provide theoretical insights on the implicit bias of the BarlowTwins and VICReg loss that can explain these heuristics and guide the development of more principled recommendations.
- Score: 6.621303125642322
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Non-contrastive self-supervised learning (NC-SSL) methods like BarlowTwins
and VICReg have shown great promise for label-free representation learning in
computer vision. Despite the apparent simplicity of these techniques,
researchers must rely on several empirical heuristics to achieve competitive
performance, most notably using high-dimensional projector heads and two
augmentations of the same image. In this work, we provide theoretical insights
on the implicit bias of the BarlowTwins and VICReg loss that can explain these
heuristics and guide the development of more principled recommendations. Our
first insight is that the orthogonality of the features is more critical than
projector dimensionality for learning good representations. Based on this, we
empirically demonstrate that low-dimensional projector heads are sufficient
with appropriate regularization, contrary to the existing heuristic. Our second
theoretical insight suggests that using multiple data augmentations better
represents the desiderata of the SSL objective. Based on this, we demonstrate
that leveraging more augmentations per sample improves representation quality
and trainability. In particular, it improves optimization convergence, leading
to better features emerging earlier in the training. Remarkably, we demonstrate
that we can reduce the pretraining dataset size by up to 4x while maintaining
accuracy and improving convergence simply by using more data augmentations.
Combining these insights, we present practical pretraining recommendations that
improve wall-clock time by 2x and improve performance on CIFAR-10/STL-10
datasets using a ResNet-50 backbone. Thus, this work provides a theoretical
insight into NC-SSL and produces practical recommendations for enhancing its
sample and compute efficiency.
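To make the recommended recipe concrete, the sketch below shows a BarlowTwins-style NC-SSL objective with a low-dimensional projector and more than two augmentations per image. This is a minimal illustration rather than the authors' released code: the 64-dimensional projector, the pairwise averaging over augmentations, and all names are assumptions made for exposition.

```python
# Minimal PyTorch sketch of a BarlowTwins-style NC-SSL objective, extended to
# average the loss over all pairs of K augmentations per image. The projector
# size, the pairwise-averaging scheme, and the class/function names are
# illustrative assumptions, not the authors' exact implementation.
import itertools
import torch
import torch.nn as nn


def barlow_twins_loss(z1, z2, lambd=5e-3, eps=1e-6):
    """Cross-correlation loss between two batches of embeddings of shape (N, D)."""
    n, d = z1.shape
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    c = (z1.T @ z2) / n  # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # orthogonality term
    return on_diag + lambd * off_diag


class MultiAugNCSSL(nn.Module):
    """Encoder plus a low-dimensional projector, trained with K augmentations per image."""

    def __init__(self, encoder, feat_dim, proj_dim=64):
        super().__init__()
        self.encoder = encoder  # assumed to map (N, C, H, W) -> (N, feat_dim)
        # Low-dimensional projector: per the paper's first insight, enforcing
        # feature orthogonality matters more than making proj_dim large.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim)
        )

    def forward(self, views):
        # `views`: list of K augmented batches, each of shape (N, C, H, W).
        zs = [self.projector(self.encoder(v)) for v in views]
        # Average the pairwise loss over all K-choose-2 augmentation pairs.
        pairs = list(itertools.combinations(range(len(zs)), 2))
        return sum(barlow_twins_loss(zs[i], zs[j]) for i, j in pairs) / len(pairs)
```

In this sketch, the off-diagonal term is what enforces orthogonality of the features; the abstract's claim is that keeping this regularization strong, rather than enlarging the projector, is what drives representation quality, and that using more than two augmentations per sample improves convergence.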
Related papers
- Investigating the Benefits of Projection Head for Representation Learning [11.20245728716827]
An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations.
The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better?
We show that implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers.
arXiv Detail & Related papers (2024-03-18T00:48:58Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Weighted Ensemble Self-Supervised Learning [67.24482854208783]
Ensembling has proven to be a powerful technique for boosting model performance.
We develop a framework that permits data-dependent weighted cross-entropy losses.
Our method outperforms both in multiple evaluation metrics on ImageNet-1K.
arXiv Detail & Related papers (2022-11-18T02:00:17Z)
- Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning [0.0]
We study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations.
Our results show that all methods are effective in learning useful representations and avoiding representational collapse.
The encoder pretrained with the temporal order verification task shows the best results across all experiments.
arXiv Detail & Related papers (2022-09-22T10:18:59Z)
- Light-weight probing of unsupervised representations for Reinforcement Learning [20.638410483549706]
We study whether linear probing can be a proxy evaluation task for the quality of unsupervised RL representations.
We show that the probing tasks are strongly rank correlated with the downstream RL performance on the Atari100k Benchmark.
This provides a more efficient method for exploring the space of pretraining algorithms and identifying promising pretraining recipes.
arXiv Detail & Related papers (2022-08-25T21:08:01Z)
- SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding [131.0977050185209]
Selective Retraining (SiRi) can significantly outperform previous approaches on three popular benchmarks.
SiRi performs surprisingly well even with limited training data.
We also extend it to transformer-based visual grounding models and other vision-language tasks to verify its validity.
arXiv Detail & Related papers (2022-07-27T07:01:01Z)
- CCLF: A Contrastive-Curiosity-Driven Learning Framework for Sample-Efficient Reinforcement Learning [56.20123080771364]
We develop a model-agnostic Contrastive-Curiosity-Driven Learning Framework (CCLF) for reinforcement learning.
CCLF fully exploits sample importance and improves learning efficiency in a self-supervised manner.
We evaluate this approach on the DeepMind Control Suite, Atari, and MiniGrid benchmarks.
arXiv Detail & Related papers (2022-05-02T14:42:05Z)
- Visual Alignment Constraint for Continuous Sign Language Recognition [74.26707067455837]
Vision-based Continuous Sign Language Recognition aims to recognize unsegmented gestures from image sequences.
In this work, we revisit the overfitting problem in recent CTC-based CSLR works and attribute it to the insufficient training of the feature extractor.
We propose a Visual Alignment Constraint (VAC) to enhance the feature extractor with more alignment supervision.
arXiv Detail & Related papers (2021-04-06T07:24:58Z)
- A Simple Framework for Contrastive Learning of Visual Representations [116.37752766922407]
This paper presents SimCLR: a simple framework for contrastive learning of visual representations.
We show that composition of data augmentations plays a critical role in defining effective predictive tasks.
We are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet.
arXiv Detail & Related papers (2020-02-13T18:50:45Z)
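For contrast with the non-contrastive objective sketched above, the predictive task in the SimCLR entry is usually expressed as the NT-Xent contrastive loss over two augmented views of each image. The snippet below is a compact assumed implementation for illustration, not the paper's reference code, and the temperature value is only indicative.

```python
# Minimal PyTorch sketch of a SimCLR-style NT-Xent contrastive loss for two
# augmented views of the same batch; an illustrative sketch, not reference code.
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, D) embeddings of two augmentations of the same N images."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.T / temperature                          # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)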
This list is automatically generated from the titles and abstracts of the papers on this site.