Objectives Matter: Understanding the Impact of Self-Supervised
Objectives on Vision Transformer Representations
- URL: http://arxiv.org/abs/2304.13089v1
- Date: Tue, 25 Apr 2023 18:48:23 GMT
- Title: Objectives Matter: Understanding the Impact of Self-Supervised
Objectives on Vision Transformer Representations
- Authors: Shashank Shekhar, Florian Bordes, Pascal Vincent, Ari Morcos
- Abstract summary: We show that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features.
We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information.
- Score: 13.437097059358067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and
reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading
paradigms for self-supervised learning of vision transformers, but they differ
substantially in their transfer performance. Here, we aim to explain these
differences by analyzing the impact of these objectives on the structure and
transferability of the learned representations. Our analysis reveals that
reconstruction-based learning features are significantly dissimilar to
joint-embedding based learning features and that models trained with similar
objectives learn similar features even across architectures. These differences
arise early in the network and are primarily driven by attention and
normalization layers. We find that joint-embedding features yield better linear
probe transfer for classification because the different objectives drive
different distributions of information and invariances in the learned
representation. These differences explain opposite trends in transfer
performance for downstream tasks that require spatial specificity in features.
Finally, we address how fine-tuning changes reconstructive representations to
enable better transfer, showing that fine-tuning re-organizes the information
to be more similar to pre-trained joint embedding models.
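The layer-wise dissimilarity analysis described above is typically carried out with a representation-similarity measure such as centered kernel alignment (CKA). Below is a minimal sketch of linear CKA in NumPy; the model names in the comments and the random feature matrices are illustrative placeholders, not the paper's actual pipeline.
```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices.

    X, Y: (n_samples, dim) features from two models on the same inputs.
    Returns a scalar in [0, 1]; higher means more similar representations.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))

# Toy usage: random stand-ins for per-block ViT activations from, e.g.,
# an MAE and a DINO encoder evaluated on the same image batch.
rng = np.random.default_rng(0)
feats_recon = rng.normal(size=(512, 768))
feats_joint = rng.normal(size=(512, 768))
print(f"CKA: {linear_cka(feats_recon, feats_joint):.3f}")
```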
Related papers
- Shortcut Learning Susceptibility in Vision Classifiers [3.004632712148892]
Shortcut learning occurs when machine learning models exploit spurious correlations in data instead of capturing meaningful features.
This phenomenon is prevalent across various machine learning applications, including vision, natural language processing, and speech recognition.
We systematically evaluate vision classifiers by introducing deliberate shortcuts into the dataset that are positionally correlated with class labels.
arXiv Detail & Related papers (2025-02-13T10:25:52Z)
- Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient [0.49478969093606673]
We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory.
We study the development of internal structure in transformer language models during training.
arXiv Detail & Related papers (2024-10-03T20:51:02Z)
- A Theoretical Analysis of Self-Supervised Learning for Vision Transformers [66.08606211686339]
Masked autoencoders (MAE) and contrastive learning (CL) capture different types of representations.
We study the training dynamics of one-layer softmax-based vision transformers (ViTs) on both MAE and CL objectives.
arXiv Detail & Related papers (2024-03-04T17:24:03Z)
- ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z)
- Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance [34.27851973031995]
We leverage the fact that classifier-guided diffusion models generate images that reflect the semantic signals provided by the classifier.
Specifically, we compare contrastive, matching, and captioning losses in terms of their semantic signals, and introduce a simple baseline that not only supports our analyses but also improves the quality of generative guidance (a guidance-step sketch follows this entry).
arXiv Detail & Related papers (2023-02-10T11:17:20Z)
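Since the entry above hinges on classifier-guided diffusion, here is a hedged sketch of a single guidance step in PyTorch: the reverse-process mean is shifted by the gradient of the classifier's log-probability for the target class. The function name, arguments, and `scale` factor are illustrative assumptions, not the paper's API.
```python
import torch

def classifier_guided_mean(mean, variance, x_t, y, classifier, scale=1.0):
    # Shift the reverse-diffusion mean toward samples the classifier assigns
    # to class y, using grad_x log p(y | x_t) (classifier-guidance style).
    x = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x), dim=-1)
    selected = log_probs[torch.arange(x.shape[0]), y].sum()
    grad = torch.autograd.grad(selected, x)[0]
    return mean + scale * variance * grad
```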
- Demystify Transformers & Convolutions in Modern Image Deep Networks [80.16624587948368]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach, termed the spatial token mixer (STM).
Various STMs are integrated into a unified framework for comprehensive comparative analysis.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Weak Augmentation Guided Relational Self-Supervised Learning [80.0680103295137]
We introduce a novel relational self-supervised learning (ReSSL) framework that learns representations by modeling the relationship between different instances.
Our proposed method employs a sharpened distribution of pairwise similarities among different instances as its relation metric (a minimal sketch follows this entry).
Experimental results show that ReSSL substantially outperforms state-of-the-art methods across different network architectures.
arXiv Detail & Related papers (2022-03-16T16:14:19Z)
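A minimal sketch of that relational objective, assuming a memory queue of anchor embeddings and a sharper temperature for the weakly augmented (teacher) view; the temperatures and queue mechanics below are common defaults, not necessarily the paper's exact settings.
```python
import torch
import torch.nn.functional as F

def ressl_relation_loss(z_weak, z_strong, queue, t_weak=0.04, t_strong=0.1):
    """Relational loss sketch: align the strong view's similarity
    distribution over queue anchors with the weak view's sharpened one."""
    z_weak = F.normalize(z_weak, dim=1)
    z_strong = F.normalize(z_strong, dim=1)
    queue = F.normalize(queue, dim=1)
    target = F.softmax(z_weak @ queue.T / t_weak, dim=1).detach()   # sharpened teacher relation
    log_pred = F.log_softmax(z_strong @ queue.T / t_strong, dim=1)  # student relation
    return -(target * log_pred).sum(dim=1).mean()                   # cross-entropy between relations

# Toy usage with random embeddings in place of encoder outputs.
loss = ressl_relation_loss(torch.randn(32, 128), torch.randn(32, 128),
                           torch.randn(4096, 128))
print(loss.item())
```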
- Why Do Self-Supervised Models Transfer? Investigating the Impact of Invariance on Downstream Tasks [79.13089902898848]
Self-supervised learning is a powerful paradigm for representation learning on unlabelled images.
We show that different tasks in computer vision require features to encode different (in)variances.
arXiv Detail & Related papers (2021-11-22T18:16:35Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that Vision Transformer (ViT) models can achieve comparable or even superior performance to convolutional neural networks on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- What is being transferred in transfer learning? [51.6991244438545]
We show that when training from pre-trained weights, the model stays in the same basin in the loss landscape, and different instances of such a model are similar in feature space and close in parameter space (a loss-barrier sketch follows this entry).
arXiv Detail & Related papers (2020-08-26T17:23:40Z)
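The same-basin observation is commonly probed by evaluating the loss along the straight line between two checkpoints (linear mode connectivity). A hedged sketch of such a loss-barrier probe, assuming two PyTorch models of identical architecture and a standard classification loader; the barrier definition and step count are illustrative choices, not the paper's protocol.
```python
import copy
import torch

@torch.no_grad()
def loss_barrier(model_a, model_b, loss_fn, loader, device, steps=11):
    """Max loss along the linear path between two models, minus the worse
    endpoint. A small barrier suggests both models share a loss basin."""
    probe = copy.deepcopy(model_a).to(device).eval()
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        # Interpolate parameters; buffers (e.g. BN stats) are left as model_a's.
        for p, pa, pb in zip(probe.parameters(), model_a.parameters(),
                             model_b.parameters()):
            p.copy_((1 - alpha) * pa.to(device) + alpha * pb.to(device))
        total, n = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(probe(x), y).item() * x.shape[0]
            n += x.shape[0]
        losses.append(total / n)
    return max(losses) - max(losses[0], losses[-1])
```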
This list is automatically generated from the titles and abstracts of the papers on this site.