Teaching Matters: Investigating the Role of Supervision in Vision
Transformers
- URL: http://arxiv.org/abs/2212.03862v2
- Date: Wed, 5 Apr 2023 18:14:23 GMT
- Authors: Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava
- Abstract summary: We show that Vision Transformers (ViTs) learn a diverse range of behaviors in terms of their attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method.
- Score: 32.79398665600664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have gained significant popularity in recent years
and have proliferated into many applications. However, their behavior under
different learning paradigms is not well explored. We compare ViTs trained
through different methods of supervision, and show that they learn a diverse
range of behaviors in terms of their attention, representations, and downstream
performance. We also discover ViT behaviors that are consistent across
supervision, including the emergence of Offset Local Attention Heads. These are
self-attention heads that attend to a token adjacent to the current token with
a fixed directional offset, a phenomenon that to the best of our knowledge has
not been highlighted in any prior work. Our analysis shows that ViTs are highly
flexible and learn to process local and global information in different orders
depending on their training method. We find that contrastive self-supervised
methods learn features that are competitive with explicitly supervised
features, and they can even be superior for part-level tasks. We also find that
the representations of reconstruction-based models show non-trivial similarity
to contrastive self-supervised models. Project website
(https://www.cs.umd.edu/~sakshams/vit_analysis) and code
(https://www.github.com/mwalmer-umd/vit_analysis) are publicly available.
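The Offset Local Attention Heads described in the abstract can be illustrated with a toy example. The sketch below is not the authors' analysis code. It assumes a square patch grid with no CLS token, and the helper names (offset_attention_map, mean_attention_offset) are hypothetical. It builds a synthetic attention matrix in which every token attends to the neighbor at a fixed (row, column) offset, then recovers that offset as the attention-weighted mean displacement from query to key.

```python
def offset_attention_map(grid, offset):
    """Synthetic attention for a grid x grid patch layout (hypothetical helper).

    Each query token puts all of its attention on the token at the given
    (d_row, d_col) offset. Tokens whose neighbor would fall outside the
    grid attend to themselves instead.
    """
    d_row, d_col = offset
    n = grid * grid
    attn = [[0.0] * n for _ in range(n)]
    for r in range(grid):
        for c in range(grid):
            kr, kc = r + d_row, c + d_col
            if not (0 <= kr < grid and 0 <= kc < grid):
                kr, kc = r, c  # border token: fall back to self-attention
            attn[r * grid + c][kr * grid + kc] = 1.0
    return attn


def mean_attention_offset(attn, grid):
    """Attention-weighted mean (d_row, d_col) from query to key (hypothetical helper).

    A head with a consistent directional offset yields a clearly non-zero
    mean displacement; a uniform head yields roughly (0, 0).
    """
    n = grid * grid
    sum_dr = 0.0
    sum_dc = 0.0
    for q in range(n):
        qr, qc = divmod(q, grid)
        for k in range(n):
            kr, kc = divmod(k, grid)
            weight = attn[q][k]
            sum_dr += weight * (kr - qr)
            sum_dc += weight * (kc - qc)
    return sum_dr / n, sum_dc / n


if __name__ == "__main__":
    # Head that looks one patch to the right on a 4 x 4 grid.
    right_head = offset_attention_map(4, (0, 1))
    print(mean_attention_offset(right_head, 4))
```

On a 4 x 4 grid the "one patch to the right" head gives a mean column displacement below 1 because the four tokens in the last column have no right-hand neighbor. Applying the same measurement to a real ViT would require extracting per-head attention maps and dropping the CLS row and column first, which this sketch does not cover.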
Related papers
- On the Surprising Effectiveness of Attention Transfer for Vision Transformers [118.83572030360843]
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.
We investigate this question and find that the features and representations learned during pre-training are not essential.
arXiv Detail & Related papers (2024-11-14T18:59:40Z)
- Interpretability-Aware Vision Transformer [13.310757078491916]
Vision Transformers (ViTs) have become prominent models for solving various vision tasks.
We introduce a novel training procedure that inherently enhances model interpretability.
IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective.
arXiv Detail & Related papers (2023-09-14T21:50:49Z)
- Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts [39.080610060557476]
We study the behavior of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift.
Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods.
arXiv Detail & Related papers (2023-06-03T06:10:20Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other ViT-based few-shot learning frameworks and is the first to outperform state-of-the-art CNN methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve performance comparable or even superior to that of convolutional networks on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- An Empirical Study of Training Self-Supervised Visual Transformers [70.27107708555185]
We study the effects of several fundamental components for training self-supervised Visual Transformers.
We reveal that these results are in fact partial failures, and that they can be improved when training is made more stable.
arXiv Detail & Related papers (2021-04-05T17:59:40Z)