What do Vision Transformers Learn? A Visual Exploration
- URL: http://arxiv.org/abs/2212.06727v1
- Date: Tue, 13 Dec 2022 16:55:12 GMT
- Title: What do Vision Transformers Learn? A Visual Exploration
- Authors: Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu,
Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein
- Abstract summary: Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
- Score: 68.50771218442776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) are quickly becoming the de-facto architecture for
computer vision, yet we understand very little about why they work and what
they learn. While existing studies visually analyze the mechanisms of
convolutional neural networks, an analogous exploration of ViTs remains
challenging. In this paper, we first address the obstacles to performing
visualizations on ViTs. Assisted by these solutions, we observe that neurons in
ViTs trained with language model supervision (e.g., CLIP) are activated by
semantic concepts rather than visual features. We also explore the underlying
differences between ViTs and CNNs, and we find that transformers detect image
background features, just like their convolutional counterparts, but their
predictions depend far less on high-frequency information. On the other hand,
both architecture types behave similarly in the way features progress from
abstract patterns in early layers to concrete objects in late layers. In
addition, we show that ViTs maintain spatial information in all layers except
the final layer. In contrast to previous works, we show that the last layer
most likely discards the spatial information and behaves as a learned global
pooling operation. Finally, we conduct large-scale visualizations on a wide
range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to
validate the effectiveness of our method.
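The abstract's claim that the final layer discards spatial information and acts like a learned global pooling can be probed with a few lines of code. The sketch below is not the authors' code: it assumes a timm ViT whose transformer blocks are exposed as `model.blocks`, hooks every block, and reports the mean pairwise cosine similarity among patch tokens at each depth; the metric is an illustrative choice, not the paper's.

```python
# Minimal sketch (assumptions: timm ViT with a CLS token and blocks exposed
# as model.blocks; similarity metric chosen for illustration only).
import torch
import torch.nn.functional as F
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

layer_outputs = []  # token sequences captured after every transformer block

def capture(_module, _inputs, output):
    layer_outputs.append(output.detach())

hooks = [blk.register_forward_hook(capture) for blk in model.blocks]

x = torch.randn(1, 3, 224, 224)  # stand-in for a real, normalized image batch
with torch.no_grad():
    model(x)

for depth, tokens in enumerate(layer_outputs):
    patches = tokens[:, 1:, :]               # drop the CLS token
    patches = F.normalize(patches, dim=-1)   # unit-norm for cosine similarity
    sim = patches @ patches.transpose(1, 2)  # (B, N, N) pairwise similarities
    print(f"block {depth:2d}: mean patch-token similarity {sim.mean().item():.3f}")

for h in hooks:
    h.remove()
```

If the measured similarity stays low through the intermediate blocks and spikes only at the final one, that pattern would be consistent with the abstract's claim that the last layer behaves as a learned global pooling over patch tokens.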
Related papers
- Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings.
However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z) - Vision Transformer Visualization: What Neurons Tell and How Neurons Behave? [33.87454837848252]
We propose an effective visualization technique to expose the information carried in neurons and feature embeddings across vision transformers (ViTs).
Our approach departs from the computational process of ViTs, focusing on visualizing the local and global information in input images and the latent feature embeddings at multiple levels.
Next, we develop a rigorous framework to perform effective visualizations across layers, exposing the effects of ViT filters and their grouping/clustering behaviors over object patches.
arXiv Detail & Related papers (2022-10-14T08:56:24Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z) - Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work [1.6317061277457001]
Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs).
As a demanding technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships.
We thoroughly compare the performance of various ViT algorithms and most representative CNN methods on popular benchmark datasets.
arXiv Detail & Related papers (2022-03-03T06:17:03Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
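As a rough illustration of the Re-attention idea in the DeepViT entry above, the sketch below mixes each head's attention map with the other heads' maps through a small learnable matrix before applying it to the values. This is an interpretation of the abstract's description, not DeepViT's exact module; the class name, shapes, and the absence of an explicit normalization step are assumptions.

```python
# Hypothetical sketch of head-mixing "re-attention": attention maps are
# re-generated as learnable combinations of all heads' maps to increase
# their diversity in deeper blocks.  Not the published DeepViT module.
import torch
import torch.nn as nn


class ReAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable head-mixing matrix (assumed shape H x H), initialized to
        # the identity so training starts from ordinary multi-head attention.
        self.head_mix = nn.Parameter(torch.eye(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Re-generate each head's attention map as a mix of all heads' maps.
        attn = torch.einsum("hg,bgnm->bhnm", self.head_mix, attn)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```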