Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields
- URL: http://arxiv.org/abs/2305.04722v1
- Date: Mon, 8 May 2023 14:12:25 GMT
- Title: Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields
- Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim
- Abstract summary: Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks.
We propose explicitly adding a Gaussian attention bias that guides the positional embedding, from the beginning of training, toward the pattern it would otherwise learn.
The results showed that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets.
- Score: 7.58745191859815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) that model an image as a sequence of partitioned
patches have shown notable performance in diverse vision tasks. Because
partitioning patches eliminates the image structure, to reflect the order of
patches, ViTs utilize an explicit component called positional embedding.
However, we claim that the use of positional embedding does not by itself
guarantee the order-awareness of a ViT. To support this claim, we analyze the
actual behavior of ViTs using an effective receptive field. We demonstrate that
during training, a ViT acquires an understanding of patch order from the
positional embedding, which is trained to form a specific pattern. Based on this
observation, we propose explicitly adding a Gaussian attention bias that guides
the positional embedding to have the corresponding pattern from the beginning
of training. We evaluated the influence of Gaussian attention bias on the
performance of ViTs in several image classification, object detection, and
semantic segmentation experiments. The results showed that the proposed method
not only helps ViTs understand images but also boosts their performance on
various datasets, including ImageNet, COCO 2017, and ADE20K.
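The abstract does not spell out the mechanics of either the analysis or the proposed bias, so the two PyTorch sketches below are illustrative only. First, the effective receptive field analysis can be approximated by back-propagating from a single output token to the input image and inspecting the gradient magnitude; the function below is a generic version of that idea (the token index, the reduction, and the assumption that the model returns final token embeddings are ours, not the paper's exact procedure).

import torch

def effective_receptive_field(model, image, token_index=0):
    # image: (1, 3, H, W); model is assumed to return token embeddings (1, T, dim).
    image = image.clone().requires_grad_(True)
    tokens = model(image)
    # Scalar objective: the sum of one output token's features.
    tokens[0, token_index].sum().backward()
    # Per-pixel gradient magnitude shows which input pixels influence that token.
    return image.grad.abs().sum(dim=1).squeeze(0)  # (H, W)

Second, a minimal sketch of a Gaussian attention bias, assuming it takes the form of a per-head learnable Gaussian over patch-grid distances added to the pre-softmax attention logits; the class and parameter names (GaussianBiasedAttention, init_sigma) and the choice to leave the [CLS] token unbiased are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class GaussianBiasedAttention(nn.Module):
    # Multi-head self-attention with an additive Gaussian attention bias.
    # Assumed form: bias_h(i, j) = -||p_i - p_j||^2 / (2 * sigma_h^2), where p_i is
    # the 2D grid coordinate of patch i and sigma_h is a learnable width per head.
    def __init__(self, dim, num_heads, grid_size, init_sigma=2.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Pairwise squared distances between patch-grid coordinates.
        ys, xs = torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        )
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (N, 2)
        self.register_buffer("sq_dist", torch.cdist(coords, coords) ** 2)  # (N, N)
        # One learnable Gaussian width per head.
        self.log_sigma = nn.Parameter(torch.full((num_heads,), init_sigma).log())

    def forward(self, x):
        # x: (B, 1 + N, dim), a [CLS] token followed by N patch tokens.
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (B, H, T, d)
        logits = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, T, T)
        # Gaussian bias: ~0 for nearby patches, strongly negative for distant ones.
        sigma = self.log_sigma.exp().view(self.num_heads, 1, 1)
        bias = logits.new_zeros(self.num_heads, T, T)
        bias[:, 1:, 1:] = -self.sq_dist / (2 * sigma ** 2)  # [CLS] rows/cols unbiased
        attn = (logits + bias).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

For a ViT-B/16 block at 224x224 resolution, this would be instantiated with dim=768, num_heads=12, and grid_size=14 (196 patch tokens plus one [CLS] token).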
Related papers
- Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers [5.359378066251386]
Self-supervised learning with vision transformers (ViTs) has proven effective for representation learning.
Existing ViT-based SSL architectures do not fully exploit the ViT backbone.
We introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively.
arXiv Detail & Related papers (2024-06-18T06:36:44Z)
- Interpretability-Aware Vision Transformer [13.310757078491916]
Vision Transformers (ViTs) have become prominent models for solving various vision tasks.
We introduce a novel training procedure that inherently enhances model interpretability.
IA-ViT (Interpretability-Aware ViT) is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective.
arXiv Detail & Related papers (2023-09-14T21:50:49Z)
- UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the vision Transformer suitable for the consistency representation learning.
arXiv Detail & Related papers (2022-10-23T15:24:47Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Position Labels for Self-Supervised Vision Transformer [1.3406858660972554]
Position encoding is important for vision transformer (ViT) to capture the spatial structure of the input image.
We propose two position labels dedicated to 2D images including absolute position and relative position.
Our position labels can be easily plugged into a transformer and combined with various current ViT variants.
arXiv Detail & Related papers (2022-06-10T10:29:20Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks.
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.