Do We Really Need Explicit Position Encodings for Vision Transformers?
- URL: http://arxiv.org/abs/2102.10882v1
- Date: Mon, 22 Feb 2021 10:29:55 GMT
- Title: Do We Really Need Explicit Position Encodings for Vision Transformers?
- Authors: Xiangxiang Chu and Bo Zhang and Zhi Tian and Xiaolin Wei and Huaxia
Xia
- Abstract summary: We propose a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token.
Our new model with PEG is named Conditional Position encoding Visual Transformer (CPVT) and can naturally process input sequences of arbitrary length.
We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings.
- Score: 29.7662570764424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Almost all visual transformers such as ViT or DeiT rely on predefined
positional encodings to incorporate the order of each input token. These
encodings are often implemented as learnable fixed-dimension vectors or
sinusoidal functions of different frequencies, neither of which can
accommodate variable-length input sequences. This inevitably limits the wider
application of transformers in vision, where many tasks require changing the
input size on-the-fly.
In this paper, we propose to employ a conditional position encoding scheme,
which is conditioned on the local neighborhood of the input token. It is
effortlessly implemented as what we call Position Encoding Generator (PEG),
which can be seamlessly incorporated into the current transformer framework.
Our new model with PEG is named Conditional Position encoding Visual
Transformer (CPVT) and can naturally process input sequences of arbitrary
length. We demonstrate that CPVT can result in visually similar attention maps
and even better performance than those with predefined positional encodings. We
obtain state-of-the-art results on the ImageNet classification task compared
with visual Transformers to date. Our code will be made available at
https://github.com/Meituan-AutoML/CPVT .
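The abstract does not spell out how the Position Encoding Generator (PEG) is implemented, so the following is only a minimal sketch of the idea under an assumed design: positional information is produced on the fly from each token's local spatial neighborhood with a depthwise convolution and added back to the tokens, so no fixed-length positional table is needed and any input resolution works. The class name, kernel size, and residual addition are illustrative assumptions, not necessarily the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PEGSketch(nn.Module):
    """Sketch of a conditional position encoding generator: positional
    information is derived from each token's local spatial neighborhood
    (here a depthwise 3x3 convolution) rather than a fixed-length learned
    table, so any input resolution can be handled."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv: one filter per channel; zero padding keeps the grid size.
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, height: int, width: int) -> torch.Tensor:
        # tokens: (batch, height*width, dim) patch embeddings, no class token.
        b, n, c = tokens.shape
        assert n == height * width
        grid = tokens.transpose(1, 2).reshape(b, c, height, width)
        # Conditional positional encoding added as a residual.
        return tokens + self.proj(grid).flatten(2).transpose(1, 2)

# Works for any spatial size without retraining a positional table.
x = torch.randn(2, 14 * 14, 192)
peg = PEGSketch(192)
print(peg(x, 14, 14).shape)   # torch.Size([2, 196, 192])
```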
Related papers
- Comparing Graph Transformers via Positional Encodings [11.5844121984212]
The distinguishing power of graph transformers is closely tied to the choice of positional encoding.
There are two primary types of positional encoding: absolute positional encodings (APEs) and relative positional encodings (RPEs)
We show that graph transformers using APEs and RPEs are equivalent in terms of distinguishing power.
arXiv Detail & Related papers (2024-02-22T01:07:48Z)
- Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z)
- SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr)
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
arXiv Detail & Related papers (2022-03-17T19:48:43Z)
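As a rough illustration of the separable attention described in the SepTr summary above, the sketch below applies one transformer block across time within each frequency bin and a second across frequency within each time frame. The token layout, the use of stock PyTorch encoder layers, and the hyperparameters are assumptions; the actual SepTr architecture may differ in details such as class tokens and parameter sharing.

```python
import torch
import torch.nn as nn

class SeparableBlocksSketch(nn.Module):
    """Sketch of SepTr-style separable attention over a spectrogram:
    one transformer block attends across time within each frequency bin,
    a second attends across frequency within each time frame."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.time_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.freq_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq_bins, time_frames, dim) token embeddings.
        b, f, t, d = x.shape
        # Attend along time: fold frequency bins into the batch dimension.
        x = self.time_block(x.reshape(b * f, t, d)).reshape(b, f, t, d)
        # Attend along frequency: fold time frames into the batch dimension.
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.freq_block(x).reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x

tokens = torch.randn(2, 64, 100, 128)               # 64 frequency bins, 100 frames
print(SeparableBlocksSketch(128)(tokens).shape)     # torch.Size([2, 64, 100, 128])
```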
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE)
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
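The kernelized-attention summary above rests on the fact that a relative-positional-bias matrix is Toeplitz, and a Toeplitz matrix-vector product can be computed in O(n log n) with the FFT via circulant embedding. The NumPy snippet below demonstrates only that identity, not the paper's full kernelized attention algorithm.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, v):
    """Multiply a Toeplitz matrix by a vector in O(n log n) by embedding it
    in a circulant matrix and diagonalizing that circulant with the FFT.
    first_col: first column (length n); first_row: first row (length n),
    with first_row[0] == first_col[0]."""
    n = len(v)
    # Circulant embedding: [first column, reversed first row without its first entry].
    c = np.concatenate([first_col, first_row[:0:-1]])
    v_pad = np.concatenate([v, np.zeros(n - 1)])
    # Circulant * vector = IFFT(FFT(c) * FFT(padded v)); keep the first n entries.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(v_pad))[:n].real

# Relative-position biases b[i - j] fill a Toeplitz matrix T with T[i, j] = b[i - j].
n = 6
bias = np.random.randn(2 * n - 1)                    # offsets -(n-1) .. n-1
T = np.array([[bias[i - j + n - 1] for j in range(n)] for i in range(n)])
v = np.random.randn(n)
fast = toeplitz_matvec_fft(T[:, 0], T[0, :], v)
print(np.allclose(T @ v, fast))                      # True
```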
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore encoder-decoder based Fully Transformer Networks (FTN), a novel framework for semantic image segmentation.
We propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, while reducing the computational complexity of the standard visual transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
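As a sketch of the general learnable-Fourier-feature recipe referenced above (a trainable linear projection of coordinates followed by sin/cos and a small MLP), consider the module below; the layer sizes, scaling, and MLP shape are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    """Sketch of a learnable Fourier-feature positional encoding for
    multi-dimensional coordinates: project positions with a trainable
    linear map, take sin/cos, then mix with a small MLP."""

    def __init__(self, pos_dim: int = 2, fourier_dim: int = 64, out_dim: int = 192):
        super().__init__()
        self.freqs = nn.Linear(pos_dim, fourier_dim // 2, bias=False)  # learnable frequencies
        self.mlp = nn.Sequential(nn.Linear(fourier_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))
        self.scale = fourier_dim ** -0.5

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (..., pos_dim), e.g. normalized (row, col) patch coordinates.
        proj = self.freqs(positions)
        features = self.scale * torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
        return self.mlp(features)

# 14x14 grid of 2-D patch coordinates -> one encoding per patch.
coords = torch.cartesian_prod(torch.arange(14.), torch.arange(14.)) / 14.0
print(LearnableFourierPE()(coords).shape)   # torch.Size([196, 192])
```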
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
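To illustrate the simplified decoder idea in the image-matching summary above, the sketch below scores an image pair from raw query-key similarities without softmax-weighted value mixing; the max-then-mean aggregation and the projection sizes are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class QueryKeyMatcherSketch(nn.Module):
    """Sketch of a simplified cross-image decoder: compute query-key
    similarities between two images' local features and aggregate them
    into a matching score, with no softmax weighting over values."""

    def __init__(self, dim: int = 256, proj_dim: int = 128):
        super().__init__()
        self.q = nn.Linear(dim, proj_dim)
        self.k = nn.Linear(dim, proj_dim)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (batch, num_locations, dim) local features of an image pair.
        sim = self.q(feats_a) @ self.k(feats_b).transpose(1, 2)   # (batch, Na, Nb)
        # For each query location keep its best match, then average -> one score per pair.
        return sim.max(dim=2).values.mean(dim=1)

a, b = torch.randn(4, 49, 256), torch.randn(4, 49, 256)
print(QueryKeyMatcherSketch()(a, b).shape)   # torch.Size([4])
```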
- Demystifying the Better Performance of Position Encoding Variants for Transformer [12.503079503907989]
We show how to encode position and segment into Transformer models.
The proposed method performs on par with SOTA on GLUE, XTREME and WMT benchmarks while saving costs.
arXiv Detail & Related papers (2021-04-18T03:44:57Z)