Dynamic Token Normalization Improves Vision Transformer
- URL: http://arxiv.org/abs/2112.02624v1
- Date: Sun, 5 Dec 2021 17:04:59 GMT
- Title: Dynamic Token Normalization Improves Vision Transformer
- Authors: Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying
Shan, Ping Luo
- Abstract summary: We propose a new normalizer, termed Dynamic Token Normalization (DTN).
DTN learns to normalize tokens in both intra-token and inter-token manners.
It consistently outperforms the baseline model with minimal extra parameters and computational overhead.
- Score: 48.63155906080236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved
great success in various computer vision tasks, owing to their capability to
learn long-range contextual information. Layer Normalization (LN) is an
essential ingredient in these models. However, we found that the ordinary LN
makes tokens at different positions similar in magnitude because it normalizes
embeddings within each token. It is difficult for Transformers to capture
inductive bias such as the positional context in an image with LN. We tackle
this problem by proposing a new normalizer, termed Dynamic Token Normalization
(DTN), where normalization is performed both within each token (intra-token)
and across different tokens (inter-token). DTN has several merits. Firstly, it
is built on a unified formulation and thus can represent various existing
normalization methods. Secondly, DTN learns to normalize tokens in both
intra-token and inter-token manners, enabling Transformers to capture both the
global contextual information and the local positional context. Thirdly, by
simply replacing LN layers, DTN can be readily plugged into various vision
transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird and Reformer.
Extensive experiments show that the transformer equipped with DTN consistently
outperforms the baseline model with minimal extra parameters and computational
overhead. For example, DTN outperforms LN by $0.5\%$ - $1.2\%$ top-1 accuracy
on ImageNet, by $1.2$ - $1.4$ box AP in object detection on the COCO benchmark, by
$2.3\%$ - $3.9\%$ mCE in robustness experiments on ImageNet-C, and by $0.5\%$ -
$0.8\%$ accuracy in Long ListOps on Long-Range Arena. Codes will be made
public at \url{https://github.com/wqshao126/DTN}
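For intuition, here is a minimal sketch of the idea described above, not the authors' released code. A standard normalizer computes $\hat{x} = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$; LN takes the statistics $(\mu, \sigma^2)$ across the channels of each token, while an inter-token variant takes them across tokens for each channel. The sketch blends the two with a learnable per-head weight. The class name `SimpleDTN` and the scalar blend weight `lam` are illustrative assumptions; the paper's full DTN additionally learns position-aware weightings over tokens, which this sketch omits.

```python
import torch
import torch.nn as nn


class SimpleDTN(nn.Module):
    """Simplified sketch of Dynamic Token Normalization.

    Blends intra-token (LayerNorm-style) statistics with inter-token
    statistics via a learnable per-head weight `lam`. This is a toy
    approximation, not the paper's full position-aware formulation.
    """

    def __init__(self, dim: int, num_heads: int = 1, eps: float = 1e-5):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        # lam -> 1 recovers LN-like behavior; lam -> 0 uses inter-token stats.
        self.lam = nn.Parameter(torch.full((num_heads,), 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, t, c = x.shape
        h = self.num_heads
        xh = x.view(b, t, h, c // h)

        # Intra-token statistics: over the channels of each token (as in LN).
        mu_intra = xh.mean(dim=3, keepdim=True)
        var_intra = xh.var(dim=3, unbiased=False, keepdim=True)

        # Inter-token statistics: over all tokens for each channel.
        mu_inter = xh.mean(dim=1, keepdim=True)
        var_inter = xh.var(dim=1, unbiased=False, keepdim=True)

        # Convex blend of the two sets of statistics, per head.
        lam = self.lam.view(1, 1, h, 1)
        mu = lam * mu_intra + (1.0 - lam) * mu_inter
        var = lam * var_intra + (1.0 - lam) * var_inter

        xh = (xh - mu) / torch.sqrt(var + self.eps)
        return xh.reshape(b, t, c) * self.gamma + self.beta
```

Under these assumptions, dropping `SimpleDTN(dim, num_heads)` in place of `nn.LayerNorm(dim)` in a ViT block is all that changes; with `lam` fixed at 1 the layer reduces to plain LN, which illustrates how a single unified formulation can represent existing normalizers.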
Related papers
- Role of Locality and Weight Sharing in Image-Based Tasks: A Sample Complexity Separation between CNNs, LCNs, and FCNs [42.551773746803946]
Vision tasks are characterized by the properties of locality and translation invariance.
The superior performance of convolutional neural networks (CNNs) on these tasks is widely attributed to the inductive bias of locality and weight sharing baked into their architecture.
Existing attempts to quantify the statistical benefits of these biases in CNNs over locally connected neural networks (LCNs) and fully connected neural networks (FCNs) fall into one of a few categories.
arXiv Detail & Related papers (2024-03-23T03:57:28Z) - Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token
Migration [138.24994198567794]
iTPN comes with two elaborate designs: 1) the first pre-trained feature pyramid built upon a vision transformer (ViT).
Fast-iTPN can accelerate the inference procedure by up to 70%, with negligible performance loss.
arXiv Detail & Related papers (2022-11-23T06:56:12Z) - Iwin: Human-Object Interaction Detection via Transformer with Irregular
Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z) - Global Interaction Modelling in Vision Transformer via Super Tokens [20.700750237972155]
Window-based local attention is one of the major techniques being adopted in recent works.
We present a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention.
In standard image classification on ImageNet-1K, the proposed Super-token-based transformer (STT-S25) achieves 83.5% accuracy.
arXiv Detail & Related papers (2021-11-25T16:22:57Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Person Re-Identification with a Locally Aware Transformer [9.023847175654602]
We propose a novel Locally Aware Transformer (LA-Transformer) that employs a Parts-based Convolution Baseline (PCB)-inspired strategy for aggregating globally enhanced local classification tokens.
LA-Transformer with blockwise fine-tuning achieves rank-1 accuracy of $98.27\%$ with a standard deviation of $0.13$ on Market-1501 and $98.7\%$ with a standard deviation of $0.1$ on CUHK03, respectively.
arXiv Detail & Related papers (2021-06-07T15:31:19Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet [128.96032932640364]
We propose a new Tokens-to-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.