OmniNet: Omnidirectional Representations from Transformers
- URL: http://arxiv.org/abs/2103.01075v1
- Date: Mon, 1 Mar 2021 15:31:54 GMT
- Title: OmniNet: Omnidirectional Representations from Transformers
- Authors: Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen
Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler
- Abstract summary: This paper proposes Omnidirectional Representations from Transformers (OmniNet).
In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network.
Experiments are conducted on autoregressive language modeling, Machine Translation, Long Range Arena (LRA), and Image Recognition.
- Score: 49.23834374054286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes Omnidirectional Representations from Transformers
(OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive
field, each token is allowed to attend to all tokens in the entire network.
This process can also be interpreted as a form of extreme or intensive
attention mechanism that has the receptive field of the entire width and depth
of the network. To this end, the omnidirectional attention is learned via a
meta-learner, which is essentially another self-attention based model. In order
to mitigate the computationally expensive costs of full receptive field
attention, we leverage efficient self-attention models such as kernel-based
(Choromanski et al.), low-rank attention (Wang et al.) and/or Big Bird (Zaheer
et al.) as the meta-learner. Extensive experiments are conducted on
autoregressive language modeling (LM1B, C4), Machine Translation, Long Range
Arena (LRA), and Image Recognition. The experiments show that OmniNet achieves
considerable improvements across these tasks, including achieving
state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr, and Long Range Arena.
Moreover, using omnidirectional representation in Vision Transformers leads to
significant improvements on image recognition tasks on both few-shot learning
and fine-tuning setups.
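Below is a minimal sketch of the idea described in the abstract, assuming a Performer-style kernel attention as the efficient meta-learner: hidden states from every layer are flattened into one long sequence spanning the full width and depth of the network, and a lightweight attention module lets the final-layer tokens attend over all of them. The names (OmniAttention, kernel_attention), the choice of final-layer queries, and the residual fusion are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of omnidirectional attention: a cheap "meta-learner" attention over
# hidden states from ALL layers (full width x depth). Kernel-based attention
# here stands in for the efficient meta-learner; names are hypothetical.
import torch
import torch.nn as nn


def kernel_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention with an elu(x)+1 feature map (kernel trick)."""
    q = torch.nn.functional.elu(q) + 1.0            # (B, N, D)
    k = torch.nn.functional.elu(k) + 1.0            # (B, M, D)
    kv = torch.einsum("bmd,bme->bde", k, v)         # sum_j phi(k_j) v_j^T -> (B, D, D)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # per-query normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)


class OmniAttention(nn.Module):
    """Meta-learner letting each token attend to tokens from all layers."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, layer_states):
        # layer_states: list of (B, N, D) hidden states, one per transformer layer
        top = layer_states[-1]                  # queries taken from the final layer
        allx = torch.cat(layer_states, dim=1)   # (B, L*N, D): full width x depth
        ctx = kernel_attention(self.q(top), self.k(allx), self.v(allx))
        return top + self.out(ctx)              # fuse omnidirectional context back in


if __name__ == "__main__":
    B, N, D, L = 2, 16, 64, 6
    states = [torch.randn(B, N, D) for _ in range(L)]
    print(OmniAttention(D)(states).shape)       # torch.Size([2, 16, 64])
```

The kernel feature map keeps the cost linear in the L*N attended tokens, which is what makes attending over the whole network tractable in this sketch.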
Related papers
- You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z)
- Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves state-of-the-art results, with 87.1% accuracy on ImageNet-1k and 35.8% mCE on ImageNet-C, using 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z)
- Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor, one that has been explored extensively by approaches based on convolutional neural networks (CNNs).
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- Vision Transformers with Hierarchical Attention [61.16912607330001]
This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vision transformers.
We propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion.
We build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net.
arXiv Detail & Related papers (2021-06-06T17:01:13Z)
- KVT: k-NN Attention for Boosting Vision Transformers [44.189475770152185]
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations.
We verify, both theoretically and empirically, that k-NN attention is powerful in distilling noise from input tokens and in speeding up training; a minimal sketch of the top-k masking idea follows after this list.
arXiv Detail & Related papers (2021-05-28T06:49:10Z)
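As referenced in the KVT entry above, here is a minimal sketch of the top-k idea behind k-NN attention: each query keeps only its k most similar keys before the softmax, which sparsifies the attention map without introducing any convolution. The function name, signature, and default k are assumptions for illustration, not the authors' implementation.

```python
# Sketch of k-NN (top-k) attention: mask out all but the k largest scores per
# query row, then apply softmax as usual. Names and defaults are hypothetical.
import torch


def knn_attention(q, k, v, top_k=8):
    """q, k, v: (B, N, D). Keep only the top_k scores per query before softmax."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bnd,bmd->bnm", q, k) * scale   # (B, N, N) similarity scores
    topk_vals, _ = scores.topk(top_k, dim=-1)             # (B, N, top_k), sorted descending
    threshold = topk_vals[..., -1:]                        # k-th largest score per row
    scores = scores.masked_fill(scores < threshold, float("-inf"))
    attn = scores.softmax(dim=-1)                          # sparse attention weights
    return torch.einsum("bnm,bmd->bnd", attn, v)


if __name__ == "__main__":
    B, N, D = 2, 32, 64
    q, k, v = (torch.randn(B, N, D) for _ in range(3))
    print(knn_attention(q, k, v).shape)                    # torch.Size([2, 32, 64])
```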
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including any of the information it contains) and is not responsible for any consequences of its use.