MaiT: Leverage Attention Masks for More Efficient Image Transformers
- URL: http://arxiv.org/abs/2207.03006v1
- Date: Wed, 6 Jul 2022 22:42:34 GMT
- Title: MaiT: Leverage Attention Masks for More Efficient Image Transformers
- Authors: Ling Li, Ali Shafiee Ardestani, Joseph Hassoun
- Abstract summary: With Masked attention image Transformer - MaiT, top-1 accuracy increases by up to 1.7% compared to CaiT with fewer parameters and FLOPs, and the throughput improves by up to 1.5X compared to Swin.
- Score: 4.400421753565953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though image transformers have shown competitive results with convolutional
neural networks in computer vision tasks, their lack of inductive biases such as
locality still poses problems for model efficiency, especially in embedded
applications. In this work, we address this issue by introducing attention masks
that incorporate spatial locality into self-attention heads. Local dependencies
are captured efficiently by masked attention heads, while global dependencies are
captured by unmasked attention heads. With the Masked attention image Transformer
(MaiT), top-1 accuracy increases by up to 1.7% compared to CaiT with fewer
parameters and FLOPs, and throughput improves by up to 1.5X compared to Swin.
Encoding locality with attention masks is model-agnostic, and thus it applies to
monolithic, hierarchical, or other novel transformer architectures.
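Below is a minimal PyTorch sketch of the idea described in the abstract: a fixed local-window mask is added to the attention scores of some heads while the remaining heads stay global. The 14x14 patch grid, Chebyshev-distance window, head split, and class names are illustrative assumptions, not the authors' implementation (which, for instance, also has to handle the class token).

```python
# Minimal sketch of masked + unmasked attention heads (illustrative assumptions, not the authors' code).
import torch
from torch import nn

def local_mask(grid: int, window: int) -> torch.Tensor:
    """Additive mask: 0 within a (2*window+1)-wide neighborhood, -inf elsewhere."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid), torch.arange(grid), indexing="ij"), dim=-1).reshape(-1, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)  # Chebyshev distance between patches
    mask = torch.full(dist.shape, float("-inf"))
    mask[dist <= window] = 0.0
    return mask                                                      # (N, N), N = grid * grid

class MaskedHeadAttention(nn.Module):
    """Some heads attend locally through an additive mask; the rest stay global."""
    def __init__(self, dim=192, heads=6, masked_heads=3, grid=14, window=2):
        super().__init__()
        self.h, self.hd = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        head_mask = torch.zeros(heads, grid * grid, grid * grid)
        head_mask[:masked_heads] = local_mask(grid, window)          # local heads get the window mask
        self.register_buffer("head_mask", head_mask)                 # global heads keep a zero mask

    def forward(self, x):                                            # x: (B, N, dim) patch tokens, no class token
        B, N, _ = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.hd).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.hd ** 0.5            # (B, heads, N, N)
        attn = (attn + self.head_mask).softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, -1))

x = torch.randn(2, 14 * 14, 192)
print(MaskedHeadAttention()(x).shape)                                # torch.Size([2, 196, 192])
```

Because the mask is a fixed additive buffer applied per head, the same pattern can be dropped into monolithic or hierarchical backbones, which is the model-agnostic property the abstract claims.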
Related papers
- StableMask: Refining Causal Masking in Decoder-only Transformer [22.75632485195928]
The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling.
However, it requires all attention scores to be non-zero and sum up to 1, even if the current embedding has sufficient self-contained information.
We propose StableMask: a parameter-free method to address both limitations by refining the causal mask.
arXiv Detail & Related papers (2024-02-07T12:01:02Z)
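To make the limitation described in the StableMask entry above concrete, here is a minimal sketch of the standard causal masking that it refines (not StableMask itself): with the usual mask plus softmax, every visible score is strictly positive and every row of weights sums to 1, whether or not the token needs any context.

```python
# Standard causal masking (the baseline the StableMask entry refines); illustrative only.
import torch

def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """scores: (T, T) raw query-key scores; returns standard causally masked attention weights."""
    T = scores.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # positions after each query
    return scores.masked_fill(future, float("-inf")).softmax(dim=-1)

w = causal_attention_weights(torch.randn(5, 5))
# Every visible weight is strictly positive and every row sums to exactly 1:
# each token must spend all of its attention on earlier positions, even when
# its own embedding already carries the information it needs.
print(w.sum(dim=-1))   # all ones
```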
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT)
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9X speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
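As a companion to the BOAT entry above, this is a generic sketch of the image-space local attention it builds on: self-attention computed only inside non-overlapping windows, in the Swin style. BOAT's feature-space local attention, the query/key/value projections, and multi-head splitting are deliberately omitted; shapes and the window size are illustrative assumptions.

```python
# Generic image-space window attention (the "local windows" mentioned above); not BOAT's code.
import torch

def window_self_attention(x: torch.Tensor, window: int = 7) -> torch.Tensor:
    """x: (B, H, W, C) patch features; self-attention is computed independently inside each window."""
    B, H, W, C = x.shape
    # Partition into non-overlapping (window x window) tiles: (B * num_windows, window*window, C).
    t = x.reshape(B, H // window, window, W // window, window, C)
    t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    attn = (t @ t.transpose(-2, -1)) / C ** 0.5     # scores exist only between tokens of the same window
    t = attn.softmax(dim=-1) @ t                    # q = k = v = raw features, to keep the sketch short
    # Undo the partition back to (B, H, W, C).
    t = t.reshape(B, H // window, W // window, window, window, C)
    return t.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

out = window_self_attention(torch.randn(2, 14, 14, 96), window=7)
print(out.shape)   # torch.Size([2, 14, 14, 96])
```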
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [77.44854719772702]
Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision.
In this work, we first propose a novel pure transformer-based mask vision transformer (MViT) for FER in the wild.
Our MViT outperforms state-of-the-art methods on RAF-DB (88.62%), FERPlus (89.22%), and AffectNet-7 (64.57%), and achieves a comparable result on AffectNet-8 (61.40%).
arXiv Detail & Related papers (2021-06-08T16:58:10Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full attention computation with softmax weighting and keeps only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
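For the image-matching entry above, here is a minimal sketch of what keeping only the query-key similarity, without softmax weighting, can look like; the max-then-mean pooling into a single matching score is an illustrative assumption, not the paper's exact decoder.

```python
# Query-key similarity matching without softmax weighting (pooling is an assumed illustration).
import torch

def matching_score(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (N, C) patch features from two images; returns a scalar matching score."""
    sim = feat_a @ feat_b.t()          # (N, N) raw query-key similarities, no softmax weighting
    return sim.amax(dim=1).mean()      # best match for each query patch, averaged over queries

a, b = torch.randn(196, 256), torch.randn(196, 256)
print(matching_score(a, b))
```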
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)