Efficient Attention-free Video Shift Transformers
- URL: http://arxiv.org/abs/2208.11108v1
- Date: Tue, 23 Aug 2022 17:48:29 GMT
- Title: Efficient Attention-free Video Shift Transformers
- Authors: Adrian Bulat and Brais Martinez and Georgios Tzimiropoulos
- Abstract summary: This paper tackles the problem of efficient video recognition.
Video transformers have recently dominated the efficiency (top-1 accuracy vs FLOPs) spectrum.
We extend our formulation in the video domain to construct Video Affine-Shift Transformer.
- Score: 56.87581500474093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles the problem of efficient video recognition. In this area,
video transformers have recently dominated the efficiency (top-1 accuracy vs
FLOPs) spectrum. At the same time, there have been some attempts in the image
domain which challenge the necessity of the self-attention operation within the
transformer architecture, advocating the use of simpler approaches for token
mixing. However, there are no results yet for the case of video recognition,
where the self-attention operator has a significantly higher impact (compared
to the case of images) on efficiency. To address this gap, in this paper, we
make the following contributions: (a) we construct a highly efficient &
accurate attention-free block based on the shift operator, coined Affine-Shift
block, specifically designed to approximate as closely as possible the
operations in the MHSA block of a Transformer layer. Based on our Affine-Shift
block, we construct our Affine-Shift Transformer and show that it already
outperforms all existing shift/MLP--based architectures for ImageNet
classification. (b) We extend our formulation in the video domain to construct
Video Affine-Shift Transformer (VAST), the very first purely attention-free
shift-based video transformer. (c) We show that VAST significantly outperforms
recent state-of-the-art transformers on the most popular action recognition
benchmarks for the case of models with low computational and memory footprint.
Code will be made available.
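The abstract does not spell out the Affine-Shift block itself, but the shift idea it builds on, replacing self-attention's token mixing with zero-FLOP spatial shifts of channel groups followed by pointwise projections, can be sketched as below. The module name, channel split, and shift pattern here are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch of shift-based token mixing (not the paper's exact
# Affine-Shift block): channel groups are spatially shifted at zero FLOPs,
# then mixed per token with a pointwise projection, standing in for MHSA.
import torch
import torch.nn as nn


class ShiftTokenMixer(nn.Module):
    """Hypothetical attention-free mixer for tokens on an H x W grid."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # pointwise mixing after the shift

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        y = self.norm(x).transpose(1, 2).reshape(b, c, h, w)
        g = c // 4  # four channel groups, each shifted in one direction
        s = torch.zeros_like(y)
        s[:, 0 * g:1 * g, :, 1:] = y[:, 0 * g:1 * g, :, :-1]        # shift right
        s[:, 1 * g:2 * g, :, :-1] = y[:, 1 * g:2 * g, :, 1:]        # shift left
        s[:, 2 * g:3 * g, 1:, :] = y[:, 2 * g:3 * g, :-1, :]        # shift down
        s[:, 3 * g:, :-1, :] = y[:, 3 * g:, 1:, :]                  # shift up
        s = s.reshape(b, c, n).transpose(1, 2)
        return x + self.proj(s)  # residual, as in a transformer block
```

In a transformer-style backbone, a block like this would occupy the slot of the MHSA sub-layer, with the usual MLP sub-layer left unchanged.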
Related papers
- Hierarchical Separable Video Transformer for Snapshot Compressive Imaging [46.23615648331571]
Hierarchical Separable Video Transformer (HiSViT) is a reconstruction architecture without temporal aggregation.
HiSViT is built by multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN)
Our method outperforms previous methods by $>0.5$ with comparable or fewer parameters and complexity.
arXiv Detail & Related papers (2024-07-16T17:35:59Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
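As a rough illustration of the reuse idea (the summary does not describe SkipAt's actual parametric function, so the cache-and-reuse scheme below is an assumption, not the paper's design), a layer can skip its own query/key computation and aggregate values with an attention map produced by an earlier layer:

```python
# Hedged sketch of attention reuse across layers (assumed mechanics): a layer
# skips its own Q/K projections and softmax, and instead aggregates its values
# with an attention map cached from a preceding layer.
import torch
import torch.nn as nn


class ReusedAttention(nn.Module):
    """Value-only layer consuming an attention map from a previous layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, cached_attn: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C); cached_attn: (B, N, N), softmax-normalized upstream
        v = self.value(x)
        return self.out(cached_attn @ v)  # no new Q/K projections or softmax
```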
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT)
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z)
- Token Shift Transformer for Video Classification [34.05954523287077]
Transformers achieve remarkable successes in understanding 1- and 2-dimensional signals.
Their encoders naturally contain computationally intensive operations such as pair-wise self-attention.
This paper presents Token Shift Module (i.e., TokShift) for modeling temporal relations within each transformer encoder.
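A temporal token shift of this kind can be sketched as below; this is a generic zero-parameter shift along the time axis (in the spirit of TSM-style shift modules), and the channel fraction and shift pattern are assumptions rather than TokShift's exact recipe.

```python
# Hedged sketch of a temporal token shift (generic zero-parameter shift along
# time, not necessarily TokShift's exact design): a fraction of channels is
# shifted forward/backward across frames so each token mixes with neighbors.
import torch


def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: (B, T, N, C) video tokens; returns the same shape, zero FLOPs."""
    b, t, n, c = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # shift forward in time
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # shift backward in time
    out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]              # rest stays in place
    return out
```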
arXiv Detail & Related papers (2021-08-05T08:04:54Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
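A minimal sketch of that softmax-free matching score follows; the projection sizes and the max-then-mean pooling are assumptions, and only the idea of using raw query-key similarities instead of softmax-normalized attention comes from the summary above.

```python
# Minimal sketch of softmax-free query-key matching (assumed layer sizes and
# pooling): raw dot-product similarities between two images' tokens are used
# directly as a matching score, with no softmax weighting.
import torch
import torch.nn as nn


class QueryKeyMatcher(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a: (B, Na, C), tokens_b: (B, Nb, C) from the two images
        sim = self.q(tokens_a) @ self.k(tokens_b).transpose(1, 2)  # (B, Na, Nb)
        # keep only the strongest correspondence per query token, then average
        return sim.max(dim=2).values.mean(dim=1)  # (B,) matching scores
```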
arXiv Detail & Related papers (2021-05-30T05:38:33Z)