UniFormer: Unified Transformer for Efficient Spatiotemporal
Representation Learning
- URL: http://arxiv.org/abs/2201.04676v1
- Date: Wed, 12 Jan 2022 20:02:32 GMT
- Title: UniFormer: Unified Transformer for Efficient Spatiotemporal
Representation Learning
- Authors: Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li,
Yu Qiao
- Abstract summary: Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
- Score: 68.55487598401788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is a challenging task to learn rich and multi-scale spatiotemporal
semantics from high-dimensional videos, due to large local redundancy and
complex global dependency between video frames. The recent advances in this
research have been mainly driven by 3D convolutional neural networks and vision
transformers. Although 3D convolution can efficiently aggregate local context
to suppress local redundancy from a small 3D neighborhood, it lacks the
capability to capture global dependency because of the limited receptive field.
Alternatively, vision transformers can effectively capture long-range
dependency via the self-attention mechanism, but they are limited in reducing
local redundancy because of blind similarity comparison among all the tokens in
each layer. Based on these observations, we propose a novel Unified transFormer
(UniFormer) which seamlessly integrates merits of 3D convolution and
spatiotemporal self-attention in a concise transformer format, and achieves a
preferable balance between computation and accuracy. Different from traditional
transformers, our relation aggregator can tackle both spatiotemporal redundancy
and dependency, by learning local and global token affinity respectively in
shallow and deep layers. We conduct extensive experiments on the popular video
benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1
accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than
other state-of-the-art methods. For Something-Something V1 and V2, our
UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1
accuracy respectively. Code is available at
https://github.com/Sense-X/UniFormer.
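
To make the design described in the abstract concrete, here is a minimal, illustrative sketch (not the authors' released implementation; see the repository linked above for that) of a block whose relation aggregator learns local token affinity in shallow layers and global token affinity in deep layers. It assumes the local aggregator can be approximated by a depthwise 3D convolution over a small spatiotemporal neighborhood and the global aggregator by standard multi-head self-attention; the class names (LocalAggregator, GlobalAggregator, UniFormerStyleBlock) and all hyperparameters are hypothetical.

```python
# Sketch of the idea from the abstract: one block template whose relation
# aggregator is LOCAL (convolution-like, suppresses spatiotemporal redundancy)
# in shallow layers and GLOBAL (self-attention, captures long-range dependency)
# in deep layers. Names and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class LocalAggregator(nn.Module):
    """Local token affinity over a small 3D neighborhood (depthwise 3D conv)."""

    def __init__(self, dim, kernel_size=(3, 5, 5)):
        super().__init__()
        padding = tuple(k // 2 for k in kernel_size)
        self.dwconv = nn.Conv3d(dim, dim, kernel_size, padding=padding, groups=dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.dwconv(x)


class GlobalAggregator(nn.Module):
    """Global token affinity via multi-head self-attention over all tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)        # attend across all tokens
        return out.transpose(1, 2).reshape(B, C, T, H, W)


class UniFormerStyleBlock(nn.Module):
    """Residual block: relation aggregator followed by a pointwise feed-forward."""

    def __init__(self, dim, use_global):
        super().__init__()
        self.norm1 = nn.BatchNorm3d(dim)
        self.aggregator = GlobalAggregator(dim) if use_global else LocalAggregator(dim)
        self.norm2 = nn.BatchNorm3d(dim)
        self.mlp = nn.Sequential(
            nn.Conv3d(dim, dim * 4, 1), nn.GELU(), nn.Conv3d(dim * 4, dim, 1)
        )

    def forward(self, x):
        x = x + self.aggregator(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


if __name__ == "__main__":
    video = torch.randn(1, 64, 8, 14, 14)                 # (B, C, T, H, W)
    shallow = UniFormerStyleBlock(64, use_global=False)   # local affinity stage
    deep = UniFormerStyleBlock(64, use_global=True)       # global affinity stage
    print(deep(shallow(video)).shape)                     # torch.Size([1, 64, 8, 14, 14])
```

In this reading of the abstract, shallow stages use the cheap convolution-like aggregator to suppress local spatiotemporal redundancy, while deeper stages (which in the paper operate on downsampled token maps) switch to self-attention to capture global dependency, which is where the reported computation/accuracy balance comes from.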
Related papers
- ACC-ViT: Atrous Convolution's Comeback in Vision Transformers [5.224344210588584]
We introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information.
We also propose a general vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks.
ACC-ViT is therefore a strong vision backbone, which is also competitive in mobile-scale versions, ideal for niche applications with small datasets.
arXiv Detail & Related papers (2024-03-07T04:05:16Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilots, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
- UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z)
- DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition [140.66371549815034]
We propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.
We show that DualFormer sets new state-of-the-art 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs which is at least 3.2 times fewer than existing methods with similar performances.
arXiv Detail & Related papers (2021-12-09T03:05:19Z)
- Token Shift Transformer for Video Classification [34.05954523287077]
Transformers achieve remarkable successes in understanding 1- and 2-dimensional signals.
Their encoders naturally contain computationally intensive operations such as pair-wise self-attention.
This paper presents Token Shift Module (i.e., TokShift) for modeling temporal relations within each transformer encoder.
arXiv Detail & Related papers (2021-08-05T08:04:54Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.