DualFormer: Local-Global Stratified Transformer for Efficient Video
Recognition
- URL: http://arxiv.org/abs/2112.04674v1
- Date: Thu, 9 Dec 2021 03:05:19 GMT
- Title: DualFormer: Local-Global Stratified Transformer for Efficient Video
Recognition
- Authors: Yuxuan Liang, Pan Zhou, Roger Zimmermann, Shuicheng Yan
- Abstract summary: We propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.
We show that DualFormer sets a new state of the art of 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs, at least 3.2 times fewer than existing methods with similar performance.
- Score: 140.66371549815034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While transformers have shown great potential on video recognition tasks with
their strong capability of capturing long-range dependencies, they often suffer
from high computational costs induced by the self-attention operation on the huge number
of 3D tokens in a video. In this paper, we propose a new transformer
architecture, termed DualFormer, which can effectively and efficiently perform
space-time attention for video recognition. Specifically, our DualFormer
stratifies the full space-time attention into dual cascaded levels, i.e., to
first learn fine-grained local space-time interactions among nearby 3D tokens,
followed by capturing coarse-grained global dependencies between the query
token and the global pyramid contexts. Different from existing
methods that apply space-time factorization or restrict attention computations
within local windows for improving efficiency, our local-global stratified
strategy can well capture both short- and long-range spatiotemporal
dependencies, and meanwhile greatly reduces the number of keys and values in
attention computation to boost efficiency. Experimental results show the
superiority of DualFormer on five video benchmarks against existing methods. In
particular, DualFormer sets a new state of the art of 82.9%/85.2% top-1 accuracy on
Kinetics-400/600 with around 1000G inference FLOPs, at least 3.2 times
fewer than existing methods with similar performance.
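As a rough illustration of this stratified design, the sketch below implements one such block in PyTorch: fine-grained attention within small non-overlapping space-time windows, followed by attention from every token to an average-pooled summary of the whole clip. The window size, the single pooling rate (standing in for the multi-scale pyramid contexts), and the block layout are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalBlock(nn.Module):
    """Local window attention followed by attention to a pooled global context."""

    def __init__(self, dim, heads=4, window=(2, 7, 7), pool=(2, 7, 7)):
        super().__init__()
        self.window, self.pool = window, pool
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window

        # Level 1: fine-grained attention inside non-overlapping space-time windows.
        h = self.norm1(x)
        h = h.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        h = h.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        h, _ = self.local_attn(h, h, h)
        h = h.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        h = h.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        x = x + h

        # Level 2: every token queries a coarse, average-pooled summary of the clip,
        # keeping the key/value set small (one pooling rate stands in for the
        # multi-scale "pyramid contexts" mentioned in the abstract).
        q = self.norm2(x).reshape(B, T * H * W, C)
        kv = F.avg_pool3d(x.permute(0, 4, 1, 2, 3).contiguous(), self.pool)
        kv = kv.flatten(2).transpose(1, 2)                  # (B, few tokens, C)
        g, _ = self.global_attn(q, kv, kv)
        return x + g.reshape(B, T, H, W, C)


tokens = torch.randn(2, 8, 14, 14, 64)                      # 8 frames of 14x14 tokens, 64-dim
print(LocalGlobalBlock(64)(tokens).shape)                   # torch.Size([2, 8, 14, 14, 64])
```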
Related papers
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
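As a hedged reading of "spatial and channel-wise" attention in one block (not the actual EPA design), the sketch below runs two branches over the same normalized tokens: one attends over token positions, the other re-weights feature channels via a C x C attention map, and a linear layer fuses the two.

```python
import torch
import torch.nn as nn


class PairedAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                                    # x: (B, N, C) flattened 3D tokens
        h = self.norm(x)
        # Spatial branch: tokens attend over the N token positions.
        s, _ = self.spatial(h, h, h)
        # Channel branch: a C x C attention map built from the same normalized tokens.
        chan = torch.softmax(h.transpose(1, 2) @ h / h.size(1) ** 0.5, dim=-1)  # (B, C, C)
        c = h @ chan                                          # re-weight channels per token
        return x + self.fuse(torch.cat([s, c], dim=-1))


tokens = torch.randn(2, 512, 64)                              # e.g. a flattened 8x8x8 token volume
print(PairedAttention(64)(tokens).shape)                      # torch.Size([2, 512, 64])
```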
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - An Efficient Spatio-Temporal Pyramid Transformer for Action Detection [40.68615998427292]
We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention to capture long-term space-time dependencies in the later stages.
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
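A loose sketch of this stage layout is given below, under simplifying assumptions (1-D token chunks standing in for windows, arbitrary depths and widths, a crude token-merging step); it is not the STPT architecture itself.

```python
import torch
import torch.nn as nn


class AttnBlock(nn.Module):
    def __init__(self, dim, heads=4, window=None):
        super().__init__()
        self.window = window                       # None => global attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, N, C)
        B, N, C = x.shape
        h = self.norm(x)
        if self.window is None:                    # later stages: all tokens attend to all
            out, _ = self.attn(h, h, h)
        else:                                      # early stages: attend within chunks only
            w = self.window
            h = h.reshape(B * (N // w), w, C)
            out, _ = self.attn(h, h, h)
            out = out.reshape(B, N, C)
        return x + out


class TinySTPT(nn.Module):
    """Windowed (local) blocks first, then token merging, then a global block."""
    def __init__(self, dim=64):
        super().__init__()
        self.local_stage = nn.Sequential(AttnBlock(dim, window=49), AttnBlock(dim, window=49))
        self.merge = nn.Linear(2 * dim, dim)       # concatenate adjacent token pairs, halving N
        self.global_stage = AttnBlock(dim, window=None)

    def forward(self, x):                          # x: (B, N, C), N divisible by 98
        x = self.local_stage(x)
        B, N, C = x.shape
        x = self.merge(x.reshape(B, N // 2, 2 * C))
        return self.global_stage(x)


x = torch.randn(2, 784, 64)                        # e.g. 4 frames of 14x14 tokens, flattened
print(TinySTPT()(x).shape)                         # torch.Size([2, 392, 64])
```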
arXiv Detail & Related papers (2022-07-21T12:38:05Z) - Points to Patches: Enabling the Use of Self-Attention for 3D Shape
Recognition [19.89482062012177]
We propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which combines local and global attention mechanisms.
Experiments on shape classification show that such an approach provides more useful features for downstream tasks than the baseline Transformer.
We also extend our method to feature matching for scene reconstruction, showing that it can be used in conjunction with existing scene reconstruction pipelines.
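The two-stage local/global idea can be sketched as attention inside the neighborhood of a few sampled anchor points, followed by attention among the anchors themselves. The random anchor sampling, neighborhood size k, and feature dimensions below are assumptions for illustration, not the Point-TnT configuration.

```python
import torch
import torch.nn as nn


class LocalGlobalPointAttention(nn.Module):
    def __init__(self, dim=64, heads=4, num_anchors=32, k=16):
        super().__init__()
        self.num_anchors, self.k = num_anchors, k
        self.embed = nn.Linear(3, dim)
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, xyz):                                   # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        feats = self.embed(xyz)                               # (B, N, dim)

        # Local stage: each (randomly chosen) anchor attends over its k nearest neighbors.
        anchor_idx = torch.stack([torch.randperm(N)[: self.num_anchors] for _ in range(B)])
        anchors = torch.gather(xyz, 1, anchor_idx.unsqueeze(-1).expand(-1, -1, 3))
        knn_idx = torch.cdist(anchors, xyz).topk(self.k, largest=False).indices   # (B, A, k)
        nbr = torch.gather(feats.unsqueeze(1).expand(-1, self.num_anchors, -1, -1), 2,
                           knn_idx.unsqueeze(-1).expand(-1, -1, -1, feats.size(-1)))
        nbr = nbr.reshape(B * self.num_anchors, self.k, -1)
        local, _ = self.local_attn(nbr[:, :1], nbr, nbr)      # anchor token queries its patch
        local = local.reshape(B, self.num_anchors, -1)

        # Global stage: anchors attend to each other to share long-range context.
        out, _ = self.global_attn(local, local, local)
        return out                                            # (B, num_anchors, dim)


cloud = torch.rand(2, 1024, 3)
print(LocalGlobalPointAttention()(cloud).shape)               # torch.Size([2, 32, 64])
```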
arXiv Detail & Related papers (2022-04-08T09:31:24Z) - Uniformer: Unified Transformer for Efficient Spatiotemporal
Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
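A hedged sketch of this unification: the same residual block format, with the token mixer being a depth-wise 3D convolution in shallow blocks and spatiotemporal self-attention in deeper ones. The kernel size, depth split, and MLP details are assumptions for illustration, not UniFormer's exact relation aggregators.

```python
import torch
import torch.nn as nn


class UnifiedBlock(nn.Module):
    def __init__(self, dim, heads=4, use_conv=True):
        super().__init__()
        self.use_conv = use_conv
        if use_conv:
            # Local token mixer: depth-wise 3x3x3 convolution over (T, H, W).
            self.mixer = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        else:
            # Global token mixer: spatiotemporal self-attention over all tokens.
            self.mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                          # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        if self.use_conv:
            x = x + self.mixer(x)
        else:
            tokens = x.flatten(2).transpose(1, 2)  # (B, T*H*W, C)
            attn, _ = self.mixer(tokens, tokens, tokens)
            x = x + attn.transpose(1, 2).reshape(B, C, T, H, W)
        tokens = x.flatten(2).transpose(1, 2)
        tokens = tokens + self.mlp(self.norm(tokens))
        return tokens.transpose(1, 2).reshape(B, C, T, H, W)


video = torch.randn(1, 64, 8, 14, 14)
stage = nn.Sequential(UnifiedBlock(64, use_conv=True),    # shallow: convolution
                      UnifiedBlock(64, use_conv=False))   # deep: self-attention
print(stage(video).shape)                                 # torch.Size([1, 64, 8, 14, 14])
```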
arXiv Detail & Related papers (2022-01-12T20:02:32Z) - Efficient Global-Local Memory for Real-time Instrument Segmentation of
Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over the long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method largely outperforms state-of-the-art works in segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while maintaining a high processing speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2x faster inference, with a model storage size of less than 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)