Space-time Mixing Attention for Video Transformer
- URL: http://arxiv.org/abs/2106.05968v2
- Date: Fri, 11 Jun 2021 12:06:04 GMT
- Title: Space-time Mixing Attention for Video Transformer
- Authors: Adrian Bulat and Juan-Manuel Perez-Rua and Swathikiran Sudhakaran and
Brais Martinez and Georgios Tzimiropoulos
- Abstract summary: We propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
- Score: 55.50839896863275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is on video recognition using Transformers. Very recent attempts
in this area have demonstrated promising results in terms of recognition
accuracy, yet they have also been shown to induce, in many cases, significant
computational overheads due to the additional modelling of the temporal
information. In this work, we propose a Video Transformer model the complexity
of which scales linearly with the number of frames in the video sequence and
hence induces no overhead compared to an image-based Transformer model. To
achieve this, our model makes two approximations to the full space-time
attention used in Video Transformers: (a) It restricts time attention to a
local temporal window and capitalizes on the Transformer's depth to obtain full
temporal coverage of the video sequence. (b) It uses efficient space-time
mixing to jointly attend to spatial and temporal locations without inducing any
additional cost on top of a spatial-only attention model. We also show how to
integrate two very lightweight mechanisms for global temporal-only attention
which provide additional accuracy improvements at minimal computational cost.
We demonstrate that our model produces very high recognition accuracy on the
most popular video recognition datasets while at the same time being
significantly more efficient than other Video Transformer models. Code will be
made available.
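As a rough illustration of the two approximations described above, the following is a minimal, hypothetical PyTorch-style sketch (not the authors' released code): key and value channels are mixed across a local temporal window by shifting channel groups between neighbouring frames, and attention is then computed per frame over spatial locations only, so the layer costs roughly the same as spatial-only attention. The module name, tensor layout (batch, frames, spatial tokens, channels) and default window size are assumptions made for illustration; the paper's global temporal-only attention mechanisms are not shown.
```python
# Hypothetical sketch (not the authors' code): approximate full space-time
# attention by (a) mixing key/value channels across a local temporal window
# and (b) attending only over spatial locations within each frame.
import torch
import torch.nn as nn


class SpaceTimeMixingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, t_window=1):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.t_window = t_window                 # frames mixed on each side of t
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def temporal_mix(self, x):
        # x: (B, T, N, C). Assign one channel group to each temporal offset in
        # [-t_window, t_window] and shift it along the time axis, so every token
        # carries features from its local temporal neighbourhood at no extra
        # attention cost. torch.roll wraps around at the clip boundaries; a real
        # implementation would likely zero-pad instead.
        offsets = list(range(-self.t_window, self.t_window + 1))
        chunks = x.chunk(len(offsets), dim=-1)
        mixed = [torch.roll(c, shifts=o, dims=1) for c, o in zip(chunks, offsets)]
        return torch.cat(mixed, dim=-1)

    def forward(self, x):
        # x: (B, T, N, C) = batch, frames, spatial tokens per frame, channels
        B, T, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k, v = self.temporal_mix(k), self.temporal_mix(v)   # space-time mixing

        def split_heads(t):                                  # -> (B*T, heads, N, head_dim)
            return t.reshape(B * T, N, self.num_heads, -1).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # (B*T, h, N, N)
        out = (attn @ v).transpose(1, 2).reshape(B, T, N, C)
        return self.proj(out)


# Example: 8 frames of 14x14 = 196 patch tokens with 384 channels.
tokens = torch.randn(2, 8, 196, 384)
print(SpaceTimeMixingAttention(dim=384, num_heads=8, t_window=1)(tokens).shape)
```
Because each frame still attends only over its own N spatial tokens, the total attention cost is T times O(N^2), i.e. linear in the number of frames T, which is the complexity claim made in the abstract.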
Related papers
- HumMUSS: Human Motion Understanding using State Space Models [6.821961232645209]
We propose a novel attention-free model for human motion understanding building upon recent advancements in state space models.
Our model supports both offline and real-time applications.
For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches.
arXiv Detail & Related papers (2024-04-16T19:59:21Z)
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers [27.029600581635957]
We describe a method for identifying and re-processing only those tokens that have changed significantly over time.
We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100).
arXiv Detail & Related papers (2023-08-25T17:10:12Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
- Shifted Chunk Transformer for Spatio-Temporal Representational Learning [24.361059477031162]
We construct a shifted chunk Transformer with pure self-attention blocks.
This Transformer can learn hierarchical spatio-temporal features from a tiny patch to a global video clip.
It outperforms state-of-the-art approaches on Kinetics, Kinetics-600, UCF101, and HMDB51.
arXiv Detail & Related papers (2021-08-26T04:34:33Z)
- VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)
- Video Swin Transformer [41.41741134859565]
We advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off.
The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain.
Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
arXiv Detail & Related papers (2021-06-24T17:59:46Z)
- Decoupled Spatial-Temporal Transformer for Video Inpainting [77.8621673355983]
Video inpainting aims to fill the given holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches.
Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance.
We propose a Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency.
arXiv Detail & Related papers (2021-04-14T05:47:46Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales efficiently.
An alternative approach, using vision-text transformers with cross-attention, gives considerable improvements in accuracy over the joint embeddings (a minimal sketch contrasting the two scoring schemes follows this list).
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
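As a side note on the last entry above, here is a hypothetical sketch (assumed for illustration, not taken from any of the papers listed) of why dual encoders are attractive at retrieval time: gallery embeddings are precomputed once and a query is answered with a single matrix product, whereas a cross-attention scorer must run a joint forward pass per query-candidate pair; a "fast then slow" pipeline retrieves a shortlist with the former and re-ranks it with the latter. All names below (dual_encoder_search, cross_attention_rerank, dummy_scorer) are illustrative assumptions.
```python
# Hypothetical sketch: dual-encoder retrieval vs. cross-attention re-ranking.
import torch


def dual_encoder_search(query_emb, gallery_embs, k=5):
    # Gallery embeddings are precomputed offline -> one matrix product per query.
    query_emb = torch.nn.functional.normalize(query_emb, dim=-1)
    gallery_embs = torch.nn.functional.normalize(gallery_embs, dim=-1)
    scores = gallery_embs @ query_emb                  # (num_items,)
    return scores.topk(k).indices


def cross_attention_rerank(joint_scorer, query_tokens, candidate_tokens, k=5):
    # The joint model must see (query, candidate) together, so it runs once per
    # candidate -- more accurate but far more expensive than a dot product.
    scores = torch.stack([joint_scorer(query_tokens, c) for c in candidate_tokens])
    return scores.topk(min(k, len(candidate_tokens))).indices


# Toy demo: retrieve a shortlist with the dual encoder, then re-rank only the
# shortlist with a (dummy) cross-attention scorer.
gallery = torch.randn(1000, 256)                       # precomputed video embeddings
query = torch.randn(256)                               # text query embedding
shortlist = dual_encoder_search(query, gallery, k=20)

dummy_scorer = lambda q, c: (q.mean(0) * c.mean(0)).sum()   # stand-in joint model
candidates = [torch.randn(16, 256) for _ in shortlist]      # token sequences of shortlist
top = cross_attention_rerank(dummy_scorer, torch.randn(8, 256), candidates, k=5)
print(top)
```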