Space-time Mixing Attention for Video Transformer
- URL: http://arxiv.org/abs/2106.05968v2
- Date: Fri, 11 Jun 2021 12:06:04 GMT
- Title: Space-time Mixing Attention for Video Transformer
- Authors: Adrian Bulat and Juan-Manuel Perez-Rua and Swathikiran Sudhakaran and
Brais Martinez and Georgios Tzimiropoulos
- Abstract summary: We propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
- Score: 55.50839896863275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is on video recognition using Transformers. Very recent attempts
in this area have demonstrated promising results in terms of recognition
accuracy, yet they have also been shown to induce, in many cases, significant
computational overheads due to the additional modelling of the temporal
information. In this work, we propose a Video Transformer model the complexity
of which scales linearly with the number of frames in the video sequence and
hence induces no overhead compared to an image-based Transformer model. To
achieve this, our model makes two approximations to the full space-time
attention used in Video Transformers: (a) It restricts time attention to a
local temporal window and capitalizes on the Transformer's depth to obtain full
temporal coverage of the video sequence. (b) It uses efficient space-time
mixing to jointly attend to spatial and temporal locations without inducing any
additional cost on top of a spatial-only attention model. We also show how to
integrate two very lightweight mechanisms for global temporal-only attention
which provide additional accuracy improvements at minimal computational cost.
We demonstrate that our model produces very high recognition accuracy on the
most popular video recognition datasets while at the same time being
significantly more efficient than other Video Transformer models. Code will be
made available.
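As a rough illustration of the two approximations described above, the following is a minimal, hypothetical PyTorch-style sketch (not the authors' released code): key and value channels are mixed across a local temporal window by shifting channel groups between neighbouring frames, and attention is then computed per frame over spatial locations only, so the layer costs roughly the same as spatial-only attention. The module name, tensor layout (batch, frames, spatial tokens, channels) and default window size are assumptions made for illustration; the paper's global temporal-only attention mechanisms are not shown.
```python
# Hypothetical sketch (not the authors' code): approximate full space-time
# attention by (a) mixing key/value channels across a local temporal window
# and (b) attending only over spatial locations within each frame.
import torch
import torch.nn as nn


class SpaceTimeMixingAttention(nn.Module):
    def __init__(self, dim, num_heads=8, t_window=1):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.t_window = t_window                 # frames mixed on each side of t
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def temporal_mix(self, x):
        # x: (B, T, N, C). Assign one channel group to each temporal offset in
        # [-t_window, t_window] and shift it along the time axis, so every token
        # carries features from its local temporal neighbourhood at no extra
        # attention cost. torch.roll wraps around at the clip boundaries; a real
        # implementation would likely zero-pad instead.
        offsets = list(range(-self.t_window, self.t_window + 1))
        chunks = x.chunk(len(offsets), dim=-1)
        mixed = [torch.roll(c, shifts=o, dims=1) for c, o in zip(chunks, offsets)]
        return torch.cat(mixed, dim=-1)

    def forward(self, x):
        # x: (B, T, N, C) = batch, frames, spatial tokens per frame, channels
        B, T, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k, v = self.temporal_mix(k), self.temporal_mix(v)   # space-time mixing

        def split_heads(t):                                  # -> (B*T, heads, N, head_dim)
            return t.reshape(B * T, N, self.num_heads, -1).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # (B*T, h, N, N)
        out = (attn @ v).transpose(1, 2).reshape(B, T, N, C)
        return self.proj(out)


# Example: 8 frames of 14x14 = 196 patch tokens with 384 channels.
tokens = torch.randn(2, 8, 196, 384)
print(SpaceTimeMixingAttention(dim=384, num_heads=8, t_window=1)(tokens).shape)
```
Because each frame still attends only over its own N spatial tokens, the total attention cost is T times O(N^2), i.e. linear in the number of frames T, which is the complexity claim made in the abstract.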
Related papers
- HumMUSS: Human Motion Understanding using State Space Models [6.821961232645209]
We propose a novel attention-free model for human motion understanding building upon recent advancements in state space models.
Our model supports both offline and real-time applications.
For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches.
arXiv Detail & Related papers (2024-04-16T19:59:21Z)
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers [27.029600581635957]
We describe a method for identifying and re-processing only those tokens that have changed significantly over time.
We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100).
arXiv Detail & Related papers (2023-08-25T17:10:12Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
- Shifted Chunk Transformer for Spatio-Temporal Representational Learning [24.361059477031162]
We construct a shifted chunk Transformer with pure self-attention blocks.
This Transformer can learn hierarchical spatio-temporal features from a tiny patch to a global video clip.
It outperforms state-of-the-art approaches on Kinetics, Kinetics-600, UCF101, and HMDB51.
arXiv Detail & Related papers (2021-08-26T04:34:33Z)
- VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)
- Video Swin Transformer [41.41741134859565]
We advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off.
The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain.
Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
arXiv Detail & Related papers (2021-06-24T17:59:46Z)
- Decoupled Spatial-Temporal Transformer for Video Inpainting [77.8621673355983]
Video inpainting aims to fill the given holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches.
Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance.
We propose a Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency.
arXiv Detail & Related papers (2021-04-14T05:47:46Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales efficiently.
An alternative approach, using vision-text transformers with cross-attention, gives considerable improvements in accuracy over the joint embeddings (a minimal sketch contrasting the two scoring schemes follows this list).
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
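As a side note on the last entry above, here is a hypothetical sketch (assumed for illustration, not taken from any of the papers listed) of why dual encoders are attractive at retrieval time: gallery embeddings are precomputed once and a query is answered with a single matrix product, whereas a cross-attention scorer must run a joint forward pass per query-candidate pair; a "fast then slow" pipeline retrieves a shortlist with the former and re-ranks it with the latter. All names below (dual_encoder_search, cross_attention_rerank, dummy_scorer) are illustrative assumptions.
```python
# Hypothetical sketch: dual-encoder retrieval vs. cross-attention re-ranking.
import torch


def dual_encoder_search(query_emb, gallery_embs, k=5):
    # Gallery embeddings are precomputed offline -> one matrix product per query.
    query_emb = torch.nn.functional.normalize(query_emb, dim=-1)
    gallery_embs = torch.nn.functional.normalize(gallery_embs, dim=-1)
    scores = gallery_embs @ query_emb                  # (num_items,)
    return scores.topk(k).indices


def cross_attention_rerank(joint_scorer, query_tokens, candidate_tokens, k=5):
    # The joint model must see (query, candidate) together, so it runs once per
    # candidate -- more accurate but far more expensive than a dot product.
    scores = torch.stack([joint_scorer(query_tokens, c) for c in candidate_tokens])
    return scores.topk(min(k, len(candidate_tokens))).indices


# Toy demo: retrieve a shortlist with the dual encoder, then re-rank only the
# shortlist with a (dummy) cross-attention scorer.
gallery = torch.randn(1000, 256)                       # precomputed video embeddings
query = torch.randn(256)                               # text query embedding
shortlist = dual_encoder_search(query, gallery, k=20)

dummy_scorer = lambda q, c: (q.mean(0) * c.mean(0)).sum()   # stand-in joint model
candidates = [torch.randn(16, 256) for _ in shortlist]      # token sequences of shortlist
top = cross_attention_rerank(dummy_scorer, torch.randn(8, 256), candidates, k=5)
print(top)
```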