Deformable Video Transformer
- URL: http://arxiv.org/abs/2203.16795v1
- Date: Thu, 31 Mar 2022 04:52:27 GMT
- Title: Deformable Video Transformer
- Authors: Jue Wang and Lorenzo Torresani
- Abstract summary: We introduce the Deformable Video Transformer (DVT), which predicts a small subset of video patches to attend for each query location based on motion information.
Our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on four datasets.
- Score: 44.71254375663616
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video transformers have recently emerged as an effective alternative to
convolutional networks for action classification. However, most prior video
transformers adopt either global space-time attention or hand-defined
strategies to compare patches within and across frames. These fixed attention
schemes not only have high computational cost but, by comparing patches at
predetermined locations, they neglect the motion dynamics in the video. In this
paper, we introduce the Deformable Video Transformer (DVT), which dynamically
predicts a small subset of video patches to attend for each query location
based on motion information, thus allowing the model to decide where to look in
the video based on correspondences across frames. Crucially, these motion-based
correspondences are obtained at zero-cost from information stored in the
compressed format of the video. Our deformable attention mechanism is optimised
directly with respect to classification performance, thus eliminating the need
for suboptimal hand-design of attention strategies. Experiments on four
large-scale video benchmarks (Kinetics-400, Something-Something-V2,
EPIC-KITCHENS and Diving-48) demonstrate that, compared to existing video
transformers, our model achieves higher accuracy at the same or lower
computational cost, and it attains state-of-the-art results on these four
datasets.
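The core idea described in the abstract is a motion-guided deformable attention: for each query patch, the model samples only a small set of locations, displaced according to motion information recovered from the compressed video stream, and attends to those patches alone. The sketch below is a minimal illustration of that idea, not the authors' implementation: the module name, tensor shapes, the learned offset head, and the use of torch.nn.functional.grid_sample are assumptions, and for simplicity it samples within each frame's feature map rather than across frames as in the paper.

```python
# Hypothetical sketch of motion-guided deformable attention in the spirit of DVT.
# Shapes, the offset head, and grid_sample-based sampling are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionGuidedDeformableAttention(nn.Module):
    """For each query patch, attend to K sampled locations instead of all patches.

    Sampling location = patch-centre coordinate + motion displacement
    (e.g. from codec motion vectors) + a small learned residual offset.
    """

    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        # Learned residual (x, y) offsets for each of the K sampling points.
        self.offset_head = nn.Linear(dim, 2 * num_points)
        self.scale = dim ** -0.5

    def forward(self, x, motion, ref_grid):
        """
        x:        (B, T, H, W, C)  patch features for T frames
        motion:   (B, T, H, W, 2)  per-patch displacement in [-1, 1] coordinates
        ref_grid: (H, W, 2)        normalised patch-centre coordinates in [-1, 1]
        """
        B, T, H, W, C = x.shape
        q = self.to_q(x)                                        # (B, T, H, W, C)
        kv_maps = self.to_kv(x)                                 # (B, T, H, W, 2C)

        # Sampling points: motion-shifted reference location plus a small learned offset.
        offsets = self.offset_head(q).view(B, T, H, W, self.num_points, 2)
        base = ref_grid.view(1, 1, H, W, 1, 2) + motion.unsqueeze(4)
        samp = (base + 0.1 * torch.tanh(offsets)).clamp(-1, 1)  # (B, T, H, W, K, 2)

        # Bilinearly sample K key/value vectors per query location.
        kv_in = kv_maps.permute(0, 1, 4, 2, 3).reshape(B * T, 2 * C, H, W)
        grid = samp.view(B * T, H, W * self.num_points, 2)
        kv = F.grid_sample(kv_in, grid, align_corners=False)    # (BT, 2C, H, W*K)
        kv = kv.view(B, T, 2 * C, H, W, self.num_points)
        kv = kv.permute(0, 1, 3, 4, 5, 2)                       # (B, T, H, W, K, 2C)
        k, v = kv.chunk(2, dim=-1)

        # Attention restricted to the K sampled patches.
        attn = (q.unsqueeze(4) * k).sum(-1) * self.scale        # (B, T, H, W, K)
        attn = attn.softmax(dim=-1)
        out = (attn.unsqueeze(-1) * v).sum(4)                   # (B, T, H, W, C)
        return out
```

In the paper, the displacements come from motion vectors already stored in the compressed video format, so they are obtained at zero cost; the sketch simply treats them as a given tensor, and the whole module is trained end-to-end against the classification loss, which is what removes the need for hand-designed attention patterns.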
Related papers
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model that keeps the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatio-temporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - Video Swin Transformer [41.41741134859565]
We advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off.
The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain.
Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
arXiv Detail & Related papers (2021-06-24T17:59:46Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - VidTr: Video Transformer Without Convolutions [32.710988574799735]
We introduce the Video Transformer (VidTr) with separable attention for spatio-temporal video classification.
VidTr is able to aggregate spatio-temporal information via stacked attentions and provide better performance with higher efficiency (a generic sketch of separable space-time attention follows this list).
arXiv Detail & Related papers (2021-04-23T17:59:01Z)
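Several of the papers above (VidTr, Space-time Mixing Attention) rely on separable or factorized space-time attention rather than full space-time attention. The sketch below illustrates the generic pattern of attending over space within each frame and then over time for each patch position; it is an illustrative assumption, not the exact design of any of the listed models, and the module name, layer sizes, and space-then-time ordering are hypothetical choices.

```python
# Generic sketch of separable (factorized) space-time attention.
# Not the exact design of VidTr or any listed model; ordering and sizes are assumptions.
import torch
import torch.nn as nn


class SeparableSpaceTimeAttention(nn.Module):
    """Attend over space within each frame, then over time for each patch position."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, N, C) -- T frames, N patches per frame, C channels
        B, T, N, C = x.shape

        # Spatial attention: the N patches of each frame attend to one another.
        xs = self.norm1(x).reshape(B * T, N, C)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(B, T, N, C)

        # Temporal attention: each patch position attends across the T frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(B, N, T, C).permute(0, 2, 1, 3)
        return x
```

Compared with full space-time attention over all T*N tokens, this factorization reduces the attention cost per token from O(T*N) to O(N) + O(T), which is the efficiency argument these related works make; DVT instead keeps cross-frame comparisons but restricts them to a small, motion-guided subset of patches.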