VidTr: Video Transformer Without Convolutions
- URL: http://arxiv.org/abs/2104.11746v1
- Date: Fri, 23 Apr 2021 17:59:01 GMT
- Title: VidTr: Video Transformer Without Convolutions
- Authors: Xinyu Li, Yanyi Zhang, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio
Brattoli, Hao Chen, Ivan Marsic, Joseph Tighe
- Abstract summary: We introduce Video Transformer (VidTr) with separable-attention for video classification.
VidTr is able to aggregate spatio-temporal information via stacked attentions and provide better performance with higher efficiency.
- Score: 32.710988574799735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Video Transformer (VidTr) with separable-attention for video
classification. Compared with commonly used 3D networks, VidTr is able to
aggregate spatio-temporal information via stacked attentions and provide better
performance with higher efficiency. We first introduce the vanilla video
transformer and show that the transformer module is able to perform spatio-temporal
modeling from raw pixels, but with heavy memory usage. We then present VidTr
which reduces the memory cost by 3.3$\times$ while keeping the same
performance. To further compact the model, we propose standard-deviation-based
topK pooling attention, which reduces computation by dropping
non-informative features. VidTr achieves state-of-the-art performance on five
commonly used datasets with lower computational requirements, showing both the
efficiency and effectiveness of our design. Finally, error analysis and
visualization show that VidTr is especially good at predicting actions that
require long-term temporal reasoning. The code and pre-trained weights will be
released.
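The abstract names two mechanisms, separable attention and standard-deviation-based topK pooling, without giving implementation details. The sketch below (PyTorch) is a minimal illustration of how such a block could be wired up; the tensor layout, the choice to score temporal positions by the standard deviation of their attention rows, the placement of the pooling between the temporal and spatial attention, and all layer sizes are assumptions for illustration, not the released VidTr implementation.

```python
# Minimal sketch (not the authors' code): separable spatio-temporal attention
# with a standard-deviation-based topK pooling that drops low-information
# temporal positions. All shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn


class SeparableAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12, topk_t=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.topk_t = topk_t  # temporal positions kept after pooling

    def forward(self, x):
        # x: (B, T, S, C) = batch, temporal tokens, spatial tokens, channels
        B, T, S, C = x.shape

        # Temporal attention: attend across T for every spatial location.
        xt = self.norm_t(x).permute(0, 2, 1, 3).reshape(B * S, T, C)
        t_out, t_weights = self.temporal_attn(xt, xt, xt, need_weights=True)
        x = x + t_out.reshape(B, S, T, C).permute(0, 2, 1, 3)

        # Std-based topK pooling (assumption: scored from the temporal
        # attention maps): keep temporal positions whose attention rows
        # deviate most from uniform, i.e. are most informative.
        scores = t_weights.reshape(B, S, T, T).std(dim=-1).mean(dim=1)  # (B, T)
        keep = scores.topk(min(self.topk_t, T), dim=-1).indices.sort(dim=-1).values
        x = torch.gather(x, 1, keep[:, :, None, None].expand(-1, -1, S, C))

        # Spatial attention: attend across S within each remaining frame.
        Tk = x.shape[1]
        xs = self.norm_s(x).reshape(B * Tk, S, C)
        s_out, _ = self.spatial_attn(xs, xs, xs, need_weights=False)
        return x + s_out.reshape(B, Tk, S, C)


if __name__ == "__main__":
    block = SeparableAttentionBlock(dim=64, heads=4, topk_t=4)
    clip = torch.randn(2, 8, 49, 64)   # e.g. 8 frames of 7x7 patch tokens
    print(block(clip).shape)           # torch.Size([2, 4, 49, 64])
```

The intuition behind the pooling step, as far as the abstract describes it, is that a temporal position whose attention row is nearly uniform (low standard deviation) carries little discriminative information and can be dropped before the remaining computation.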
Related papers
- A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z) - Spatiotemporal Self-attention Modeling with Temporal Patch Shift for
Action Recognition [34.98846882868107]
We propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition.
As a result, 3D self-attention can be computed with nearly the same computation and memory cost as 2D self-attention.
arXiv Detail & Related papers (2022-07-27T02:47:07Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Deformable Video Transformer [44.71254375663616]
We introduce the Deformable Video Transformer (DVT), which predicts a small subset of video patches to attend for each query location based on motion information.
Our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on four datasets.
arXiv Detail & Related papers (2022-03-31T04:52:27Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.