Video Transformers: A Survey
- URL: http://arxiv.org/abs/2201.05991v1
- Date: Sun, 16 Jan 2022 07:31:55 GMT
- Title: Video Transformers: A Survey
- Authors: Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi,
Thomas B. Moeslund and Albert Clapés
- Abstract summary: We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
- Score: 42.314208650554264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have shown great success modeling long-range interactions.
Nevertheless, they scale quadratically with input length and lack inductive
biases. These limitations can be further exacerbated when dealing with the high
dimensionality of video. Proper modeling of video, which can span from seconds
to hours, requires handling long-range interactions. This makes Transformers a
promising tool for solving video-related tasks, but some adaptations are
required. While there are previous works that study the advances of
Transformers for vision tasks, there is none that focuses on an in-depth analysis of
video-specific designs. In this survey we analyse and summarize the main
contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a
very widespread use of large CNN backbones to reduce dimensionality and a
predominance of patches and frames as tokens. Furthermore, we study how the
Transformer layer has been tweaked to handle longer sequences, generally by
reducing the number of tokens in a single attention operation. Also, we analyse
the self-supervised losses used to train Video Transformers, which to date are
mostly constrained to contrastive approaches. Finally, we explore how other
modalities are integrated with video and conduct a performance comparison on
the most common benchmark for Video Transformers (i.e., action classification),
finding them to outperform 3D CNN counterparts with equivalent FLOPs and no
significant parameter increase.
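The two trends above (tokenizing video into patches/frames and reducing how many tokens a single attention operation sees) are easiest to appreciate with a toy example. Below is a minimal, illustrative PyTorch sketch, not taken from any surveyed model; the module name, patch size, and embedding width are assumptions. It embeds a short clip into tubelet tokens with a strided 3D convolution and prints the resulting token count, which is what drives the quadratic cost of full self-attention.

```python
# Minimal, illustrative sketch (assumed shapes/names, not from any specific paper):
# turn a video clip into a sequence of "tubelet" tokens with a strided 3D conv,
# then note how the token count drives the quadratic cost of self-attention.
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Embed a video (B, C, T, H, W) into tokens (B, N, D) via non-overlapping 3D patches."""
    def __init__(self, in_channels=3, embed_dim=768, patch=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, video):
        x = self.proj(video)                  # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, N, D) with N = T' * H' * W'

if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 224, 224)    # 16 frames at 224x224
    tokens = TubeletEmbedding()(clip)
    n = tokens.shape[1]                       # 8 * 14 * 14 = 1568 tokens
    print(f"tokens: {n}, full self-attention ~ O(N^2) = {n ** 2:,} pairwise interactions")
```

Halving the number of tokens (e.g., via larger patches, frame subsampling, or restricting which tokens each attention operation attends to) cuts the pairwise-interaction count by roughly a factor of four, which is why token reduction is the main lever the surveyed designs pull.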
Related papers
- vid-TLDR: Training Free Token merging for Light-weight Video Transformer [14.143681665368856]
Video Transformers suffer from heavy computational costs induced by the massive number of tokens across entire video frames.
We propose training-free token merging for lightweight video Transformers (vid-TLDR).
We introduce a saliency-aware token merging strategy that drops background tokens and sharpens object scores.
arXiv Detail & Related papers (2024-03-20T07:15:22Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that, using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that additionally leverage large-scale unlabeled data.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding [62.502694656615496]
We present Progressive Point Patch Embedding and a new point cloud Transformer model, PViT.
PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling the Transformer to achieve performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
arXiv Detail & Related papers (2022-08-25T17:59:29Z)
- Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
- Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which uses separate encoders for different views of the input video.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z)
- Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent (see the sketch after this list).
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
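Picking up the forward reference from the Long-Short Temporal Contrastive Learning entry above, and the survey's observation that self-supervised training of Video Transformers is so far dominated by contrastive losses, the following is a minimal, hypothetical sketch of a clip-level contrastive (InfoNCE-style) objective between a short clip and a longer clip of the same video. The function name, temperature, and the assumption of one short/long pair per video are illustrative choices, not the exact formulation of any cited paper.

```python
# Hypothetical sketch of a clip-level contrastive (InfoNCE-style) objective:
# embeddings of a short clip and a longer clip from the same video are pulled
# together, while clips from other videos in the batch act as negatives.
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(z_short: torch.Tensor,
                                z_long: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """z_short, z_long: (B, D) clip-level embeddings; row i of each comes from the same video."""
    z_short = F.normalize(z_short, dim=-1)
    z_long = F.normalize(z_long, dim=-1)
    logits = z_short @ z_long.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(z_short.size(0), device=z_short.device)
    # Symmetric InfoNCE: each short clip must match its own long clip and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    B, D = 8, 256
    loss = long_short_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

In practice the two embeddings would come from (possibly shared) video Transformer encoders followed by projection heads; the sketch only shows the loss itself.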