Video Swin Transformer
- URL: http://arxiv.org/abs/2106.13230v1
- Date: Thu, 24 Jun 2021 17:59:46 GMT
- Title: Video Swin Transformer
- Authors: Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han
Hu
- Abstract summary: We advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off.
The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain.
Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
- Score: 41.41741134859565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vision community is witnessing a modeling shift from CNNs to
Transformers, where pure Transformer architectures have attained top accuracy
on the major video recognition benchmarks. These video models are all built on
Transformer layers that globally connect patches across the spatial and
temporal dimensions. In this paper, we instead advocate an inductive bias of
locality in video Transformers, which leads to a better speed-accuracy
trade-off compared to previous approaches which compute self-attention globally
even with spatial-temporal factorization. The locality of the proposed video
architecture is realized by adapting the Swin Transformer designed for the
image domain, while continuing to leverage the power of pre-trained image
models. Our approach achieves state-of-the-art accuracy on a broad range of
video recognition benchmarks, including on action recognition (84.9 top-1
accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less
pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1
accuracy on Something-Something v2). The code and models will be made publicly
available at https://github.com/SwinTransformer/Video-Swin-Transformer.
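The core idea above is to restrict self-attention to non-overlapping 3D (temporal x height x width) windows, with alternate layers cyclically shifting the windows so information flows across window boundaries. Below is a minimal sketch of that locality mechanism, not the official implementation: the helper names and the (2, 7, 7) window size are illustrative assumptions, and only the window partition / shift steps are shown (the attention itself is standard multi-head attention within each window).

```python
# Minimal illustration (not the official code) of 3D window partitioning and
# shifted windows, the locality mechanism described in the abstract.
import torch

def window_partition_3d(x, window_size):
    """Split a video feature map into non-overlapping 3D windows.

    x: (B, D, H, W, C) tensor; D, H, W are assumed divisible by the window size.
    Returns: (num_windows * B, Wd * Wh * Ww, C), i.e. one token sequence per window.
    """
    B, D, H, W, C = x.shape
    Wd, Wh, Ww = window_size
    x = x.view(B, D // Wd, Wd, H // Wh, Wh, W // Ww, Ww, C)
    # Group the window axes together, then flatten each window into a sequence.
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, Wd * Wh * Ww, C)

def shifted_window_partition_3d(x, window_size):
    """Cyclically shift by half a window before partitioning, so the next
    attention layer mixes tokens across the previous layer's window borders."""
    Wd, Wh, Ww = window_size
    shifted = torch.roll(x, shifts=(-(Wd // 2), -(Wh // 2), -(Ww // 2)), dims=(1, 2, 3))
    return window_partition_3d(shifted, window_size)

if __name__ == "__main__":
    # Toy example: batch of 2, 8 frames, 56x56 spatial tokens, 96 channels.
    feat = torch.randn(2, 8, 56, 56, 96)
    windows = window_partition_3d(feat, (2, 7, 7))
    print(windows.shape)  # (512, 98, 96): attention runs within each 2*7*7-token window
```

Because attention is computed within fixed-size windows rather than over all patches, the cost grows linearly with the number of video tokens instead of quadratically, which is the speed-accuracy trade-off the abstract refers to.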
Related papers
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z) - Deformable Video Transformer [44.71254375663616]
We introduce the Deformable Video Transformer (DVT), which predicts a small subset of video patches to attend for each query location based on motion information.
Our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on four datasets.
arXiv Detail & Related papers (2022-03-31T04:52:27Z) - Co-training Transformer with Videos and Images Improves Action
Recognition [49.160505782802886]
In learning action recognition, models are typically pretrained on object recognition images, such as ImageNet, and later fine-tuned on the target action recognition task with videos.
This approach has achieved good empirical performance especially with recent transformer-based video architectures.
We show how video transformers benefit from joint training on diverse video datasets and label spaces.
arXiv Detail & Related papers (2021-12-14T05:41:39Z) - Improved Multiscale Vision Transformers for Classification and Detection [80.64111139883694]
We study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition.
arXiv Detail & Related papers (2021-12-02T18:59:57Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, making them invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)