VideoLightFormer: Lightweight Action Recognition using Transformers
- URL: http://arxiv.org/abs/2107.00451v1
- Date: Thu, 1 Jul 2021 13:55:52 GMT
- Title: VideoLightFormer: Lightweight Action Recognition using Transformers
- Authors: Raivo Koot, Haiping Lu
- Abstract summary: We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets.
- Score: 8.871042314510788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient video action recognition remains a challenging problem. One large
model after another takes the place of the state-of-the-art on the Kinetics
dataset, but real-world efficiency evaluations are often lacking. In this work,
we fill this gap and investigate the use of transformers for efficient action
recognition. We propose a novel, lightweight action recognition architecture,
VideoLightFormer. In a factorized fashion, we carefully extend the 2D
convolutional Temporal Segment Network with transformers, while maintaining
spatial and temporal video structure throughout the entire model. Existing
methods often resort to one of the two extremes, where they either apply huge
transformers to video features, or minimal transformers on highly pooled video
features. Our method differs from them by keeping the transformer models small,
but leveraging full spatiotemporal feature structure. We evaluate
VideoLightFormer in a high-efficiency setting on the temporally-demanding
EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it
achieves a better mix of efficiency and accuracy than existing state-of-the-art
models, apart from the Temporal Shift Module on SSV2.
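Reading the abstract, the factorized design can be pictured as a 2D CNN backbone applied per frame (as in a Temporal Segment Network), followed by a small spatial transformer over each frame's feature map and a small temporal transformer over the resulting frame tokens. The PyTorch sketch below illustrates that general pattern only; the backbone choice (ResNet-18), class count, layer counts, dimensions, and pooling strategy are illustrative assumptions and not the paper's actual configuration, and positional embeddings are omitted for brevity.

```python
# Minimal sketch of a factorized "2D CNN + small transformers" video classifier.
# All module names, sizes, and layer counts are illustrative assumptions,
# not the configuration used in VideoLightFormer.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class FactorizedVideoTransformer(nn.Module):
    """TSN-style 2D backbone per frame, then a small spatial transformer over
    each frame's feature map, then a small temporal transformer over the
    per-frame tokens, keeping spatial and temporal structure separate."""

    def __init__(self, num_classes=174, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        backbone = resnet18(weights=None)                            # any 2D CNN backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])    # -> (B*T, 512, h, w)
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)

        spatial_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.spatial_tf = nn.TransformerEncoder(spatial_layer, n_layers)   # attends over h*w tokens per frame
        self.temporal_tf = nn.TransformerEncoder(temporal_layer, n_layers) # attends over T frame tokens
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):                                  # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.proj(self.cnn(video.flatten(0, 1)))       # (B*T, d, h, w)
        d, h, w = feats.shape[1:]
        tokens = feats.flatten(2).transpose(1, 2)               # (B*T, h*w, d)
        tokens = self.spatial_tf(tokens)                         # spatial attention within each frame
        frame_tokens = tokens.mean(dim=1).view(B, T, d)          # pool space, keep time axis
        frame_tokens = self.temporal_tf(frame_tokens)            # temporal attention across frames
        return self.head(frame_tokens.mean(dim=1))               # clip-level prediction


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 224, 224)                       # 2 clips of 8 frames each
    logits = FactorizedVideoTransformer()(clip)
    print(logits.shape)                                          # torch.Size([2, 174])
```

The key point of the sketch is that both transformers stay small (few layers, modest width) while the full spatial grid and full frame sequence are preserved until the final pooling steps, rather than applying one large transformer to heavily pooled features.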
Related papers
- ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video [15.952896909797728]
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks.
Recent research is shifting its focus toward parameter-efficient image-to-video adaptation.
We present a new adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks.
arXiv Detail & Related papers (2023-10-02T16:41:20Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z) - Two-Stream Transformer Architecture for Long Video Understanding [5.001789577362836]
This paper introduces an efficient Spatio-Temporal Attention Network (STAN) which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features.
Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
arXiv Detail & Related papers (2022-08-02T21:03:48Z) - VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, so that they are invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)