VideoLightFormer: Lightweight Action Recognition using Transformers
- URL: http://arxiv.org/abs/2107.00451v1
- Date: Thu, 1 Jul 2021 13:55:52 GMT
- Title: VideoLightFormer: Lightweight Action Recognition using Transformers
- Authors: Raivo Koot, Haiping Lu
- Abstract summary: We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets.
- Score: 8.871042314510788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient video action recognition remains a challenging problem. One large
model after another takes the place of the state-of-the-art on the Kinetics
dataset, but real-world efficiency evaluations are often lacking. In this work,
we fill this gap and investigate the use of transformers for efficient action
recognition. We propose a novel, lightweight action recognition architecture,
VideoLightFormer. In a factorized fashion, we carefully extend the 2D
convolutional Temporal Segment Network with transformers, while maintaining
spatial and temporal video structure throughout the entire model. Existing
methods often resort to one of the two extremes, where they either apply huge
transformers to video features, or minimal transformers on highly pooled video
features. Our method differs from them by keeping the transformer models small,
but leveraging full spatiotemporal feature structure. We evaluate
VideoLightFormer in a high-efficiency setting on the temporally-demanding
EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it
achieves a better mix of efficiency and accuracy than existing state-of-the-art
models, apart from the Temporal Shift Module on SSV2.
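Reading the abstract, the factorized design can be pictured as a 2D CNN backbone applied per frame (as in a Temporal Segment Network), followed by a small spatial transformer over each frame's feature map and a small temporal transformer over the resulting frame tokens. The PyTorch sketch below illustrates that general pattern only; the backbone choice (ResNet-18), class count, layer counts, dimensions, and pooling strategy are illustrative assumptions and not the paper's actual configuration, and positional embeddings are omitted for brevity.

```python
# Minimal sketch of a factorized "2D CNN + small transformers" video classifier.
# All module names, sizes, and layer counts are illustrative assumptions,
# not the configuration used in VideoLightFormer.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class FactorizedVideoTransformer(nn.Module):
    """TSN-style 2D backbone per frame, then a small spatial transformer over
    each frame's feature map, then a small temporal transformer over the
    per-frame tokens, keeping spatial and temporal structure separate."""

    def __init__(self, num_classes=174, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        backbone = resnet18(weights=None)                            # any 2D CNN backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])    # -> (B*T, 512, h, w)
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)

        spatial_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.spatial_tf = nn.TransformerEncoder(spatial_layer, n_layers)   # attends over h*w tokens per frame
        self.temporal_tf = nn.TransformerEncoder(temporal_layer, n_layers) # attends over T frame tokens
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):                                  # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.proj(self.cnn(video.flatten(0, 1)))       # (B*T, d, h, w)
        d, h, w = feats.shape[1:]
        tokens = feats.flatten(2).transpose(1, 2)               # (B*T, h*w, d)
        tokens = self.spatial_tf(tokens)                         # spatial attention within each frame
        frame_tokens = tokens.mean(dim=1).view(B, T, d)          # pool space, keep time axis
        frame_tokens = self.temporal_tf(frame_tokens)            # temporal attention across frames
        return self.head(frame_tokens.mean(dim=1))               # clip-level prediction


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 224, 224)                       # 2 clips of 8 frames each
    logits = FactorizedVideoTransformer()(clip)
    print(logits.shape)                                          # torch.Size([2, 174])
```

The key point of the sketch is that both transformers stay small (few layers, modest width) while the full spatial grid and full frame sequence are preserved until the final pooling steps, rather than applying one large transformer to heavily pooled features.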
Related papers
- ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video [15.952896909797728]
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks.
Recent research is shifting its focus toward parameter-efficient image-to-video adaptation.
We present a new adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks.
arXiv Detail & Related papers (2023-10-02T16:41:20Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z) - Two-Stream Transformer Architecture for Long Video Understanding [5.001789577362836]
This paper introduces an efficient Spatio-Temporal Attention Network (STAN) which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features.
Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
arXiv Detail & Related papers (2022-08-02T21:03:48Z) - VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z) - Self-supervised Video Transformer [46.295395772938214]
From a given video, we create local and global views with varying spatial sizes and frame rates.
Our self-supervised objective seeks to match the features of different views representing the same video, so that they are invariant to spatiotemporal variations.
Our approach performs well on four action benchmarks and converges faster with small batch sizes.
arXiv Detail & Related papers (2021-12-02T18:59:02Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model the complexity of which scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)