Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition
- URL: http://arxiv.org/abs/2307.06947v4
- Date: Fri, 27 Oct 2023 15:16:53 GMT
- Title: Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition
- Authors: Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman
Khan, Mubarak Shah, Fahad Shahbaz Khan
- Abstract summary: Video-FocalNet is an effective and efficient architecture for video recognition that models both local global contexts.
Video-FocalNet is based on a-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably well against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
- Score: 112.66832145320434
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent video recognition models utilize Transformer models for long-range
spatio-temporal context modeling. Video transformer designs are based on
self-attention that can model global context at a high computational cost. In
comparison, convolutional designs for videos offer an efficient alternative but
lack long-range dependency modeling. Towards achieving the best of both
designs, this work proposes Video-FocalNet, an effective and efficient
architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that
reverses the interaction and aggregation steps of self-attention for better
efficiency. Further, the aggregation step and the interaction step are both
implemented using efficient convolution and element-wise multiplication
operations that are computationally less expensive than their self-attention
counterparts on video representations. We extensively explore the design space
of focal modulation-based spatio-temporal context modeling and demonstrate our
parallel spatial and temporal encoding design to be the optimal choice.
Video-FocalNets perform favorably against the state-of-the-art
transformer-based models for video recognition on five large-scale datasets
(Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower
computational cost. Our code/models are released at
https://github.com/TalalWasim/Video-FocalNets.
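To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of spatio-temporal focal modulation: the context is first aggregated with cheap depth-wise convolutions (a hierarchical spatial branch and a parallel temporal branch), and then interacts with each query token through element-wise multiplication. The class name, layer layout, and tensor shapes are illustrative assumptions for this sketch, not the official Video-FocalNets implementation (see the linked repository for that).

```python
# Minimal sketch of a spatio-temporal focal modulation block (assumed layout,
# not the official code): aggregation via depth-wise convolutions, interaction
# via element-wise multiplication, with parallel spatial and temporal branches.
import torch
import torch.nn as nn


class SpatioTemporalFocalModulation(nn.Module):
    def __init__(self, dim: int, focal_levels: int = 2, kernel_size: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # Project input tokens to query, context, and per-level gates.
        self.proj_in = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # Hierarchical spatial aggregation: stacked depth-wise 2D convolutions.
        self.spatial_aggr = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
                nn.GELU(),
            )
            for _ in range(focal_levels)
        ])
        # Temporal aggregation: depth-wise 1D convolution over the frame axis.
        self.temporal_aggr = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) video tokens.
        B, T, H, W, C = x.shape
        q, ctx, gates = torch.split(self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)

        # Spatial branch: aggregate context per frame at multiple focal levels.
        ctx_sp = ctx.reshape(B * T, H, W, C).permute(0, 3, 1, 2)  # (B*T, C, H, W)
        modulator = 0
        for lvl, conv in enumerate(self.spatial_aggr):
            ctx_sp = conv(ctx_sp)
            gate = gates[..., lvl].reshape(B * T, 1, H, W)
            modulator = modulator + gate * ctx_sp
        # Final level: globally pooled spatial context.
        gate_global = gates[..., -1].reshape(B * T, 1, H, W)
        modulator = modulator + gate_global * ctx_sp.mean(dim=(2, 3), keepdim=True)
        modulator = modulator.permute(0, 2, 3, 1).reshape(B, T, H, W, C)

        # Temporal branch (parallel to the spatial one): aggregate over frames.
        ctx_t = ctx.permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        ctx_t = self.temporal_aggr(ctx_t).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # Interaction: element-wise modulation of the query by both contexts.
        out = q * (modulator + ctx_t)
        return self.proj_out(out)
```

In this sketch, `SpatioTemporalFocalModulation(dim=96)` maps a tensor of shape `(B, T, H, W, 96)` to a tensor of the same shape, so such a block could stand in for a self-attention layer in a stage-wise video backbone while using only convolutions and element-wise products.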
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z) - VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt Transformer for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z) - Deformable Video Transformer [44.71254375663616]
We introduce the Deformable Video Transformer (DVT), which predicts a small subset of video patches to attend for each query location based on motion information.
Our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on four datasets.
arXiv Detail & Related papers (2022-03-31T04:52:27Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)