Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling
- URL: http://arxiv.org/abs/2208.12257v1
- Date: Thu, 25 Aug 2022 17:59:00 GMT
- Title: Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling
- Authors: Rui Wang and Zuxuan Wu and Dongdong Chen and Yinpeng Chen and Xiyang
Dai and Mengchen Liu and Luowei Zhou and Lu Yuan and Yu-Gang Jiang
- Abstract summary: Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
- Score: 125.95527079960725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have achieved top performance on major video
recognition benchmarks. Benefiting from the self-attention mechanism, these
models show stronger ability of modeling long-range dependencies compared to
CNN-based models. However, significant computation overheads, resulting from the
quadratic complexity of self-attention on top of a tremendous number of tokens,
limit the use of existing video transformers in applications with limited
resources like mobile devices. In this paper, we extend Mobile-Former to Video
Mobile-Former, which decouples the video architecture into a lightweight
3D-CNN for local context modeling and Transformer modules for global
interaction modeling, arranged in parallel. To avoid the significant computational
cost incurred by computing self-attention between the large number of local
patches in videos, we propose to use very few global tokens (e.g., 6) for a
whole video in the Transformer, which exchange information with the 3D-CNN via a
cross-attention mechanism. Through this efficient global spatial-temporal modeling,
Video Mobile-Former significantly improves the video recognition performance of
alternative lightweight baselines, and outperforms other efficient CNN-based
models in the low-FLOP regime (500M to 6G total FLOPs) on various video
recognition tasks. It is worth noting that Video Mobile-Former is the first
Transformer-based video model which constrains the computational budget within
1G FLOPs.
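To make the token-exchange mechanism concrete, below is a minimal PyTorch sketch of bidirectional cross-attention between a handful of learnable global tokens and a 3D-CNN feature map, written from the description above; it is not the authors' implementation, and the module names, dimensions, and omission of residual/normalization layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalTokenCrossAttention(nn.Module):
    """Sketch of the paper's idea: a few global tokens summarize the whole video
    and feed context back into the local 3D-CNN features via cross-attention."""

    def __init__(self, dim=128, num_tokens=6, num_heads=4):
        super().__init__()
        # e.g., 6 learnable global tokens shared across the whole video
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.local_to_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_to_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat):
        # feat: (B, C, T, H, W) feature map from a lightweight 3D-CNN stage
        b, c, t, h, w = feat.shape
        local = feat.flatten(2).transpose(1, 2)      # (B, T*H*W, C) local patch tokens
        g = self.tokens.expand(b, -1, -1)            # (B, num_tokens, C)
        # global tokens gather video-level context from the local features
        g, _ = self.local_to_global(query=g, key=local, value=local)
        # local features are refined with the global summary
        local, _ = self.global_to_local(query=local, key=g, value=g)
        return local.transpose(1, 2).reshape(b, c, t, h, w), g
```

Because only a few global tokens attend over the local patches (and vice versa), the cross-attention cost scales linearly rather than quadratically with the number of patches, which is how the design avoids the overhead discussed above.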
Related papers
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators.
VideoJAM achieves state-of-the-art performance in motion coherence.
These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z) - VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement [9.605944796068046]
We introduce VidFormer, a novel framework that integrates 3D convolutional networks (3DCNNs) and Transformer models for video-based remote physiological measurement tasks.
Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2025-01-03T08:18:08Z) - Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling [14.450847211200292]
Video understanding has become increasingly important with the rise of multi-modality applications.
We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling.
C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information.
The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations.
arXiv Detail & Related papers (2024-10-19T05:50:00Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
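For intuition about focal modulation in general (a simplified, image-style sketch, not the Video-FocalNets code), the snippet below first aggregates context with gated depthwise convolutions over growing receptive fields and then uses that context to modulate a per-token query, reversing the interaction-then-aggregation order of self-attention; the layer sizes and the purely spatial treatment are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FocalModulation2D(nn.Module):
    """Simplified focal modulation over one frame: aggregate context with
    depthwise convolutions, then modulate the query by elementwise product."""

    def __init__(self, dim=96, levels=2, base_kernel=3):
        super().__init__()
        self.levels = levels
        self.proj_in = nn.Linear(dim, 2 * dim + levels + 1)   # query, context, gates
        self.focal_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * l,
                          padding=(base_kernel + 2 * l) // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)            # modulator projection
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) tokens of a single frame
        c = x.size(-1)
        q, ctx, gates = torch.split(self.proj_in(x), [c, c, self.levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                          # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)                      # (B, levels+1, H, W)
        ctx_all = 0
        for l, conv in enumerate(self.focal_convs):            # aggregation first ...
            ctx = conv(ctx)
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        ctx_all = ctx_all + ctx.mean(dim=(2, 3), keepdim=True) * gates[:, self.levels:]
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)        # back to (B, H, W, C)
        return self.proj_out(q * modulator)                    # ... then per-token interaction
```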
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention which are designed to get contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
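As a rough illustration of this tokenize-then-encode pipeline (a sketch of the general idea, not the released ViViT model), the snippet below embeds a clip into spatio-temporal "tubelet" tokens with a strided 3D convolution and encodes them with standard transformer layers; positional embeddings, the class token, and the factorised variants are omitted, and all sizes are illustrative.

```python
import torch.nn as nn

class TubeletViT(nn.Module):
    """Minimal video transformer: 3D-conv tokenization + joint space-time encoder."""

    def __init__(self, dim=192, tubelet=(2, 16, 16), depth=4, heads=3, num_classes=400):
        super().__init__()
        # each (2 frames x 16 x 16 pixels) tubelet becomes one token
        self.to_tokens = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):
        # video: (B, 3, T, H, W); positional embeddings omitted for brevity
        tokens = self.to_tokens(video).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens)                              # joint space-time attention
        return self.head(tokens.mean(dim=1))                       # pool tokens and classify
```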
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Real-Time Video Inference on Edge Devices via Adaptive Model Streaming [9.101956442584251]
Real-time video inference on edge devices like mobile phones and drones is challenging due to the high cost of Deep Neural Networks.
We present Adaptive Model Streaming (AMS), a new approach to improving performance of efficient lightweight models for video inference on edge devices.
arXiv Detail & Related papers (2020-06-11T17:25:44Z)