Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling
- URL: http://arxiv.org/abs/2208.12257v1
- Date: Thu, 25 Aug 2022 17:59:00 GMT
- Title: Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling
- Authors: Rui Wang and Zuxuan Wu and Dongdong Chen and Yinpeng Chen and Xiyang
Dai and Mengchen Liu and Luowei Zhou and Lu Yuan and Yu-Gang Jiang
- Abstract summary: Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model that constrains the computational budget to within 1G FLOPs.
- Score: 125.95527079960725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have achieved top performance on major video
recognition benchmarks. Benefiting from the self-attention mechanism, these
models show stronger ability of modeling long-range dependencies compared to
CNN-based models. However, significant computation overheads, resulted from the
quadratic complexity of self-attention on top of a tremendous number of tokens,
limit the use of existing video transformers in applications with limited
resources like mobile devices. In this paper, we extend Mobile-Former to Video
Mobile-Former, which decouples the video architecture into a lightweight
3D-CNNs for local context modeling and a Transformer modules for global
interaction modeling in a parallel fashion. To avoid significant computational
cost incurred by computing self-attention between the large number of local
patches in videos, we propose to use very few global tokens (e.g., 6) for a
whole video in Transformers to exchange information with 3D-CNNs with a
cross-attention mechanism. Through efficient global spatial-temporal modeling,
Video Mobile-Former significantly improves the video recognition performance of
alternative lightweight baselines, and outperforms other efficient CNN-based
models at the low FLOP regime from 500M to 6G total FLOPs on various video
recognition tasks. It is worth noting that Video Mobile-Former is the first
Transformer-based video model which constrains the computational budget within
1G FLOPs.
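To make the parallel design described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a Mobile-Former-style bridge: a handful of global tokens gather context from the flattened 3D-CNN feature map and write it back, using cross-attention in both directions. The channel width, the token count, and the use of torch.nn.MultiheadAttention are illustrative assumptions; the key point is that the cost stays linear in the number of local patches because the global token set is tiny.

    import torch
    import torch.nn as nn

    class CrossAttentionBridge(nn.Module):
        # Two cross-attention steps per stage: global tokens read from the
        # 3D-CNN feature map, then the feature map reads back from the tokens.
        def __init__(self, dim=128, num_heads=4):
            super().__init__()
            self.tokens_from_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.video_from_tokens = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, tokens, feats):
            # tokens: (B, G, C) with G very small (e.g., 6 for the whole video)
            # feats:  (B, C, T, H, W) from the lightweight 3D-CNN branch
            B, C, T, H, W = feats.shape
            patches = feats.flatten(2).transpose(1, 2)            # (B, T*H*W, C)
            # Global tokens aggregate context from all local patches.
            tokens = tokens + self.tokens_from_video(tokens, patches, patches)[0]
            # Local patches read the globally aggregated context back.
            patches = patches + self.video_from_tokens(patches, tokens, tokens)[0]
            feats = patches.transpose(1, 2).reshape(B, C, T, H, W)
            return tokens, feats

    # Toy usage: 6 global tokens for an 8-frame clip with a 7x7 feature map.
    bridge = CrossAttentionBridge(dim=128)
    tokens = torch.zeros(2, 6, 128)          # learned global tokens in practice
    feats = torch.randn(2, 128, 8, 7, 7)     # output of a 3D-CNN stage
    tokens, feats = bridge(tokens, feats)

Interleaving such a bridge with the 3D-CNN stages provides global spatial-temporal modeling without any patch-to-patch self-attention, which is the source of the efficiency claimed above.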
Related papers
- Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling [14.450847211200292]
Video understanding has become increasingly important with the rise of multi-modality applications.
We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling.
C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information.
The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations.
arXiv Detail & Related papers (2024-10-19T05:50:00Z)
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention (a minimal sketch of this idea appears after this list).
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- Real-Time Video Inference on Edge Devices via Adaptive Model Streaming [9.101956442584251]
Real-time video inference on edge devices like mobile phones and drones is challenging due to the high computational cost of Deep Neural Networks.
We present Adaptive Model Streaming (AMS), a new approach to improving performance of efficient lightweight models for video inference on edge devices.
arXiv Detail & Related papers (2020-06-11T17:25:44Z)
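The Video-FocalNets entry above describes focal modulation as reversing the interaction and aggregation steps of self-attention: context is aggregated first with convolutions and gates, and only then interacts with each token by element-wise modulation. The following is a minimal, hypothetical sketch of that idea along the temporal axis only; it is not the authors' implementation, and the layer widths, number of focal levels, and use of depthwise 1D convolutions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TemporalFocalModulation(nn.Module):
        # Aggregation first: depthwise 1D convolutions widen the temporal
        # context over several focal levels, combined with per-token gates.
        # Interaction second: the aggregated context modulates a per-token
        # query by element-wise multiplication (no token-to-token attention).
        def __init__(self, dim=128, kernel=3, levels=2):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.ctx = nn.Linear(dim, dim)
            self.gates = nn.Linear(dim, levels + 1)   # one gate per level + global
            self.convs = nn.ModuleList([
                nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
                for _ in range(levels)
            ])
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                          # x: (batch, frames, dim)
            q = self.q(x)
            gates = self.gates(x)                      # (batch, frames, levels + 1)
            ctx = self.ctx(x).transpose(1, 2)          # (batch, dim, frames)
            agg = 0.0
            for level, conv in enumerate(self.convs):
                ctx = torch.relu(conv(ctx))            # progressively larger context
                agg = agg + ctx.transpose(1, 2) * gates[..., level:level + 1]
            # Global level: average over all frames, gated per token.
            agg = agg + ctx.mean(dim=2, keepdim=True).transpose(1, 2) * gates[..., -1:]
            return self.proj(q * agg)                  # interaction by modulation

    # Toy usage: a batch of 2 clips, 8 frame-level features of width 128.
    mod = TemporalFocalModulation(dim=128)
    out = mod(torch.randn(2, 8, 128))                  # -> (2, 8, 128)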