Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling
- URL: http://arxiv.org/abs/2208.12257v1
- Date: Thu, 25 Aug 2022 17:59:00 GMT
- Title: Video Mobile-Former: Video Recognition with Efficient Global
Spatial-temporal Modeling
- Authors: Rui Wang and Zuxuan Wu and Dongdong Chen and Yinpeng Chen and Xiyang
Dai and Mengchen Liu and Luowei Zhou and Lu Yuan and Yu-Gang Jiang
- Abstract summary: Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model that constrains the computational budget to within 1G FLOPs.
- Score: 125.95527079960725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have achieved top performance on major video
recognition benchmarks. Benefiting from the self-attention mechanism, these
models show stronger ability of modeling long-range dependencies compared to
CNN-based models. However, significant computation overheads, resulted from the
quadratic complexity of self-attention on top of a tremendous number of tokens,
limit the use of existing video transformers in applications with limited
resources like mobile devices. In this paper, we extend Mobile-Former to Video
Mobile-Former, which decouples the video architecture into a lightweight
3D-CNNs for local context modeling and a Transformer modules for global
interaction modeling in a parallel fashion. To avoid significant computational
cost incurred by computing self-attention between the large number of local
patches in videos, we propose to use very few global tokens (e.g., 6) for a
whole video in Transformers to exchange information with 3D-CNNs with a
cross-attention mechanism. Through efficient global spatial-temporal modeling,
Video Mobile-Former significantly improves the video recognition performance of
alternative lightweight baselines, and outperforms other efficient CNN-based
models at the low FLOP regime from 500M to 6G total FLOPs on various video
recognition tasks. It is worth noting that Video Mobile-Former is the first
Transformer-based video model which constrains the computational budget within
1G FLOPs.
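To make the parallel design described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a Mobile-Former-style bridge: a handful of global tokens gather context from the flattened 3D-CNN feature map and write it back, using cross-attention in both directions. The channel width, the token count, and the use of torch.nn.MultiheadAttention are illustrative assumptions; the key point is that the cost stays linear in the number of local patches because the global token set is tiny.

    import torch
    import torch.nn as nn

    class CrossAttentionBridge(nn.Module):
        # Two cross-attention steps per stage: global tokens read from the
        # 3D-CNN feature map, then the feature map reads back from the tokens.
        def __init__(self, dim=128, num_heads=4):
            super().__init__()
            self.tokens_from_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.video_from_tokens = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, tokens, feats):
            # tokens: (B, G, C) with G very small (e.g., 6 for the whole video)
            # feats:  (B, C, T, H, W) from the lightweight 3D-CNN branch
            B, C, T, H, W = feats.shape
            patches = feats.flatten(2).transpose(1, 2)            # (B, T*H*W, C)
            # Global tokens aggregate context from all local patches.
            tokens = tokens + self.tokens_from_video(tokens, patches, patches)[0]
            # Local patches read the globally aggregated context back.
            patches = patches + self.video_from_tokens(patches, tokens, tokens)[0]
            feats = patches.transpose(1, 2).reshape(B, C, T, H, W)
            return tokens, feats

    # Toy usage: 6 global tokens for an 8-frame clip with a 7x7 feature map.
    bridge = CrossAttentionBridge(dim=128)
    tokens = torch.zeros(2, 6, 128)          # learned global tokens in practice
    feats = torch.randn(2, 128, 8, 7, 7)     # output of a 3D-CNN stage
    tokens, feats = bridge(tokens, feats)

Interleaving such a bridge with the 3D-CNN stages provides global spatial-temporal modeling without any patch-to-patch self-attention, which is the source of the efficiency claimed above.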
Related papers
- Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling [14.450847211200292]
Video understanding has become increasingly important with the rise of multi-modality applications.
We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling.
C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information.
The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations.
arXiv Detail & Related papers (2024-10-19T05:50:00Z)
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention (a minimal sketch of this idea appears after this list).
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- Real-Time Video Inference on Edge Devices via Adaptive Model Streaming [9.101956442584251]
Real-time video inference on edge devices like mobile phones and drones is challenging due to the high computational cost of Deep Neural Networks.
We present Adaptive Model Streaming (AMS), a new approach to improving performance of efficient lightweight models for video inference on edge devices.
arXiv Detail & Related papers (2020-06-11T17:25:44Z)
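The Video-FocalNets entry above describes focal modulation as reversing the interaction and aggregation steps of self-attention: context is aggregated first with convolutions and gates, and only then interacts with each token by element-wise modulation. The following is a minimal, hypothetical sketch of that idea along the temporal axis only; it is not the authors' implementation, and the layer widths, number of focal levels, and use of depthwise 1D convolutions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TemporalFocalModulation(nn.Module):
        # Aggregation first: depthwise 1D convolutions widen the temporal
        # context over several focal levels, combined with per-token gates.
        # Interaction second: the aggregated context modulates a per-token
        # query by element-wise multiplication (no token-to-token attention).
        def __init__(self, dim=128, kernel=3, levels=2):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.ctx = nn.Linear(dim, dim)
            self.gates = nn.Linear(dim, levels + 1)   # one gate per level + global
            self.convs = nn.ModuleList([
                nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
                for _ in range(levels)
            ])
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                          # x: (batch, frames, dim)
            q = self.q(x)
            gates = self.gates(x)                      # (batch, frames, levels + 1)
            ctx = self.ctx(x).transpose(1, 2)          # (batch, dim, frames)
            agg = 0.0
            for level, conv in enumerate(self.convs):
                ctx = torch.relu(conv(ctx))            # progressively larger context
                agg = agg + ctx.transpose(1, 2) * gates[..., level:level + 1]
            # Global level: average over all frames, gated per token.
            agg = agg + ctx.mean(dim=2, keepdim=True).transpose(1, 2) * gates[..., -1:]
            return self.proj(q * agg)                  # interaction by modulation

    # Toy usage: a batch of 2 clips, 8 frame-level features of width 128.
    mod = TemporalFocalModulation(dim=128)
    out = mod(torch.randn(2, 8, 128))                  # -> (2, 8, 128)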