No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
- URL: http://arxiv.org/abs/2405.08344v1
- Date: Tue, 14 May 2024 06:32:40 GMT
- Title: No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
- Authors: Yingjie Zhai, Wenshuo Li, Yehui Tang, Xinghao Chen, Yunhe Wang
- Abstract summary: We propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding.
The proposed SqueezeTime is highly lightweight and fast while achieving high accuracy for mobile video understanding.
- Score: 38.60950616529459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current architectures for video understanding mainly build upon 3D convolutional blocks or 2D convolutions with additional operations for temporal modeling. However, these methods all regard the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their usage on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed \textit{SqueezeTime}, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture temporal dynamics of the sequence. This module has two complementary branches, in which one branch is for temporal importance learning and another branch with temporal position restoring capability is to enhance inter-temporal object modeling ability. The proposed SqueezeTime is highly lightweight and fast while achieving high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14, demonstrate the superiority of our model. For example, our SqueezeTime achieves $+1.2\%$ accuracy and $+80\%$ higher GPU throughput on Kinetics400 compared to prior methods. Code is publicly available at https://github.com/xinghaochen/SqueezeTime and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SqueezeTime.
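To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of folding the temporal axis into the channel dimension and processing the result with a toy two-branch block. The block layout, channel counts, and branch details are illustrative assumptions rather than the paper's actual Channel-Time Learning (CTL) block; see the official repositories linked above for the real implementation.

```python
# Minimal sketch of squeezing time into channels so ordinary 2D convolutions
# can model the whole clip. The ChannelTimeBlock below is a toy assumption.
import torch
import torch.nn as nn

class ChannelTimeBlock(nn.Module):
    """Toy two-branch block: one branch reweights the time-folded channels
    (temporal importance), the other applies a spatial convolution on them."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.importance = nn.Sequential(          # temporal-importance branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)  # object-modeling branch

    def forward(self, x):                          # x: (B, T*C, H, W)
        return self.spatial(x) * self.importance(x)

# Squeeze time into channels: (B, T, C, H, W) -> (B, T*C, H, W)
clip = torch.randn(2, 16, 3, 112, 112)             # batch of 16-frame RGB clips
b, t, c, h, w = clip.shape
folded = clip.reshape(b, t * c, h, w)
out = ChannelTimeBlock(t * c)(folded)
print(out.shape)                                    # torch.Size([2, 48, 112, 112])
```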
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - $R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding [41.69321731689751]
Video temporal grounding aims to ground relevant clips in untrimmed videos given natural language queries.
Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones.
We propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding.
arXiv Detail & Related papers (2024-03-31T21:17:48Z) - What Can Simple Arithmetic Operations Do for Temporal Modeling? [100.39047523315662]
Temporal modeling plays a crucial role in understanding video content.
Previous studies have built increasingly complicated temporal relations over the time sequence, enabled by ever more powerful hardware.
In this work, we explore the potential of four simple arithmetic operations for temporal modeling.
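As one concrete illustration of how a single arithmetic operation can act as a temporal-modeling primitive, the hedged sketch below uses frame-wise subtraction to derive motion cues from per-frame features. Which four operations the paper actually uses, and how it fuses them, is not specified in the summary above, so the difference function and the fusion step here are assumptions.

```python
# Hedged sketch: frame-wise subtraction as a cheap temporal-modeling signal.
import torch

def temporal_difference(features: torch.Tensor) -> torch.Tensor:
    """features: (B, T, C) per-frame features; returns (B, T-1, C) frame deltas."""
    return features[:, 1:] - features[:, :-1]

feats = torch.randn(4, 8, 256)          # 8 frames, 256-d features per frame
motion_cues = temporal_difference(feats)
# Illustrative fusion: concatenate appearance and motion summaries per clip.
clip_repr = torch.cat([feats.mean(dim=1), motion_cues.mean(dim=1)], dim=-1)
print(clip_repr.shape)                   # torch.Size([4, 512])
```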
arXiv Detail & Related papers (2023-07-18T00:48:56Z) - Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both long-term dynamics and short-term changes of video.
However, attending over an ever-growing history is computationally expensive for streaming video. To address this, we reformulate the cross-attention in a video transformer through the lens of kernels.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead.
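The constant-overhead property can be illustrated with a hedged sketch of exponentially decayed streaming attention: a running weighted sum of values and a running sum of weights are decayed and updated once per frame, so the state stays constant-size no matter how long the stream is. The fixed query, decay rate, and similarity function below are simplifications for illustration, not TeSTra's exact kernel formulation.

```python
# Hedged sketch of constant-memory streaming attention with exponential decay.
import torch

def streaming_step(state, key, value, query, decay: float = 0.9):
    """One O(1)-memory update: decay the running sums, then add the newest frame."""
    num, den = state
    w = torch.exp(query @ key)        # similarity weight of the newest key (fixed query)
    num = decay * num + w * value     # decayed running weighted sum of values
    den = decay * den + w             # decayed running sum of weights
    return (num, den), num / den      # smoothed attention readout

d = 64
state = (torch.zeros(d), torch.tensor(0.0))
query = torch.randn(d) / d ** 0.5     # fixed query: a simplification for this sketch
for _ in range(1000):                 # arbitrarily long stream, constant-size state
    key, value = torch.randn(d) / d ** 0.5, torch.randn(d)
    state, out = streaming_step(state, key, value, query)
print(out.shape)                      # torch.Size([64])
```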
arXiv Detail & Related papers (2022-09-19T17:59:02Z) - DSANet: Dynamic Segment Aggregation Network for Video-Level
Representation Learning [29.182482776910152]
Long-range and short-range temporal modeling are key aspects of video recognition.
In this paper, we introduce the Dynamic Segment Aggregation (DSA) module to capture relationships among snippets.
Our proposed DSA module is shown to benefit various video recognition models significantly.
arXiv Detail & Related papers (2021-05-25T17:09:57Z) - Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
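For intuition, here is a hedged sketch of a temporal bilinear interaction: two linear branches, one of which samples a temporally shifted frame, are multiplied element-wise to form a low-rank bilinear term across time. The shift offset, projection sizes, and single-module design are illustrative assumptions, not the paper's approximated bilinear modules.

```python
# Hedged sketch of a temporal bilinear interaction between neighbouring frames.
import torch
import torch.nn as nn

class TemporalBilinear(nn.Module):
    def __init__(self, dim: int, offset: int = 1):
        super().__init__()
        self.offset = offset
        self.proj_a = nn.Linear(dim, dim)          # main branch
        self.proj_b = nn.Linear(dim, dim)          # auxiliary branch (shifted frame)

    def forward(self, x):                          # x: (B, T, D) per-frame features
        shifted = torch.roll(x, shifts=-self.offset, dims=1)   # sample a later frame
        return self.proj_a(x) * self.proj_b(shifted)           # element-wise bilinear term

x = torch.randn(2, 8, 128)
print(TemporalBilinear(128)(x).shape)              # torch.Size([2, 8, 128])
```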
arXiv Detail & Related papers (2020-07-25T09:07:35Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves on state-of-the-art real-time methods on the UCF101 action recognition benchmark by 5.4% in accuracy and runs 2x faster at inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
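The idea of video-specific temporal kernels can be sketched as follows: a small predictor maps globally pooled features to a per-video 1D kernel, which is then convolved along the time axis. This single-branch toy module is a simplification in the spirit of TAM, not the paper's exact two-branch design; the layer sizes and kernel width are assumptions.

```python
# Hedged sketch of an adaptive temporal kernel predicted from the feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalKernel(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.predict = nn.Sequential(              # predicts one temporal kernel per video
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):                          # x: (B, C, T) pooled per-frame features
        b, c, t = x.shape
        kernel = self.predict(x.mean(dim=-1))      # (B, K), video-specific kernel
        x = x.reshape(1, b * c, t)                 # grouped conv trick: one kernel per video
        kernel = kernel.repeat_interleave(c, dim=0).unsqueeze(1)  # (B*C, 1, K)
        out = F.conv1d(x, kernel, padding=self.kernel_size // 2, groups=b * c)
        return out.reshape(b, c, t)

feats = torch.randn(2, 64, 8)                      # 2 videos, 64 channels, 8 frames
print(AdaptiveTemporalKernel(64)(feats).shape)     # torch.Size([2, 64, 8])
```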
arXiv Detail & Related papers (2020-05-14T08:22:45Z)