Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition
- URL: http://arxiv.org/abs/2503.13724v1
- Date: Mon, 17 Mar 2025 21:13:48 GMT
- Title: Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition
- Authors: Shristi Das Biswas, Efstathia Soufleri, Arani Roy, Kaushik Roy,
- Abstract summary: We propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance.
- Score: 6.168286187549952
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Unlike existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames and P-frames (motion vectors and residuals), offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem and ignore the temporal correlation and implicit sparsity across P-frames, which could be exploited to model stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of $56$ while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by $330\times$ versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame -- P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs ($0.73$J/V) and fast inference ($16$V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.
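The three-part design sketched in the abstract (dual encoders over I-frames and sparse P-frames, global self-attention for cross-modal interaction, and a mixer over the joint tokens) can be caricatured in a minimal NumPy sketch. Every name, dimension, and the thresholded "spiking" accumulator below is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_i_frames(i_frames):
    # Hypothetical stand-in for the spatial (I-frame) encoder:
    # project each frame's flattened features to a 64-dim token.
    W = rng.standard_normal((i_frames.shape[-1], 64)) * 0.02
    return i_frames @ W

def encode_p_frames(p_frames, threshold=0.5):
    # Stand-in for a Spiking Temporal Modulator: accumulate the sparse
    # P-frame signal over time and emit binary "spikes" where the
    # running potential crosses a threshold, then project to tokens.
    potential = np.cumsum(p_frames, axis=0)
    spikes = (potential > threshold).astype(np.float32)
    W = rng.standard_normal((p_frames.shape[-1], 64)) * 0.02
    return spikes @ W

def self_attention(tokens):
    # Single-head global self-attention over the joint token sequence,
    # letting I-frame and P-frame tokens attend to each other.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def mixer_block(tokens):
    # Toy Multi-Modal Mixer: mix across tokens, then across channels,
    # with a residual connection.
    token_mix = rng.standard_normal((tokens.shape[0], tokens.shape[0])) * 0.02
    chan_mix = rng.standard_normal((tokens.shape[1], tokens.shape[1])) * 0.02
    return (token_mix @ tokens) @ chan_mix + tokens

i_frames = rng.standard_normal((4, 128))   # 4 I-frames, flattened features
p_frames = rng.standard_normal((12, 128))  # 12 P-frames (motion + residual)

tokens = np.concatenate([encode_i_frames(i_frames), encode_p_frames(p_frames)])
fused = mixer_block(self_attention(tokens))
print(fused.shape)  # (16, 64): one fused token per input frame
```

The compute saving claimed in the abstract would come from the P-frame path: spikes are binary and sparse, so the projection there can skip most multiplications, which this dense sketch does not attempt to exploit.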
Related papers
- VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration [26.59510171451438]
Existing multi-sensor fusion research predominantly integrates complementary information from multiple images rather than videos.
VideoFusion exploits cross-modal complementarity and temporal dynamics to generate context-temporally coherent videos.
Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios.
arXiv Detail & Related papers (2025-03-30T08:27:18Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Mixture of Nested Experts: Adaptive Processing of Visual Tokens [49.43920770789789]
Vision Transformer (ViT) based models fail to capitalize on inherent redundancy, leading to higher computational costs.
We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve.
We validate our approach on standard image and video datasets - ImageNet-21K, Kinetics400, and Something-Something-v2.
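The nested-experts idea summarized above can be sketched as prefix slices of one shared weight matrix, so smaller experts reuse the parameters of larger ones, with a toy router spending more compute on higher-magnitude tokens. The routing rule, widths, and dimensions here are assumptions for illustration, not MoNE's actual gating:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, widths = 32, [8, 16, 32]  # nested expert widths (compute tiers)

# One full weight matrix; each "nested expert" uses only a leading slice
# of it, so smaller experts share parameters with larger ones.
W_up = rng.standard_normal((d_model, widths[-1])) * 0.02
W_down = rng.standard_normal((widths[-1], d_model)) * 0.02

def nested_expert(x, width):
    # Project through only the first `width` hidden units.
    h = np.maximum(x @ W_up[:, :width], 0.0)  # ReLU
    return h @ W_down[:width, :]

def route(tokens):
    # Toy router: assign tokens to compute tiers by magnitude, mimicking
    # the idea of processing redundant tokens through cheaper experts.
    norms = np.linalg.norm(tokens, axis=-1)
    tiers = np.digitize(norms, np.quantile(norms, [1 / 3, 2 / 3]))
    return np.stack([nested_expert(t, widths[tier])
                     for t, tier in zip(tokens, tiers)])

tokens = rng.standard_normal((10, d_model))
out = route(tokens)
print(out.shape)  # (10, 32)
```

Because the experts lie on one compute-accuracy curve by construction, the router can trade accuracy for FLOPs per token without keeping separate expert weights.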
arXiv Detail & Related papers (2024-07-29T13:19:31Z) - Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
arXiv Detail & Related papers (2024-06-12T01:12:53Z) - RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Streaming Radiance Fields for 3D Video Synthesis [32.856346090347174]
We present an explicit-grid based method for reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes.
Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per-frame with competitive rendering quality.
arXiv Detail & Related papers (2022-10-26T16:23:02Z) - A Simple Baseline for Video Restoration with Grouped Spatial-temporal Shift [36.71578909392314]
In this study, we propose a simple yet effective framework for video restoration.
Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique.
Our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost.
arXiv Detail & Related papers (2022-06-22T02:16:47Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramidal style network in the temporal domain to complete the multi-frame task in one-shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z) - A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2x faster inference, with a model requiring less than 5MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)