VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
- URL: http://arxiv.org/abs/2503.09387v2
- Date: Mon, 17 Mar 2025 03:09:39 GMT
- Title: VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
- Authors: Ruanjun Li, Yuedong Tan, Yuanming Shi, Jiawei Shao,
- Abstract summary: VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
- Score: 23.541896057977745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces VideoScan, an efficient vision-language model (VLM) inference framework designed for real-time video interaction that effectively comprehends and retains streamed video inputs while delivering rapid and accurate responses. A longstanding challenge in video understanding--particularly for long-term or real-time applications--stems from the substantial computational overhead caused by the extensive length of visual tokens. To address this, VideoScan employs a single semantic carrier token to represent each frame, progressively reducing computational and memory overhead during its two-phase inference process: prefilling and decoding. The embedding of the semantic carrier token is derived from an optimized aggregation of frame-level visual features, ensuring compact yet semantically rich representations. Critically, the corresponding key-value pairs are trained to retain contextual semantics from prior frames, enabling efficient memory management without sacrificing temporal coherence. During inference, the visual tokens of each frame are processed only once during the prefilling phase and subsequently discarded in the decoding stage, eliminating redundant computations. This design ensures efficient VLM inference even under stringent real-time constraints. Comprehensive experiments on diverse offline and online benchmarks demonstrate that LLaVA-Video, supported by our method, achieves up to $\sim 5\times$ and $1.29\times$ speedups compared to its original version and previous efficient streaming video understanding approaches, respectively. Crucially, these improvements are attained while maintaining competitive performance and ensuring stable GPU memory consumption (consistently $\sim 18$GB, independent of video duration).
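To make the two-phase design above concrete, here is a minimal sketch of the idea (illustrative only: the mean-pooling aggregation, the `StreamingMemory` class, and all sizes are assumptions, not the paper's actual implementation). Each frame's visual tokens are collapsed into one carrier embedding during prefilling, only that carrier's key-value pair is retained, and decoding attends over the compact per-frame memory.

```python
import torch

def semantic_carrier(frame_tokens: torch.Tensor) -> torch.Tensor:
    """Collapse a frame's visual tokens [num_tokens, dim] into one carrier
    embedding [dim]. VideoScan learns this aggregation; mean pooling is a
    stand-in for illustration only."""
    return frame_tokens.mean(dim=0)

class StreamingMemory:
    """Toy per-frame KV memory: after prefilling a frame, only the carrier's
    key/value pair is retained, so memory grows by one entry per frame
    instead of by hundreds of visual tokens."""
    def __init__(self):
        self.keys, self.values = [], []

    def prefill_frame(self, frame_tokens, w_k, w_v):
        carrier = semantic_carrier(frame_tokens)        # [dim]
        self.keys.append(carrier @ w_k)                 # retain one key ...
        self.values.append(carrier @ w_v)               # ... and one value
        # frame_tokens are discarded here -> no per-token KV-cache growth

    def decode_attend(self, query):
        K = torch.stack(self.keys)                      # [frames, dim]
        V = torch.stack(self.values)
        attn = torch.softmax(query @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V                                 # context for decoding

# usage: 3 frames, 196 visual tokens each, hidden size 64 (all made-up sizes)
dim = 64
w_k, w_v = torch.randn(dim, dim), torch.randn(dim, dim)
mem = StreamingMemory()
for _ in range(3):
    mem.prefill_frame(torch.randn(196, dim), w_k, w_v)
print(mem.decode_attend(torch.randn(dim)).shape)        # torch.Size([64])
```

In this sketch the memory still grows by one entry per frame; the paper additionally trains the carrier key-value pairs and manages the memory so that GPU consumption stays roughly constant regardless of video duration.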
Related papers
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax.
We train the model from a language backbone through three progressive stages, each incorporating lengthy videos (420s on average at 1 fps) thanks to the perceiving efficiency.
With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy.
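The Gumbel-Softmax cube partitioning can be pictured with a minimal sketch like the one below (illustrative only; the boundary-scoring logits and cube construction are assumptions, not Quicksviewer's actual design): per-frame logits are turned into differentiable, near-discrete boundary decisions that cut the stream into variable-length cubes.

```python
import torch
import torch.nn.functional as F

frames = torch.randn(32, 256)                 # 32 frame features (made up)
boundary_logits = torch.randn(32, 2)          # scores: [not-boundary, boundary]

# Differentiable, near-discrete boundary decisions via Gumbel-Softmax
# (hard=True uses the straight-through estimator).
decisions = F.gumbel_softmax(boundary_logits, tau=0.5, hard=True)[:, 1]

# Cut the stream into variable-length "cubes" at the sampled boundaries.
cuts = [i for i, d in enumerate(decisions.tolist()) if d == 1.0]
starts = [0] + cuts
ends = cuts + [frames.shape[0]]
cubes = [frames[s:e] for s, e in zip(starts, ends) if e > s]
print([c.shape[0] for c in cubes])            # varying cube lengths
```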
arXiv Detail & Related papers (2025-04-21T17:57:21Z)
- Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding [51.91097761028129]
We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding.
ProVideLLM integrates a multimodal cache configured to store two types of tokens.
By interleaving these tokens in our multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length.
arXiv Detail & Related papers (2025-04-10T17:13:08Z)
- FastVID: Dynamic Density Pruning for Fast Video Large Language Models [38.267065642416554]
We propose Dynamic Density Pruning for Fast Video LLMs, termed FastVID.
FastVID partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information.
Our method significantly reduces computational overhead while maintaining temporal and visual integrity.
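A very rough picture of segment-wise, density-based pruning follows (a sketch under assumed details, not FastVID's actual scoring rule): within each temporal segment, tokens that are highly similar to the segment centroid are treated as redundant and dropped.

```python
import torch

def prune_segment(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the least-redundant tokens in one temporal segment. 'Density' is
    approximated here by cosine similarity to the segment centroid -- an
    assumption for illustration, not FastVID's definition."""
    centroid = tokens.mean(dim=0, keepdim=True)
    density = torch.cosine_similarity(tokens, centroid, dim=-1)
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = torch.topk(-density, k).indices.sort().values   # least dense, original order
    return tokens[keep]

video = torch.randn(64, 196, 512)                  # 64 frames x 196 tokens (made-up sizes)
segments = video.chunk(8, dim=0)                   # temporally ordered segments
pruned = [prune_segment(seg.flatten(0, 1)) for seg in segments]
print(sum(p.shape[0] for p in pruned), "tokens kept of", 64 * 196)
```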
arXiv Detail & Related papers (2025-03-14T08:33:08Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedups of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
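The frame-redundancy step can be illustrated as below (a sketch: the similarity threshold and the use of random features in place of real DINOv2 embeddings are assumptions): consecutive frames whose features are nearly identical to the last kept frame are dropped before any spatial token reduction.

```python
import torch

def drop_redundant_frames(feats: torch.Tensor, thresh: float = 0.95) -> list[int]:
    """feats: [num_frames, dim] per-frame features (DINOv2 in LongVU; random
    here). Returns indices of frames kept after similarity-based filtering."""
    kept = [0]
    for i in range(1, feats.shape[0]):
        sim = torch.cosine_similarity(feats[i], feats[kept[-1]], dim=0)
        if sim < thresh:            # sufficiently different -> keep the frame
            kept.append(i)
    return kept

feats = torch.randn(100, 768)       # stand-in for DINOv2 frame embeddings
print(len(drop_redundant_frames(feats)), "of 100 frames kept")
```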
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences.
The SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length video with a constant number of video streaming tokens that are encoded via memory propagation and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
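Adaptive aggregation of similar frames can be sketched roughly as follows (illustrative only; TESTA's actual module is a learned, divided space-time aggregation, not this greedy pairwise merge): adjacent frame features above a similarity threshold are averaged into a single token.

```python
import torch

def merge_similar_frames(frames: torch.Tensor, thresh: float = 0.9) -> torch.Tensor:
    """Greedily average temporally adjacent frame features whose cosine
    similarity exceeds `thresh`. A toy stand-in for TESTA's learned
    temporal aggregation."""
    merged = [frames[0]]
    for f in frames[1:]:
        if torch.cosine_similarity(f, merged[-1], dim=0) > thresh:
            merged[-1] = (merged[-1] + f) / 2      # aggregate into the last token
        else:
            merged.append(f)
    return torch.stack(merged)

frames = torch.randn(60, 512)                      # 60 frame-level features (made up)
print(merge_similar_frames(frames).shape)          # <= 60 tokens after aggregation
```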
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
- ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, suffer from high computational bottlenecks, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we introduce two new datasets that emulate real-world video call and streaming scenarios, and we show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)