Related papers: StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

URL: http://arxiv.org/abs/2512.12560v1
Date: Sun, 14 Dec 2025 05:35:11 GMT
Title: StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding
Authors: Xinqi Jin, Hanxun Yu, Bohan Yu, Kebin Liu, Jian Liu, Keda Tao, Yixuan Pei, Huan Wang, Fan Dang, Jiangchuan Liu, Weiqiang Wang
Abstract summary: We propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT). We also design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned.
Score: 29.539015046656615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in high GPU memory usage and computational latency. To address these challenges, we propose token pruning as a means to reduce context length while retaining critical information. Specifically, we introduce a novel redundancy metric, Maximum Similarity to Spatially Adjacent Video Tokens (MSSAVT), which accounts for both token similarity and spatial position. To mitigate the bidirectional dependency between pruning and redundancy, we further design a masked pruning strategy that ensures only mutually unadjacent tokens are pruned. We also integrate an existing temporal redundancy-based pruning method to eliminate temporal redundancy of the video modality. Experimental results on multiple online and offline video understanding benchmarks demonstrate that our method significantly improves the accuracy (i.e., by up to 4%) while incurring a negligible pruning latency (i.e., less than 1 ms). Our full implementation will be made publicly available.
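
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "spatially adjacent" means the 4-connected neighbours on a frame's token grid, that similarity is cosine similarity, and that the mask is a checkerboard pattern, which is one simple way to guarantee that no two pruned tokens are adjacent. The function names, the `threshold` parameter, and its default value are hypothetical.

```python
import numpy as np

def max_neighbor_similarity(tokens):
    """tokens: (H, W, D) visual token features for one frame.

    Returns an (H, W) array holding, for each token, the maximum cosine
    similarity to its 4-connected grid neighbours (assumed adjacency).
    """
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    h, w, _ = unit.shape
    score = np.full((h, w), -np.inf)

    # Similarity between vertically adjacent tokens, credited to both sides.
    sim_v = np.sum(unit[:-1] * unit[1:], axis=-1)        # shape (H-1, W)
    score[:-1] = np.maximum(score[:-1], sim_v)
    score[1:] = np.maximum(score[1:], sim_v)

    # Similarity between horizontally adjacent tokens.
    sim_h = np.sum(unit[:, :-1] * unit[:, 1:], axis=-1)  # shape (H, W-1)
    score[:, :-1] = np.maximum(score[:, :-1], sim_h)
    score[:, 1:] = np.maximum(score[:, 1:], sim_h)
    return score

def masked_prune(tokens, threshold=0.9):
    """Return a boolean (H, W) keep-mask.

    Only tokens on one colour of a checkerboard are pruning candidates
    (an assumed masking scheme), so two pruned tokens are never
    4-adjacent and every pruned token keeps all of its neighbours.
    """
    h, w, _ = tokens.shape
    score = max_neighbor_similarity(tokens)
    rows, cols = np.indices((h, w))
    candidate = (rows + cols) % 2 == 0
    prune = candidate & (score > threshold)
    return ~prune
```

Restricting candidates to a mask sidesteps the dependency the abstract mentions: a token is judged redundant relative to its neighbours, so pruning a token and its neighbour together would remove the very token that justified the pruning.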

Related papers

Stateful Token Reduction for Long-Video Hybrid VLMs [69.6930118088911]
We study query-conditioned token reduction for hybrid video vision-language models (VLMs). We propose a low-to-high progressive reduction schedule and a unified language-aware scoring mechanism for both attention and Mamba blocks. Under an aggressive compression setting, our approach delivers substantial prefilling speedups with near-baseline accuracy at test time.
arXiv Detail & Related papers (2026-02-27T08:11:06Z)
LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification [14.954035477725276]
We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding. On LongVideoBench and Video-MME benchmarks, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.
arXiv Detail & Related papers (2026-02-26T23:28:13Z)
VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. We propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel. Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
arXiv Detail & Related papers (2026-01-25T15:02:01Z)
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference [49.84148668264725]
We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks.
arXiv Detail & Related papers (2025-10-20T17:35:47Z)
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference [5.146388234814547]
Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences.
arXiv Detail & Related papers (2025-10-16T12:34:38Z)
Dense Video Understanding with Gated Residual Tokenization [49.17263029080152]
High temporal resolution is essential for capturing fine-grained details in video understanding. Current benchmarks rely mostly on low-frame-rate sampling. Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
arXiv Detail & Related papers (2025-09-17T17:34:40Z)
Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration [24.337139909108117]
We identify a "less is more" phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Significant temporal redundancy also remains, which we term "visual echoes". Adaptive Frame-Pruning (AFP) employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. Our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%.
arXiv Detail & Related papers (2025-08-05T11:31:55Z)
HoliTom: Holistic Token Merging for Fast Video Large Language Models [32.620504076794795]
Video language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. We introduce HoliTom, a novel training-free holistic token merging framework. We also introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning.
arXiv Detail & Related papers (2025-05-27T15:28:45Z)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have a high computational bottleneck, and can't be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams. To evaluate our method, we introduce two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.