StreamingTOM: Streaming Token Compression for Efficient Video Understanding
- URL: http://arxiv.org/abs/2510.18269v1
- Date: Tue, 21 Oct 2025 03:39:41 GMT
- Title: StreamingTOM: Streaming Token Compression for Efficient Video Understanding
- Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
- Abstract summary: Existing approaches only regulate the post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA.
- Score: 6.9203477336374775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
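The two stages described in the abstract can be sketched in code. The snippet below is a minimal illustration, not the paper's exact algorithm: it assumes token features are plain numpy arrays, fuses adjacent-frame change with a given per-token saliency score under a fixed budget (the Causal Temporal Reduction idea), and stores kept tokens as 4-bit integers with a per-token scale (a simplified stand-in for Online Quantized Memory). All function names and the scoring rule are illustrative assumptions.

```python
import numpy as np

def select_tokens(prev_frame, curr_frame, saliency, budget):
    """Keep only the top-`budget` visual tokens for this frame, scored by
    adjacent-frame change fused with per-token saliency (a sketch of the
    fixed-budget selection idea, not the paper's exact rule)."""
    change = np.linalg.norm(curr_frame - prev_frame, axis=-1)  # per-token change, shape (T,)
    score = change * saliency                                  # fuse both signals
    keep = np.argsort(score)[-budget:]                         # fixed per-frame budget
    return np.sort(keep)

def quantize_4bit(tokens):
    """Quantize token features to 4-bit signed integers (range [-8, 7])
    with a per-token scale, so memory stays bounded as the stream grows."""
    scale = np.abs(tokens).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(tokens / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float features when a stored group is retrieved."""
    return q.astype(np.float32) * scale
```

In a streaming loop one would call `select_tokens` per incoming frame, push the quantized survivors into memory, and dequantize only the groups retrieved for the current query.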
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more granular temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z) - PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective [59.24570811503256]
We propose PIO-FVLM to reduce redundant visual tokens in vision-language models (VLMs) and accelerate inference. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance.
arXiv Detail & Related papers (2026-02-04T15:33:10Z) - ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models [4.273730624882391]
Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention.
arXiv Detail & Related papers (2026-02-01T00:28:55Z) - HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time and accurate understanding of video streams. HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z) - Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding [0.0]
CacheFlow is a training-free pipeline that pairs Dynamic Token Dropping with a long-term memory. Online, per-frame processing makes our approach fundamentally suited for live streaming VQA. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks.
arXiv Detail & Related papers (2025-11-17T17:56:14Z) - Attention Is All You Need for KV Cache in Diffusion LLMs [36.94369617373333]
Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches.
arXiv Detail & Related papers (2025-10-16T17:59:48Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z) - Post-Training Sparse Attention with Double Sparsity [44.772593893621085]
"Double Sparsity" is a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access.
Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens.
With offloading, it achieves a decoding speed acceleration of $16.3\times$ compared to state-of-the-art solutions at a sequence length of 256K.
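The two-level selection that Double Sparsity describes can be illustrated with a small sketch. This is an assumption-laden simplification (single query vector, single head, channel importance taken as aggregate key magnitude), not the paper's implementation: a few feature channels cheaply score tokens, and full attention then runs over only the top-scoring tokens.

```python
import numpy as np

def double_sparsity_attention(q, K, V, n_channels, n_tokens):
    """Sketch of token + channel sparsity: score tokens with a cheap
    partial dot product over important channels, then attend fully
    over only the retained tokens."""
    # Channel sparsity: pick the channels with the largest aggregate key magnitude.
    chan = np.argsort(np.abs(K).sum(axis=0))[-n_channels:]
    # Cheap approximate token scores using only those channels.
    approx = q[chan] @ K[:, chan].T
    # Token sparsity: keep the top-scoring tokens.
    keep = np.argsort(approx)[-n_tokens:]
    Ks, Vs = K[keep], V[keep]
    # Full attention restricted to the retained tokens (numerically stable softmax).
    logits = (q @ Ks.T) / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ Vs
```

The point of the design is that the expensive part (full-dimension attention) touches only `n_tokens` cache entries, so KV-cache reads shrink accordingly.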
arXiv Detail & Related papers (2024-08-11T18:40:36Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill-stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, incur high computational cost, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we create two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.