StreamingTOM: Streaming Token Compression for Efficient Video Understanding
- URL: http://arxiv.org/abs/2510.18269v1
- Date: Tue, 21 Oct 2025 03:39:41 GMT
- Title: StreamingTOM: Streaming Token Compression for Efficient Video Understanding
- Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
- Abstract summary: Existing approaches only regulate the post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA.
- Score: 6.9203477336374775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
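The two stages described in the abstract can be sketched in code. The snippet below is a minimal illustration, not the paper's exact algorithm: it assumes token features are plain numpy arrays, fuses adjacent-frame change with a given per-token saliency score under a fixed budget (the Causal Temporal Reduction idea), and stores kept tokens as 4-bit integers with a per-token scale (a simplified stand-in for Online Quantized Memory). All function names and the scoring rule are illustrative assumptions.

```python
import numpy as np

def select_tokens(prev_frame, curr_frame, saliency, budget):
    """Keep only the top-`budget` visual tokens for this frame, scored by
    adjacent-frame change fused with per-token saliency (a sketch of the
    fixed-budget selection idea, not the paper's exact rule)."""
    change = np.linalg.norm(curr_frame - prev_frame, axis=-1)  # per-token change, shape (T,)
    score = change * saliency                                  # fuse both signals
    keep = np.argsort(score)[-budget:]                         # fixed per-frame budget
    return np.sort(keep)

def quantize_4bit(tokens):
    """Quantize token features to 4-bit signed integers (range [-8, 7])
    with a per-token scale, so memory stays bounded as the stream grows."""
    scale = np.abs(tokens).max(axis=-1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(tokens / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float features when a stored group is retrieved."""
    return q.astype(np.float32) * scale
```

In a streaming loop one would call `select_tokens` per incoming frame, push the quantized survivors into memory, and dequantize only the groups retrieved for the current query.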
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more granular temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z) - PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective [59.24570811503256]
We propose PIO-FVLM to reduce redundant visual tokens in vision-language models (VLMs) and accelerate inference. The proposed PIO-FVLM is training-free and compatible with FlashAttention, making it friendly to practical application and deployment. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance.
arXiv Detail & Related papers (2026-02-04T15:33:10Z) - ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models [4.273730624882391]
Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention.
arXiv Detail & Related papers (2026-02-01T00:28:55Z) - HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time and accurate understanding of video streams. HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z) - Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z) - CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding [0.0]
CacheFlow is a training-free pipeline that pairs Dynamic Token Dropping with a long-term memory. Online, per-frame processing makes our approach fundamentally suited for live streaming VQA. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks.
arXiv Detail & Related papers (2025-11-17T17:56:14Z) - Attention Is All You Need for KV Cache in Diffusion LLMs [36.94369617373333]
Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches.
arXiv Detail & Related papers (2025-10-16T17:59:48Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z) - Post-Training Sparse Attention with Double Sparsity [44.772593893621085]
"Double Sparsity" is a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access.
Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens.
With offloading, it achieves a decoding speed acceleration of $16.3\times$ compared to state-of-the-art solutions at a sequence length of 256K.
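The two-level selection that Double Sparsity describes can be illustrated with a small sketch. This is an assumption-laden simplification (single query vector, single head, channel importance taken as aggregate key magnitude), not the paper's implementation: a few feature channels cheaply score tokens, and full attention then runs over only the top-scoring tokens.

```python
import numpy as np

def double_sparsity_attention(q, K, V, n_channels, n_tokens):
    """Sketch of token + channel sparsity: score tokens with a cheap
    partial dot product over important channels, then attend fully
    over only the retained tokens."""
    # Channel sparsity: pick the channels with the largest aggregate key magnitude.
    chan = np.argsort(np.abs(K).sum(axis=0))[-n_channels:]
    # Cheap approximate token scores using only those channels.
    approx = q[chan] @ K[:, chan].T
    # Token sparsity: keep the top-scoring tokens.
    keep = np.argsort(approx)[-n_tokens:]
    Ks, Vs = K[keep], V[keep]
    # Full attention restricted to the retained tokens (numerically stable softmax).
    logits = (q @ Ks.T) / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ Vs
```

The point of the design is that the expensive part (full-dimension attention) touches only `n_tokens` cache entries, so KV-cache reads shrink accordingly.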
arXiv Detail & Related papers (2024-08-11T18:40:36Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill-stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, incur high computational cost, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we create two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.