Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
- URL: http://arxiv.org/abs/2602.18434v1
- Date: Fri, 20 Feb 2026 18:59:50 GMT
- Title: Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
- Authors: Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava,
- Abstract summary: Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame.<n>We propose scaling the token budget to enable more granular-temporal understanding and reasoning.
- Score: 50.30283773196725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
Related papers
- HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time and accurate understanding of video streams.<n>Hermes reuses a compact KV cache, enabling efficient streaming understanding under resource constraints.<n>Hermes achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z) - video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory [51.03819128505358]
Video-SALMONN S is first to process 3-hour videos at 1 FPS and 360p resolution under a fixed memory budget.<n>A test-time-training memory module continually updates token representations to capture long-range dependencies.<n>A prompt-dependent memory reader retrieves context-relevant content from fixed-size memory.
arXiv Detail & Related papers (2025-10-13T08:20:15Z) - MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding [13.02027465520324]
We propose MARC, which integrates structured retrieval and RL-based distillation.<n>MARC achieves near-baseline accuracy using only one frame's tokens.<n>This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings.
arXiv Detail & Related papers (2025-10-09T08:07:19Z) - Dense Video Understanding with Gated Residual Tokenization [49.17263029080152]
High temporal resolution is essential for capturing fine-grained details in video understanding.<n>Current benchmarks rely mostly on low-frame-rate sampling.<n>Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
arXiv Detail & Related papers (2025-09-17T17:34:40Z) - StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding [14.50396424661833]
StreamMem is a query-agnostic KV cache memory mechanism for streaming video understanding.<n>It achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
arXiv Detail & Related papers (2025-08-21T16:56:29Z) - APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval [41.81696346270799]
Current large language models (LMs) struggle with hour-level video understanding.<n>bftextAdaptive textbfPivot MLbfVisual information textbfRetrieval (textbfAPVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information.
arXiv Detail & Related papers (2025-06-05T12:27:10Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
ReTaKe enables VideoLLMs to process 8 times longer frames (up to 2048), similar-sized models by 3-5% and even rivaling much larger ones on VideoMME, MLVU, LongVideoBench, and LVBench.<n>Our code is available at https://github.com/SCZwangxiao/video-ReTaKe.
arXiv Detail & Related papers (2024-12-29T15:42:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.