CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
- URL: http://arxiv.org/abs/2511.13644v1
- Date: Mon, 17 Nov 2025 17:56:14 GMT
- Title: CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
- Authors: Shrenik Patel, Daivik Patel,
- Abstract summary: CacheFlow is a training-free pipeline that pairs Dynamic Token Dropping with a long-term memory.<n>Online, per-frame processing makes our approach fundamentally suited for live streaming VQA.<n>At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% less tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame.<n>We propose scaling the token budget to enable more granular-temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z) - Flow caching for autoregressive video generation [72.10021661412364]
We present FlowCache, the first caching framework specifically designed for autoregressive video generation.<n>Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation.
arXiv Detail & Related papers (2026-02-11T13:11:04Z) - StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression [95.59657871147846]
We propose textbfStreamKV, a framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression.<n>Experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs.
arXiv Detail & Related papers (2025-11-10T16:25:03Z) - StreamingTOM: Streaming Token Compression for Efficient Video Understanding [6.9203477336374775]
Existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged.<n>We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency.<n> Experiments demonstrate our method achieves $15.7times$ kv-cache compression, $1.2times$ lower peak memory and $2times$ faster TTFT compared to prior SOTA.
arXiv Detail & Related papers (2025-10-21T03:39:41Z) - StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding [14.50396424661833]
StreamMem is a query-agnostic KV cache memory mechanism for streaming video understanding.<n>It achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
arXiv Detail & Related papers (2025-08-21T16:56:29Z) - InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding [26.408842739663346]
InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding.<n>It cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy-even in multi-turn dialogues.
arXiv Detail & Related papers (2025-06-18T02:22:14Z) - dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.<n>We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.<n>Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts.<n>ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units.<n>Result: ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - PQCache: Product Quantization-based KVCache for Long Context LLM Inference [27.523568511043273]
Key-Value Cache (KVCache) is the intermediate representations of tokens within Large Language Models (LLMs)<n>We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency.<n>PQCache achieves both effectiveness and efficiency, with 4.60% score improvement over existing methods on InfiniteBench.
arXiv Detail & Related papers (2024-07-01T13:05:42Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.