StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
- URL: http://arxiv.org/abs/2511.07278v1
- Date: Mon, 10 Nov 2025 16:25:03 GMT
- Title: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
- Authors: Yilong Chen, Xiang Bai, Zhibin Wang, Chengyu Bai, Yuhan Dai, Ming Lu, Shanghang Zhang,
- Abstract summary: We propose textbfStreamKV, a framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression.<n>Experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs.
- Score: 95.59657871147846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models (Video-LLMs) have demonstrated significant potential in the areas of video captioning, search, and summarization. However, current Video-LLMs still face challenges with long real-world videos. Recent methods have introduced a retrieval mechanism that retrieves query-relevant KV caches for question answering, enhancing the efficiency and accuracy of long real-world videos. However, the compression and retrieval of KV caches are still not fully explored. In this paper, we propose \textbf{StreamKV}, a training-free framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Compared to previous methods that used uniform partitioning, StreamKV dynamically partitions video streams into semantic segments, which better preserves semantic information. For KV cache retrieval, StreamKV calculates a summary vector for each segment to retain segment-level information essential for retrieval. For KV cache compression, StreamKV introduces a guidance prompt designed to capture the key semantic elements within each segment, ensuring only the most informative KV caches are retained for answering questions. Moreover, StreamKV unifies KV cache retrieval and compression within a single module, performing both in a layer-adaptive manner, thereby further improving the effectiveness of streaming video question answering. Extensive experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs, achieving superior accuracy while substantially improving both memory efficiency and computational latency. The code has been released at https://github.com/sou1p0wer/StreamKV.
Related papers
- Efficient Remote Prefix Fetching with GPU-native Media ASICs [15.991394335072547]
Remote KV cache reuse fetches KV cache for identical contexts from remote storage, avoiding recomputation, accelerating inference LLM.<n>Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the KV reuse benefits.<n>We propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs.
arXiv Detail & Related papers (2026-02-10T12:29:02Z) - DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations.<n>Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage.<n>Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z) - HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding [92.59317281526239]
HERMES is a training-free architecture for real-time and accurate understanding of video streams.<n>Hermes reuses a compact KV cache, enabling efficient streaming understanding under resource constraints.<n>Hermes achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
arXiv Detail & Related papers (2026-01-21T07:26:15Z) - StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding [14.50396424661833]
StreamMem is a query-agnostic KV cache memory mechanism for streaming video understanding.<n>It achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
arXiv Detail & Related papers (2025-08-21T16:56:29Z) - Sparse Attention across Multiple-context KV Cache [8.236266965773465]
Reusing historical Key-Value ( KV) Cache for improved inference efficiency has become a mainstream approach.<n>Recent advances further enhance throughput by sparse attention mechanisms to select the most relevant KV Cache.<n>This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache.
arXiv Detail & Related papers (2025-08-06T02:53:14Z) - FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression [18.12657364501536]
FAEDKV is a novel, training-free KV cache compression framework.<n>It preserves both early and recent contextual information.<n>Experiments on LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%.
arXiv Detail & Related papers (2025-07-26T18:20:25Z) - dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.<n>We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.<n>Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z) - Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [10.990431921021585]
We propose ReKV, a training-free approach that enables efficient streaming video question-answering (StreamingVQA)<n>Our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received.
arXiv Detail & Related papers (2025-03-01T15:53:33Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.<n>It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance.<n>Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts.<n>ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units.<n>Result: ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cachecentric perspective.<n>We provide an extensive KV cache-centric analysis of eight categories long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids.<n>Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n2) pre-filling perform robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.