HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
- URL: http://arxiv.org/abs/2601.14724v2
- Date: Mon, 26 Jan 2026 15:57:42 GMT
- Title: HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
- Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
- Abstract summary: HERMES is a training-free architecture for real-time and accurate understanding of video streams. HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. It achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
- Score: 92.59317281526239
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computation upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video-stream interactions and achieving 10$\times$ faster time-to-first-token (TTFT) than the prior state of the art. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
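The abstract's core idea — keeping the KV cache as a hierarchy that stores recent frames at fine granularity and older frames at coarser granularity, so a compact cache can be reused at query time — can be illustrated with a toy sketch. Everything below (the class name, the two-tier design, the mean-pooling merge rule, and all parameters) is an illustrative assumption for intuition only, not the paper's actual method.

```python
from collections import deque

class HierarchicalKVCache:
    """Toy two-tier KV cache: recent frames kept at full resolution,
    older frames merged into coarse summaries.

    The mean-pooling merge rule and all parameters are illustrative
    assumptions, not HERMES's actual mechanism.
    """

    def __init__(self, fine_capacity=4, merge_group=2):
        self.fine_capacity = fine_capacity  # frames kept at full detail
        self.merge_group = merge_group      # frames pooled per coarse slot
        self.fine = deque()                 # recent per-frame KV vectors
        self.coarse = []                    # pooled summaries of old frames

    def append_frame(self, kv):
        """Add one frame's KV vector (a list of floats)."""
        self.fine.append(kv)
        # Once the fine tier overflows, demote the oldest frames
        # into a single mean-pooled coarse entry.
        while len(self.fine) > self.fine_capacity:
            group = [self.fine.popleft() for _ in range(self.merge_group)]
            pooled = [sum(dims) / len(group) for dims in zip(*group)]
            self.coarse.append(pooled)

    def compact_view(self):
        """KV sequence reused at query time: coarse history + fine recent.

        Because the view is precomputed as frames stream in, a query
        triggers no extra cache construction -- mirroring the paper's
        claim of no auxiliary computation at query arrival.
        """
        return self.coarse + list(self.fine)

# Streaming six frames through the cache leaves five cache entries
# instead of six: the two oldest frames collapse into one summary.
cache = HierarchicalKVCache(fine_capacity=4, merge_group=2)
for i in range(6):
    cache.append_frame([float(i), float(i)])
view = cache.compact_view()
```

In this sketch the token saving comes entirely from the coarse tier; the real system's compression across "multiple granularities" would be attention-guided rather than simple pooling.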
Related papers
- Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory [50.30283773196725]
Existing approaches rely on key-value caching to accumulate frame-level details over time, but use a limited number of tokens per frame. We propose scaling the token budget to enable more fine-grained temporal understanding and reasoning.
arXiv Detail & Related papers (2026-02-20T18:59:50Z) - StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression [95.59657871147846]
We propose StreamKV, a framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs.
arXiv Detail & Related papers (2025-11-10T16:25:03Z) - StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding [14.50396424661833]
StreamMem is a query-agnostic KV cache memory mechanism for streaming video understanding. It achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
arXiv Detail & Related papers (2025-08-21T16:56:29Z) - LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval [13.891391928767195]
LiveVLM is a training-free framework specifically designed for streaming, online video understanding and real-time interaction. LiveVLM constructs a streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs. When a new question is posed, LiveVLM invokes an online question-answering process that efficiently fetches both short-term and long-term visual information.
arXiv Detail & Related papers (2025-05-21T08:47:15Z) - An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1 fps, thanks to its perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by up to 8.72 points in accuracy.
arXiv Detail & Related papers (2025-04-21T17:57:21Z) - VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers [23.541896057977745]
VideoScan is an efficient vision-language model (VLM) inference framework for real-time video interaction. VideoScan employs a single semantic carrier token to represent each frame.
arXiv Detail & Related papers (2025-03-12T13:30:40Z) - STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Streaming Video Question-Answering with In-context Video KV-Cache Retrieval [10.990431921021585]
We propose ReKV, a training-free approach that enables efficient streaming video question-answering (StreamingVQA). Our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received.
arXiv Detail & Related papers (2025-03-01T15:53:33Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long-video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges [39.666361965650836]
VideoLLaMB is a framework for long video understanding. Its SceneTiling algorithm segments videos into coherent semantic units. VideoLLaMB processes up to 320 frames on a single NVIDIA A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.