Fugu-MT Paper Translation (Abstract): HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Paper overview: HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

arxiv url: http://arxiv.org/abs/2601.14724v2
Date: Mon, 26 Jan 2026 15:57:42 GMT
Status: Translation complete
Last updated in system: 2026-01-27 15:23:06.900903
Title: HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
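Title (reference translation): HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
Abstract summary: HERMES is a training-free architecture for real-time and accurate understanding of video streams. HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. HERMES achieves superior or comparable accuracy across all benchmarks, with gains of up to 11.4% on streaming datasets.
Reference score (independently computed attention score): 92.59317281526239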
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
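Abstract (reference translation): Recent advances in Multimodal Large Language Models (MLLMs) have brought significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains difficult, because existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a new training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic investigation of attention, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, because HERMES requires no auxiliary computation when a user query arrives, real-time responses are guaranteed for continuous video stream interactions, and TTFT is 10$\times$ faster than prior SOTA. Even when video tokens are reduced by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy on all benchmarks, with gains of up to 11.4% on streaming datasets.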


Related papers