Fugu-MT Paper Translation (Abstract): Decouple and Cache: KV Cache Construction for Streaming Video Understanding

Paper overview: Decouple and Cache: KV Cache Construction for Streaming Video Understanding

arxiv url: http://arxiv.org/abs/2605.01858v1
Date: Sun, 03 May 2026 13:02:44 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.969115
Title: Decouple and Cache: KV Cache Construction for Streaming Video Understanding
Title (reference translation): Decouple and Cache: KV Cache Construction for Streaming Video Understanding
Authors: Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener, Angela Yao
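Abstract summary: Streaming video understanding requires processing unbounded video streams with limited memory and computation. The authors propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average accuracy gain of 2.5% over prior methods.
Reference score (independently computed attention metric): 53.55135022958052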
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Streaming video understanding requires processing unbounded video streams with limited memory and computation, which poses two key challenges. First, unbounded streams require continuously constructing new key-value (KV) caches and evicting old ones. Second, because collecting and training on unbounded streams is costly, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs either fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring that KV caches support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average accuracy gain of 2.5% over prior methods.
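Abstract (reference translation): Streaming video understanding means handling unbounded video streams under limited memory and computation, and this raises two important challenges. First, for an unbounded stream, new key-value (KV) caches have to be built continuously while old ones are evicted. Second, since collecting and training on unbounded streams is expensive, a model has to learn from short sequences and still generalize to long streams. Existing streaming VideoVLLMs either cannot scale to unbounded video streams or concentrate on cache reuse strategies, so the effect of cache construction remains unexplored. This paper proposes Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to the streaming setting. DSCache keeps a cumulative past KV cache and builds a separate instant cache on demand; the instant cache is decoupled from the past caches so that recent inputs stay informative. To allow position extrapolation beyond the training length, DSCache also adopts a position-agnostic encoding strategy, which lets KV caches support unseen positions and prevents position overflow. Experiments on Streaming Video QA benchmarks show state-of-the-art performance for DSCache, with accuracy improving by 2.5% on average over prior methods.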
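
As a rough illustration of the two ideas named in the abstract, the toy sketch below keeps a bounded past cache, builds an instant cache from each new chunk without reading the past cache, and numbers positions by slot within the bounded context rather than by absolute stream index. This is not the authors' implementation: the class `ToyDecoupledCache`, the helper `run_stream`, the oldest-first eviction rule, and the slot-index positions are all assumptions made for illustration, and strings stand in for real key/value tensors. The paper's position-agnostic encoding may work quite differently.

```python
from collections import deque


class ToyDecoupledCache:
    """Hypothetical toy: a bounded past cache plus a separately built instant cache."""

    def __init__(self, past_capacity):
        # deque(maxlen=...) silently drops the oldest entries once full,
        # which serves as the (assumed) eviction policy here.
        self.past = deque(maxlen=past_capacity)

    def build_instant(self, chunk):
        # Built from the new chunk alone, without reading self.past.
        # A string tag stands in for a real key/value tensor pair.
        return [f"kv({token})" for token in chunk]

    def context(self, instant):
        # Positions are slot indices inside the bounded context, not
        # absolute stream indices, so they cannot grow without limit.
        entries = list(self.past) + list(instant)
        return list(enumerate(entries))

    def commit(self, instant):
        # Fold the instant cache into the cumulative past cache.
        self.past.extend(instant)


def run_stream(num_chunks, chunk_size, past_capacity):
    """Feed a synthetic token stream and track the largest position used."""
    cache = ToyDecoupledCache(past_capacity)
    max_position = 0
    for c in range(num_chunks):
        chunk = [c * chunk_size + i for i in range(chunk_size)]
        instant = cache.build_instant(chunk)
        ctx = cache.context(instant)
        max_position = max(max_position, ctx[-1][0])
        cache.commit(instant)
    return cache, max_position


cache, max_position = run_stream(num_chunks=100, chunk_size=4, past_capacity=8)
print(len(cache.past), max_position)
```

In this toy, the largest position is bounded by `past_capacity + chunk_size - 1` regardless of how long the stream runs, which is the sense in which position overflow is avoided here.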
