Fugu-MT Paper Translation (Summary): StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression

Paper Overview: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression

arXiv URL: http://arxiv.org/abs/2511.07278v1
Date: Mon, 10 Nov 2025 16:25:03 GMT
Status: Translation complete
Last updated in system: 2025-11-11 21:18:45.366431
Title: StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Authors: Yilong Chen, Xiang Bai, Zhibin Wang, Chengyu Bai, Yuhan Dai, Ming Lu, Shanghang Zhang
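Abstract summary: We propose StreamKV, a framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Experiments on public StreamingVQA benchmarks show that StreamKV significantly outperforms existing online Video-LLMs.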
Reference score (independently computed attention level): 95.59657871147846
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Large Language Models (Video-LLMs) have demonstrated significant potential in video captioning, search, and summarization. However, current Video-LLMs still struggle with long real-world videos. Recent methods have introduced a retrieval mechanism that fetches query-relevant KV caches for question answering, improving both efficiency and accuracy on long real-world videos. However, KV cache compression and retrieval remain underexplored. In this paper, we propose StreamKV, a training-free framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Whereas previous methods partition the video uniformly, StreamKV dynamically partitions the video stream into semantic segments, which better preserves semantic information. For KV cache retrieval, StreamKV computes a summary vector for each segment to retain the segment-level information that retrieval needs. For KV cache compression, StreamKV introduces a guidance prompt designed to capture the key semantic elements within each segment, so that only the most informative KV caches are retained for answering questions. Moreover, StreamKV unifies KV cache retrieval and compression within a single module and performs both in a layer-adaptive manner, further improving the effectiveness of streaming video question answering. Extensive experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing Online Video-LLMs, achieving superior accuracy while substantially improving both memory efficiency and computational latency. The code has been released at https://github.com/sou1p0wer/StreamKV.
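The abstract names three mechanisms (semantic segmentation, summary-vector retrieval, guidance-prompt compression) without spelling out how they work. As a rough illustration only, the toy Python sketch below shows one plausible shape for each. Everything here is an assumption: the function names (`segment_stream`, `summary_vector`, `retrieve`, `compress`), the cosine-similarity threshold used to split segments, the mean-pooled summary vector, and the dot-product score against a "guidance" vector are hypothetical stand-ins, not StreamKV's actual method, and the layer-adaptive behaviour is not modeled. The released code at https://github.com/sou1p0wer/StreamKV is the authoritative reference.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_stream(frames: Sequence[Vector], threshold: float = 0.8) -> List[List[int]]:
    """Hypothetical dynamic segmentation: open a new segment whenever two
    consecutive frame features have cosine similarity below `threshold`.
    Returns one list of frame indices per segment."""
    segments: List[List[int]] = []
    for t, frame in enumerate(frames):
        if t == 0 or cosine(frames[t - 1], frame) < threshold:
            segments.append([t])
        else:
            segments[-1].append(t)
    return segments


def summary_vector(keys: Sequence[Vector]) -> Vector:
    """Hypothetical segment summary: element-wise mean of the segment's keys."""
    dim = len(keys[0])
    return [sum(k[i] for k in keys) / len(keys) for i in range(dim)]


def retrieve(summaries: Sequence[Vector], query: Vector, top_k: int = 1) -> List[int]:
    """Return the top_k segment indices by cosine(summary, query),
    re-sorted into temporal order."""
    ranked = sorted(range(len(summaries)),
                    key=lambda s: cosine(summaries[s], query), reverse=True)
    return sorted(ranked[:top_k])


def compress(keys: Sequence[Vector], guidance: Vector, keep: int) -> List[int]:
    """Hypothetical compression: keep the `keep` tokens whose keys have the
    highest dot product with a guidance vector; return positions in order."""
    scores = [sum(x * y for x, y in zip(k, guidance)) for k in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Toy stream: one 2-D "key" per frame; the scene changes after frame 1.
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    segments = segment_stream(frames)
    print(segments)  # [[0, 1], [2, 3]]
    summaries = [summary_vector([frames[i] for i in seg]) for seg in segments]
    hit = retrieve(summaries, query=[0.0, 1.0], top_k=1)
    print(hit)  # [1]
    seg = segments[hit[0]]
    kept = compress([frames[i] for i in seg], guidance=[1.0, 0.0], keep=1)
    print([seg[i] for i in kept])  # [3]
```

Re-sorting the selected indices keeps retrieved segments and retained tokens in temporal order, which is one reasonable choice when the kept KV entries are later fed back to a model; the sketch does not claim StreamKV does the same.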

Related Papers