Fugu-MT Paper Translation (Abstract): V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

Paper overview: V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

arxiv url: http://arxiv.org/abs/2512.12284v2
Date: Fri, 19 Dec 2025 08:02:44 GMT
Status: Translation complete
Last updated in system: 2025-12-22 13:33:13.42291
Title: V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim
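Abstract summary: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. These models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. We propose V-Rex, the first software-hardware co-designed accelerator that addresses both algorithmic and hardware bottlenecks in streaming video LLM inference.
Reference score (independently computed attention measure): 1.677021230191566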
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Due to its iterative prefill stage, it suffers from significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time of 3.9-8.3 FPS and energy-efficient streaming video LLM inference on edge deployment with negligible accuracy loss. While DRE only accounts for 2.2% power and 2.0% area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
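The abstract's point that KV caches grow with continuous video input can be illustrated with the standard size formula for a transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per cached token. The sketch below is a rough back-of-the-envelope estimate only. Every default (32 layers, 32 KV heads, head dimension 128, 2-byte values) and the frame rate and tokens-per-frame figures in the usage loop are hypothetical placeholders, not values taken from the V-Rex paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for `num_tokens` cached tokens.

    All default model dimensions are hypothetical placeholders.
    """
    # Factor 2: one key vector and one value vector per token,
    # stored for every layer and every KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


def cache_growth(fps, tokens_per_frame, seconds, **model):
    """Return (elapsed_seconds, cache_bytes) pairs, one per second of video.

    Assumes every frame's tokens are kept in the cache (no eviction,
    no retrieval), which is the worst case the abstract describes.
    """
    return [
        (t, kv_cache_bytes(fps * tokens_per_frame * t, **model))
        for t in range(1, seconds + 1)
    ]


if __name__ == "__main__":
    # Hypothetical stream: 2 frames per second, 196 visual tokens per frame.
    for t, size in cache_growth(fps=2, tokens_per_frame=196, seconds=60)[9::10]:
        print(f"{t:3d} s  {size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with stream duration, which is why approaches that retrieve or compress only part of the cache matter on memory-constrained edge devices.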

Related papers