Fugu-MT Paper Translation (Abstract): StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Paper overview: StreamingTOM: Streaming Token Compression for Efficient Video Understanding

arxiv url: http://arxiv.org/abs/2510.18269v1
Date: Tue, 21 Oct 2025 03:39:41 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:12.850817
Title: StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Title (reference translation): StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
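Abstract summary: Existing approaches regulate only the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. StreamingTOM is a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Experiments show 15.7× kv-cache compression, 1.2× lower peak memory, and 2× faster TTFT compared to prior SOTA.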
Reference score (independently computed attention level): 6.9203477336374775
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
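The abstract describes Causal Temporal Reduction only at a high level: a fixed per-frame budget, with tokens chosen by adjacent-frame change and token saliency. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the cosine-based change measure, the externally supplied `saliency` list, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_tokens(frame, prev_frame, saliency, budget, alpha=0.5):
    """Return sorted indices of at most `budget` tokens kept for this frame.

    Hypothetical scoring (an assumption, not taken from the paper):
    score = alpha * change + (1 - alpha) * saliency, where change is
    1 - cosine similarity to the same-position token in the previous frame.
    Only the previous frame is consulted, so the selection stays causal.
    """
    if prev_frame is None:
        change = [1.0] * len(frame)  # first frame: treat every token as new
    else:
        change = [1.0 - cosine(t, p) for t, p in zip(frame, prev_frame)]
    scores = [alpha * c + (1.0 - alpha) * s for c, s in zip(change, saliency)]
    ranked = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: four 2-d tokens per frame, budget of two tokens.
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
curr = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
saliency = [0.1, 0.2, 0.9, 0.3]
kept = select_tokens(curr, prev, saliency, budget=2)
print(kept)
```

Because the budget is fixed, the number of visual tokens passed to prefill per frame is constant regardless of frame content, which is what makes the per-frame cost predictable in this sketch.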
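Abstract (reference translation): Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes tokens to grow without bound, creating efficiency bottlenecks. However, existing approaches regulate only the post-LLM kv-cache, leaving the costly pre-LLM prefill unchanged. StreamingTOM is a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments show $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory, and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods, with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of the two-stage approach for efficient streaming video understanding with bounded growth.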
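The Online Quantized Memory step (store in 4-bit format, retrieve relevant groups on demand, dequantize) can be illustrated with a small sketch. This is not the paper's method: the min/max uniform quantizer, the per-group `summary` vector, the dot-product relevance score, and the `QuantizedMemory` class are all hypothetical choices made here to show the general pattern.

```python
def quantize_4bit(values):
    """Uniform min/max quantization of floats to 4-bit codes, packed two per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up into whole bytes
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale, len(values)


def dequantize_4bit(packed, lo, scale, n):
    """Unpack 4-bit codes and map them back to approximate float values."""
    codes = []
    for b in packed:
        codes.extend((b >> 4, b & 0x0F))
    return [lo + c * scale for c in codes[:n]]


class QuantizedMemory:
    """Hypothetical store: each group keeps a float summary plus 4-bit payload."""

    def __init__(self):
        self.groups = []  # tuples of (summary, packed, lo, scale, n)

    def store(self, summary, values):
        self.groups.append((summary, *quantize_4bit(values)))

    def retrieve(self, query, top_k):
        """Dequantize only the top_k groups whose summary best matches the query."""
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        ranked = sorted(self.groups, key=lambda g: dot(query, g[0]), reverse=True)
        return [dequantize_4bit(*g[1:]) for g in ranked[:top_k]]


mem = QuantizedMemory()
mem.store([1.0, 0.0], [0.0, 0.5, 1.0, 1.5])
mem.store([0.0, 1.0], [-1.0, 0.0, 2.0])
out = mem.retrieve([0.9, 0.1], top_k=1)
print(out)
```

In this sketch the amount of dequantized data is capped by `top_k`, independent of how many groups have been stored, which mirrors the abstract's claim that the active kv-cache stays bounded as the stream grows.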

Related papers