Fugu-MT Paper Translation (Abstract): Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

Paper overview: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

arxiv url: http://arxiv.org/abs/2510.17364v1
Date: Mon, 20 Oct 2025 10:04:49 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.394034
Title: Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
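Title (reference translation): Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Authors: Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi
Abstract summary: This paper proposes a training-free method compatible with standard Video-LLMs. Attention-based selection makes it possible to discard up to ~95% of unimportant visual tokens with minimal performance loss. The method achieves state-of-the-art performance on streaming video benchmarks.
Reference score (independently computed attention score): 7.06290511446344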
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
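Abstract (reference translation): Video Large Language Models (Video-LLMs) are good at understanding videos in context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios, where hour-long videos must be processed online and questions require timely responses. This paper proposes a training-free approach, compatible with standard Video-LLMs, that builds on three key concepts: 1) LLM-informed selection of visual tokens, which identifies the tokens the LLM attended to and that contributed to its understanding of each short clip; this attention-based selection discards up to ~95% of unimportant visual tokens with minimal performance loss. 2) Recurrent processing of previously selected tokens, which produces a temporally coherent understanding of each processed clip. 3) Caption-based question answering for lightweight and accurate responses. The proposed method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.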
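
The three concepts in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how attention is aggregated, which layers or heads are used, or how the memory of past tokens is bounded. The sketch assumes one attention matrix per clip (rows are query positions, columns are that clip's visual tokens, already averaged over heads and layers), sums attention per visual token, and keeps the top fraction. The names `select_visual_tokens`, `stream_clips`, `process_clip`, and `keep_ratio` are hypothetical, and `process_clip` is a stand-in for a real Video-LLM call.

```python
from typing import Callable, List, Sequence, Tuple


def select_visual_tokens(
    attention: Sequence[Sequence[float]],
    keep_ratio: float = 0.05,
) -> List[int]:
    """Return indices of the visual tokens that received the most attention.

    attention[q][v] is the weight from query position q to visual token v.
    Assumption: weights are already averaged over heads and layers.
    """
    if not attention or not attention[0]:
        return []
    num_visual = len(attention[0])
    # Total attention each visual token received across all query positions.
    scores = [sum(row[v] for row in attention) for v in range(num_visual)]
    # Keep at least one token so later clips always see some context.
    num_keep = max(1, round(num_visual * keep_ratio))
    ranked = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)
    # Return the survivors in their original order.
    return sorted(ranked[:num_keep])


def stream_clips(
    clips: Sequence[Sequence[object]],
    process_clip: Callable[
        [List[object], Sequence[object]], Tuple[str, Sequence[Sequence[float]]]
    ],
    keep_ratio: float = 0.05,
) -> Tuple[List[str], List[object]]:
    """Process clips one at a time, carrying selected tokens forward.

    process_clip(memory, clip_tokens) is a hypothetical model call that
    returns a caption and the attention over the current clip's tokens.
    """
    memory: List[object] = []   # selected tokens from earlier clips
    captions: List[str] = []    # stored for caption-based question answering
    for clip_tokens in clips:
        caption, attention = process_clip(memory, clip_tokens)
        kept = select_visual_tokens(attention, keep_ratio)
        memory.extend(clip_tokens[i] for i in kept)
        captions.append(caption)
    return captions, memory
```

In this sketch a question would be answered from the stored `captions` rather than from the raw video, which is one way to read the caption-based question answering described in the abstract. With `keep_ratio=0.05`, roughly 95% of each clip's visual tokens are dropped before the next clip is processed.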
