Fugu-MT Paper Translation (Abstract): AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

Paper abstract: AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

arxiv url: http://arxiv.org/abs/2603.28696v1
Date: Mon, 30 Mar 2026 17:14:15 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.534309
Title: AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Title (reference translation): AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding
Authors: Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, Marc Pollefeys
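Abstract summary: AdaptToken is a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. It consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames). AdaptToken-Lite reduces inference time by about half with comparable performance.
Reference score (independently computed attention level): 81.07348307304547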
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
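Abstract (reference translation): Long video understanding remains difficult for Multi-modal Large Language Models (MLLMs) because of memory costs and context-length limits. Earlier approaches ease this by scoring and selecting frames or tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. This paper proposes AdaptToken, a training-free framework that converts an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits the video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's relevance to the prompt. This entropy signal enables allocation of a global token budget across groups and supports early stopping (AdaptToken-Lite), which skips the remaining groups once the model becomes sufficiently certain. On four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token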
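
The pipeline described in the abstract (per-group token ranking, entropy as a relevance signal, global budget allocation, and early stopping) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the allocation rule, the stopping criterion, or which direction of entropy counts as "relevant". The sketch assumes that lower response entropy means a more relevant group, that budget shares follow a softmax over negative entropy, and that early stopping triggers when entropy drops below a fixed threshold. All function names and parameters (`response_entropy`, `allocate_budget`, `select_tokens`, `groups_to_process`, `stop_threshold`) are hypothetical.

```python
import math
from typing import List, Sequence


def response_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the model's distribution over answers."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_budget(entropies: Sequence[float], total_budget: int) -> List[int]:
    """Split a global token budget across groups.

    Assumption: weights are a softmax over negative entropy, so groups
    where the model is more certain receive more tokens. The abstract
    does not specify the real rule.
    """
    weights = [math.exp(-h) for h in entropies]
    norm = sum(weights)
    shares = [int(total_budget * w / norm) for w in weights]
    # int() truncates, so hand the leftover tokens to the most certain group.
    best = min(range(len(entropies)), key=lambda i: entropies[i])
    shares[best] += total_budget - sum(shares)
    return shares


def select_tokens(attn_scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:k])


def groups_to_process(entropies: Sequence[float], stop_threshold: float) -> List[int]:
    """Early stopping in the spirit of AdaptToken-Lite (assumed criterion):
    visit groups in order and stop once entropy falls below the threshold."""
    visited = []
    for i, h in enumerate(entropies):
        visited.append(i)
        if h < stop_threshold:
            break
    return visited
```

In this sketch, a full run would compute `response_entropy` once per group, call `allocate_budget` to fix each group's share, and then call `select_tokens` with that share on the group's cross-modal attention scores; the Lite variant would instead consult `groups_to_process` to skip trailing groups.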
