Fugu-MT Paper Translation (Abstract): Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Paper summary: Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

arxiv url: http://arxiv.org/abs/2603.12254v1
Date: Thu, 12 Mar 2026 17:58:52 GMT
Status: Translation complete
System update date: 2026-03-13 14:46:26.291642
Title: Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
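Title (reference translation): Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin
Abstract summary: AutoGaze is a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. It autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold. It reduces visual tokens by 4x-100x, accelerates ViTs and MLLMs by up to 19x, and enables scaling MLLMs to 1K-frame 4K-resolution videos.
Reference score (independently computed attention score): 112.56180129013138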
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
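Abstract (reference translation): Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x, accelerates ViTs and MLLMs by up to 19x, enables scaling MLLMs to 1K-frame 4K-resolution videos, and achieves superior results on video benchmarks (67.0% on VideoMME). Furthermore, we introduce HLVid, the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, on which an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/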
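
The selection criterion in the abstract (keep the fewest patches that still reconstruct the video within an error threshold) can be illustrated with a toy sketch. This is not the AutoGaze model, which is learned with next-token prediction and reinforcement learning and works on multi-scale patches. The sketch below swaps in a simple greedy rule at a single patch scale, and the name `select_patches` and its parameters `patch` and `max_error` are hypothetical, introduced only for illustration.

```python
import numpy as np


def select_patches(prev_recon, frame, patch=4, max_error=1e-3):
    """Greedy stand-in for patch selection (an assumption, not the paper's method).

    Starting from the previous reconstruction, copy in patches of `frame`,
    largest error first, until the mean squared error is <= `max_error`.
    Returns the list of selected (row, col) patch indices and the reconstruction.
    """
    h, w = frame.shape
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly into patches"
    recon = prev_recon.astype(float)  # astype returns a copy, so prev_recon is untouched

    # Summed squared error of every patch, shape (h // patch, w // patch).
    err = ((recon - frame) ** 2).reshape(h // patch, patch, w // patch, patch).sum(axis=(1, 3))

    selected = []
    # Visit patches from largest to smallest error.
    for flat in np.argsort(err, axis=None)[::-1]:
        if err.sum() / frame.size <= max_error:
            break  # reconstruction is already within the threshold
        r, c = np.unravel_index(flat, err.shape)
        rows = slice(r * patch, (r + 1) * patch)
        cols = slice(c * patch, (c + 1) * patch)
        recon[rows, cols] = frame[rows, cols]  # "gaze" at this patch
        err[r, c] = 0.0                        # the patch is now reconstructed exactly
        selected.append((int(r), int(c)))
    return selected, recon


# Usage: only the top-left patch changes between two 8x8 frames.
previous = np.zeros((8, 8))
current = np.zeros((8, 8))
current[:4, :4] = 1.0
chosen, reconstruction = select_patches(previous, current)
print(len(chosen), "of", (8 // 4) ** 2, "patches selected")
```

In this toy setting a static region costs no patches at all, which is the intuition behind the token reduction the abstract reports. The actual ratios (4x-100x) come from the paper, not from this sketch.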

Related papers