Fugu-MT Paper Translation (Abstract): GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

Paper overview: GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

arxiv url: http://arxiv.org/abs/2605.07817v1
Date: Fri, 08 May 2026 14:49:10 GMT
Status: Translation complete
System update date: 2026-05-11 19:43:39.130149
Title: GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Authors: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti
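Abstract summary: This paper proposes GazeVLM, a multimodal architecture that internalizes metacognitive oversight of its own attention resources. By empowering the VLM to autonomously generate gaze tokens, GazeVLM establishes a top-down control mechanism over its own causal attention mask. This architecture lets the model transition fluidly between global spatial awareness and localized focal reasoning without relying on external agentic contraptions.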
Reference score (independently computed attention metric): 12.050224516337098
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{<LOOK>}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
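The suppression bias described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`gaze_bias`, `attend`), the box format, the distance-based penalty, and the `strength` value are all assumptions made for illustration. The sketch only shows the general idea of adding a continuous negative bias to the attention logits of visual patches outside a focal region, then dropping the bias to restore the global view.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaze_bias(patch_centers, box, strength=8.0):
    """Hypothetical continuous penalty: 0 inside the box, more negative with distance outside it."""
    x0, y0, x1, y1 = box
    biases = []
    for cx, cy in patch_centers:
        dx = max(x0 - cx, 0.0, cx - x1)  # horizontal distance to the box
        dy = max(y0 - cy, 0.0, cy - y1)  # vertical distance to the box
        biases.append(-strength * math.hypot(dx, dy))
    return biases

def attend(logits, patch_centers, box=None):
    """Attention weights over visual patches; box=None means no gaze is active (global view)."""
    if box is None:
        return softmax(logits)
    bias = gaze_bias(patch_centers, box)
    return softmax([l + b for l, b in zip(logits, bias)])

# 2x2 grid of patch centers in normalized image coordinates (assumed layout)
centers = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
logits = [0.0, 0.0, 0.0, 0.0]

global_view = attend(logits, centers)                        # uniform attention
focused = attend(logits, centers, box=(0.0, 0.0, 0.5, 0.5))  # gaze on the top-left quadrant
restored = attend(logits, centers)                           # bias lifted again
```

With the gaze box active, most of the attention mass moves to the patch inside the box while the other patches remain weakly visible. Calling `attend` without a box returns the original distribution, and no visual tokens are added to the context in either case.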
