Fugu-MT Paper Translation (Abstract): Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models

Paper abstract: Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models

arxiv url: http://arxiv.org/abs/2512.05546v1
Date: Fri, 05 Dec 2025 09:07:55 GMT
Status: Translation complete
Last updated in system: 2025-12-13 22:40:56.975242
Title: Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models
Authors: Weijue Bu, Guan Yuan, Guixian Zhang
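Abstract summary: The paper proposes a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy. A Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before it collapses into text priors.
Reference score (independently computed attention score): 2.5597374953396126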
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
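
The abstract does not say how the Cognitive Demand Sensor computes its signal, so the following is only a minimal sketch of the general idea of a Harsanyi interaction (Harsanyi dividend) between two input groups, "vision" and "text". The value table `toy_scores`, the group names, and the gating threshold are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def harsanyi_dividend(value, coalition):
    """Harsanyi dividend I(S) = sum over T subset of S of (-1)^(|S|-|T|) * v(T)."""
    members = tuple(coalition)
    total = 0.0
    for size in range(len(members) + 1):
        # Subsets alternate in sign according to how many members they omit.
        sign = (-1) ** (len(members) - size)
        for subset in combinations(members, size):
            total += sign * value(frozenset(subset))
    return total


# Hypothetical scores: e.g. a model's log-probability for the next token
# when only the listed input groups are left unmasked (made-up numbers).
toy_scores = {
    frozenset(): -5.0,
    frozenset({"vision"}): -4.0,
    frozenset({"text"}): -3.0,
    frozenset({"vision", "text"}): -0.5,
}

# Joint effect of vision and text beyond what each contributes on its own.
synergy = harsanyi_dividend(toy_scores.__getitem__, ("vision", "text"))

# Hypothetical gate: treat high synergy as a cue that visual grounding matters.
SYNERGY_THRESHOLD = 1.0  # assumed value, for illustration only
needs_visual_grounding = synergy > SYNERGY_THRESHOLD
```

For two groups the dividend reduces to v({vision, text}) - v({vision}) - v({text}) + v(∅), so a positive value means the two inputs contribute more together than separately. How the actual method defines the value function and where it applies the signal is not stated in the abstract.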

Related papers