Fugu-MT Paper Translation (Summary): Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Paper summary: Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

arxiv url: http://arxiv.org/abs/2510.22067v1
Date: Fri, 24 Oct 2025 23:04:26 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:14.80921
Title: Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
Title (reference translation): Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
Authors: Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas,
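Abstract summary: Vision language models (VLMs) often generate hallucinations, i.e., content that cannot be substantiated by the visual input. This paper proposes a method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT).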
Reference score (independently computed attention measure): 8.805397340243557
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
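Abstract (reference translation): Vision language models (VLMs) often generate hallucinations, i.e., content that is not supported by either the textual or the visual input. Prior work attributes this mainly to over-reliance on linguistic prior knowledge rather than on visual input. Some methods try to mitigate hallucination by amplifying the attention on visual tokens in proportion to their attention scores. However, these methods overlook the visual attention sink problem, in which attention is often misallocated to task-irrelevant visual regions, and they neglect the balance of cross-modal fusion because they increase only visual attention without adjusting attention to the user query. As a result, they can amplify incorrect regions while failing to interpret the user query properly. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and uses this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of the visual attention sink, since irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for a well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs on both generative and classification tasks, improving on greedy decoding by up to 20.7% while maintaining general vision-language performance with low computational overhead.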
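
The abstract describes two steps: building a saliency map from positive changes in visual attention ("gaze shifts") while the model reads the user query, and then using that map to amplify attention to both salient visual tokens and the query during decoding. The following is only a rough NumPy sketch of that idea, not the paper's implementation. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the function names `gaze_shift_saliency` and `amplify`, the use of differences between consecutive query tokens, the scaling factor `alpha`, and the rule that boosts query tokens by the mean visual boost.

```python
import numpy as np


def gaze_shift_saliency(attn):
    """Accumulate positive attention changes into a saliency map.

    attn: array of shape (num_query_tokens, num_visual_tokens), the attention
    each user-query token pays to each visual token (assumed layout).
    Returns a vector over visual tokens that sums to 1.
    """
    shifts = np.diff(attn, axis=0)          # change between consecutive query tokens
    positive = np.clip(shifts, 0.0, None)   # keep increases only ("gaze shifts")
    saliency = positive.sum(axis=0)
    total = saliency.sum()
    if total <= 0:
        # No positive shift anywhere: fall back to a uniform map.
        return np.full(attn.shape[1], 1.0 / attn.shape[1])
    return saliency / total


def amplify(weights, visual_idx, query_idx, saliency, alpha=1.0):
    """Reweight one decoding step's attention distribution.

    weights: 1-D attention distribution over all context tokens.
    visual_idx / query_idx: positions of visual and user-query tokens.
    alpha: hypothetical amplification strength (not from the paper).
    """
    out = np.asarray(weights, dtype=float).copy()
    out[visual_idx] *= 1.0 + alpha * saliency
    # Assumed balancing rule: boost query tokens by the mean visual boost.
    out[query_idx] *= 1.0 + alpha * saliency.mean()
    return out / out.sum()


# Toy example: visual token 0 acts as an attention sink (high but static),
# while visual token 2 gains attention as the query is read.
attn = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.15, 0.05],
    [0.60, 0.05, 0.30, 0.05],
])
saliency = gaze_shift_saliency(attn)

step_weights = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
reweighted = amplify(step_weights, [0, 1, 2, 3], [4, 5], saliency)
print(saliency)
print(reweighted)
```

In this toy case the sink token receives no saliency because its attention never increases, so after renormalization its share of attention shrinks while the shares of the shifting visual token and the query tokens grow. Amplifying in proportion to raw attention scores would instead have favored the sink token.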

Related papers