Fugu-MT Paper Translation (Abstract): Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

Paper overview: Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

arxiv url: http://arxiv.org/abs/2605.20965v1
Date: Wed, 20 May 2026 09:50:13 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.609631
Title: Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
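Title (reference translation): Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy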
Authors: Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia
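Abstract summary: The authors find that Large Vision-Language Models (LVLMs) tend to hallucinate when they pay insufficient attention to the correct visual evidence. They propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). The method is training-free and plug-and-play.
Reference score (independently computed attention metric): 42.615995536459224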
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
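Abstract (reference translation): Large Vision-Language Models (LVLMs) show remarkable performance across a wide range of vision-language tasks. Despite this progress, they remain prone to hallucination, producing responses that are inconsistent with the visual content. This work shows that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during generation. Experiments show that although LVLMs as a whole attend insufficiently to visual evidence, specific layers are sensitive to the correct visual evidence, with a notable discrepancy between layers. Based on this observation, the authors propose a new hallucination mitigation method that enhances visual evidence using Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, the method obtains the attention weights from early generated tokens to visual tokens across layers, identifies the tokens that are repeatedly activated as visual evidence, and forms a saliency map. The saliency map is then used to strengthen attention to visual evidence during generation, reducing visual forgetting. The saliency map is also used to compute attention scores from generated text to visual evidence, so that text tokens strongly grounded in visual evidence can be selected and emphasized. The method is training-free and plug-and-play. Evaluations on multiple benchmarks with five recently released models show that the method consistently mitigates hallucinations in different LVLMs across various architectures. Code is available at https://github.com/ytx-ML/ILVAD.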
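
The saliency-map step described in the abstract can be illustrated with a minimal sketch. This is not taken from the ILVAD repository: the function names, the per-layer top-k rule for deciding which visual tokens count as "activated", and the parameters `top_k`, `min_layers`, and `alpha` are all assumptions made for illustration. The paper may define activation, repetition, and enhancement differently.

```python
import numpy as np

def saliency_from_layers(attn, top_k, min_layers):
    """Hypothetical sketch of building a saliency map over visual tokens.

    attn: array of shape (num_layers, num_early_tokens, num_visual_tokens)
          holding attention weights from early generated tokens to visual tokens.
    Returns a boolean mask over visual tokens.
    """
    # Average over the early generated tokens: (num_layers, num_visual_tokens).
    per_layer = attn.mean(axis=1)
    # Assumption: a visual token is "activated" in a layer if it is among
    # the top_k most attended visual tokens of that layer.
    top_idx = np.argsort(per_layer, axis=1)[:, -top_k:]
    activated = np.zeros(per_layer.shape, dtype=bool)
    np.put_along_axis(activated, top_idx, True, axis=1)
    # Assumption: "repeatedly activated" means activated in at least
    # min_layers layers.
    return activated.sum(axis=0) >= min_layers

def boost_visual_evidence(attn_row, saliency, alpha):
    """Hypothetical enhancement: scale attention on salient visual tokens
    by (1 + alpha), then rescale so the row keeps its original total mass."""
    boosted = np.where(saliency, attn_row * (1.0 + alpha), attn_row)
    return boosted * (attn_row.sum() / boosted.sum())

# Toy input: 3 layers, 2 early generated tokens, 4 visual tokens.
demo_attn = np.array([
    [[0.10, 0.60, 0.20, 0.10], [0.10, 0.50, 0.30, 0.10]],
    [[0.40, 0.40, 0.10, 0.10], [0.30, 0.50, 0.10, 0.10]],
    [[0.05, 0.50, 0.05, 0.40], [0.05, 0.60, 0.05, 0.30]],
])
demo_mask = saliency_from_layers(demo_attn, top_k=2, min_layers=2)
demo_row = boost_visual_evidence(np.full(4, 0.25), demo_mask, alpha=1.0)
```

In the toy input only the second visual token ranks in the top two in more than one layer, so it alone is marked as visual evidence and receives the extra attention mass. In a real model the attention tensors would come from the LVLM's decoder layers, and the boost would be applied inside the attention computation during generation.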

Related papers