Fugu-MT Paper Translation (Summary): Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Paper summary: Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.16558v1
Date: Tue, 17 Mar 2026 14:19:22 GMT
Status: translation complete
Last updated in system: 2026-03-18 17:42:07.324828
Title: Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models
Authors: Jiale Song, Jiaxin Luo, Xue-song Tang, Kuangrong Hao, Mingbo Zhao
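Abstract summary: Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. This work proposes Segmentation-based Attention Entropy (SAE), which uses semantic segmentation to quantify visual attention uncertainty in an object-level semantic space.
Reference score (independently calculated attention level): 9.388076929154673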
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
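The abstract does not state how SAE is computed. As a rough illustration only, one plausible reading of "attention uncertainty in an object-level semantic space" is to sum the visual attention that falls inside each segment and take the Shannon entropy of the resulting distribution. The function name `segment_attention_entropy`, its arguments, and the formulation itself are assumptions made for this sketch, not the authors' definition.

```python
import numpy as np


def segment_attention_entropy(attn, seg):
    """Shannon entropy of attention mass aggregated per segment.

    Hypothetical formulation for illustration; the paper's actual SAE
    definition is not given in the abstract.

    attn: (H, W) array of non-negative attention weights over image patches.
    seg:  (H, W) array of non-negative integer segment labels, one per patch,
          as produced by any semantic segmentation model.
    """
    attn = np.asarray(attn, dtype=np.float64)
    seg = np.asarray(seg, dtype=np.int64)
    if attn.shape != seg.shape:
        raise ValueError("attn and seg must have the same shape")
    if (attn < 0).any():
        raise ValueError("attention weights must be non-negative")
    total = attn.sum()
    if total <= 0:
        raise ValueError("attention map has no mass")

    # Sum the attention that falls inside each segment label.
    mass = np.bincount(seg.ravel(), weights=attn.ravel())
    # Normalize to a probability distribution over segments.
    p = mass / total
    # Drop empty segments so that 0 * log(0) is treated as 0.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


if __name__ == "__main__":
    seg = np.array([[0, 0], [1, 1]])
    focused = np.array([[1.0, 1.0], [0.0, 0.0]])  # all mass on segment 0
    spread = np.ones((2, 2))                      # equal mass on both segments
    print(segment_attention_entropy(focused, seg))
    print(segment_attention_entropy(spread, seg))
```

Under this assumed formulation, attention concentrated on a single segment gives an entropy of zero, while attention spread evenly over K segments gives log K. How such a value would feed into the reliability score or the inference-time attention adjustment mentioned in the abstract is not specified there and is left out of the sketch.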

Related papers