Fugu-MT paper translation (abstract): Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

Paper summary: Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

arxiv url: http://arxiv.org/abs/2605.11808v1
Date: Tue, 12 May 2026 09:03:19 GMT
Status: translation complete
Last updated in system: 2026-05-13 21:48:56.73842
Title: Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
Title (reference translation): Mitigating action-relation hallucinations in LVLMs through relation-aware visual enhancement
Authors: Zhenxin Qin, Qiang Li, Qingzhuo Wang, Ruiyang Qin, Zhihua Wei, Wen Shen
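Abstract summary: Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. LVLMs still suffer from hallucinations, generating text that contradicts the visual input. This paper proposes a framework that locates action-relevant image regions and increases the LVLM's attention to them.
Reference score (independently computed interest level): 8.424418427339337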
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
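Abstract (reference translation): Large Vision-Language Models (LVLMs) achieve remarkable performance across a variety of vision-language tasks. However, LVLMs still suffer from hallucinations, producing text that contradicts the visual input. Existing work has focused mainly on mitigating object hallucinations and often overlooks more complex relation hallucinations, in particular action relations that involve interactions between objects. In this study, we observe empirically that the main cause of action-relation hallucinations in LVLMs is insufficient attention allocated to visual information. We therefore propose a framework that locates action-relevant image regions and increases the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify the attention heads most sensitive to action-relation changes, and thereby localize the action-relevant image regions that contain key visual cues. We then propose the Relation-aware Visual Enhancement (RVE) method to increase the LVLM's attention to these action-relevant image regions. Experiments show that, compared with existing baselines, our method performs better at mitigating action-relation hallucinations with negligible additional inference cost. It also generalizes effectively to spatial-relation hallucinations and object hallucinations.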
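
The abstract describes a two-step idea: score attention heads by how strongly they react to a change in the action relation, then raise attention on the image regions those heads point to. The abstract does not give the ARS formula or the RVE update rule, so the toy sketch below is only an illustration under loudly stated assumptions: sensitivity is taken to be the mean absolute change in a head's image-token attention between an original and an action-altered input, and enhancement is taken to be an additive boost to pre-softmax logits. All function names and parameters (`sensitivity_scores`, `top_k_heads`, `relevant_tokens`, `enhance`, `keep_ratio`, `alpha`) are hypothetical and are not taken from the paper.

```python
import math


def sensitivity_scores(attn_orig, attn_alt):
    """Assumed stand-in for a per-head sensitivity score (not the paper's ARS).

    attn_orig / attn_alt: one list per head, each holding that head's attention
    weights over image tokens for the original and the action-altered input.
    Returns the mean absolute attention change per head.
    """
    return [
        sum(abs(a - b) for a, b in zip(head_o, head_a)) / len(head_o)
        for head_o, head_a in zip(attn_orig, attn_alt)
    ]


def top_k_heads(scores, k):
    """Indices of the k heads with the largest assumed sensitivity."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]


def relevant_tokens(attn_orig, heads, keep_ratio=0.5):
    """Average the selected heads' attention and keep the most-attended tokens."""
    n_tokens = len(attn_orig[0])
    avg = [sum(attn_orig[h][t] for h in heads) / len(heads) for t in range(n_tokens)]
    n_keep = max(1, int(n_tokens * keep_ratio))
    ranked = sorted(range(n_tokens), key=lambda t: avg[t], reverse=True)
    return set(ranked[:n_keep])


def enhance(logits, tokens, alpha=1.0):
    """Assumed enhancement: add alpha to the selected logits, then softmax."""
    boosted = [x + alpha if t in tokens else x for t, x in enumerate(logits)]
    peak = max(boosted)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in boosted]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch the boost is applied before the softmax, so the resulting weights still sum to one and attention is shifted toward the selected tokens rather than simply added. A real LVLM would apply such a change inside specific attention layers, which the abstract does not specify.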

Related papers