Fugu-MT Paper Translation (Summary): IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models

Paper summary: IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2508.03469v1
Date: Tue, 05 Aug 2025 14:05:15 GMT
Status: Translation complete
Last updated in system: 2025-08-06 18:18:55.998182
Title: IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models
Authors: Jiabing Yang, Chenhang Cui, Yiyang Zhou, Yixiang Chen, Peng Xia, Ying Wei, Tao Yu, Yan Huang, Liang Wang
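Abstract summary: This paper shows that Large Vision-Language Models (LVLMs) exhibit a long-term bias in which hallucinations increase as the sequence length grows. The authors propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy that generates more image-focused sequences.
Reference score (independently computed attention score): 20.036659182106806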
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to "hallucinations", outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model's attention to visual input diminishes as the generated sequence grows, which we hypothesize to be a key factor contributing to observed increasing hallucinations. Based on these insights, we propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding, effectively mitigating attention degradation and suppressing hallucinations while not incurring too much inference cost. Extensive experiments on both hallucination and comprehensive benchmarks demonstrate IKOD's superior effectiveness in mitigating hallucinations and improving comprehensive capacities for LVLMs. Importantly, IKOD requires no additional training or external tools, making it a lightweight and efficient framework applicable to various models.
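The abstract describes two ideas: measuring how much attention goes to image tokens, and blending logits from an image-focused branch with the original ones. Both can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The abstract does not state the combination formula, so the linear blend and its weight `alpha` are assumptions, as are the function names `image_attention_share` and `combine_logits`. The key-value merging step that produces the shorter sequence is not shown.

```python
import math


def image_attention_share(attn_row, image_positions):
    """Fraction of one query token's attention mass that falls on image tokens.

    attn_row: attention weights of a single generated token over all context
    positions. image_positions: indices of the image tokens (assumed known).
    """
    total = sum(attn_row)
    on_image = sum(attn_row[i] for i in image_positions)
    return on_image / total if total > 0 else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def combine_logits(original, image_focused, alpha=0.5):
    """Hypothetical linear blend of two logit vectors (assumed form).

    original: logits from standard decoding over the full sequence.
    image_focused: logits from the shorter, more image-attentive sequence.
    alpha: assumed weight on the image-focused branch.
    """
    if len(original) != len(image_focused):
        raise ValueError("logit vectors must have the same vocabulary size")
    return [o + alpha * f for o, f in zip(original, image_focused)]


# Toy example: 3 of 4 context positions are text, attention drifts to text.
share = image_attention_share([0.5, 0.25, 0.25], image_positions=[0, 1])

# Toy vocabulary of 3 tokens; the image-focused branch prefers token 1.
combined = combine_logits([2.0, 1.0, 0.0], [0.0, 3.0, 0.0], alpha=1.0)
probs = softmax(combined)
next_token = probs.index(max(probs))
```

In this toy case the original logits favor token 0, while the blended logits favor token 1, which shows how an image-focused branch could shift the next-token choice. Tracking `image_attention_share` across generation steps is one way to observe the attention decline the abstract reports.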

