Fugu-MT Paper Translation (Abstract): Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Paper summary: Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

arxiv url: http://arxiv.org/abs/2605.21300v1
Date: Wed, 20 May 2026 15:29:20 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.752042
Title: Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
Title (reference translation): Reducing Object Hallucination in LVLMs by Emphasizing Image-negative Tokens
Authors: Meng Shen, Minghao Wu, Deepu Rajan
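Abstract summary: The authors examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative. Their analysis reveals that most generated tokens are minimally influenced by the image information. They propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination.
Reference score (independently computed attention score): 19.11092776427327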
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.
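Abstract (reference translation): Object hallucination is a significant challenge that hinders the practical application of large vision-language models (LVLMs). We hypothesize that one origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. We therefore examine the generation process and classify text tokens into three groups (image-positive, invariant, and negative) based on their visual dependence on the input image tokens. The analysis shows that most generated tokens are only minimally influenced by the image information. This suggests that the training stage places more emphasis on learning to follow textual instructions than on extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens according to their visual dependence. In addition, as a data filtering strategy, we remove a portion of the training data that potentially contains more hallucinations. Both methods reduce hallucination without compromising response length or adding computational cost during inference. We validate the methods across three LVLM variants, demonstrating their effectiveness and general applicability.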
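The token-reweighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how visual dependence is measured, so the sketch assumes it is the change in a token's log-probability when the image is present versus absent. The threshold `TAU`, the values in `WEIGHTS`, and the function names `categorize` and `weighted_nll` are all hypothetical.

```python
import math

# Hypothetical threshold and per-group weights, chosen only for illustration.
TAU = 0.5
WEIGHTS = {"positive": 1.0, "invariant": 1.0, "negative": 2.0}


def categorize(logp_with_image, logp_without_image, tau=TAU):
    """Label a token by how much the image changes its log-probability.

    Assumption: visual dependence = log p(token | image, text) - log p(token | text).
    """
    delta = logp_with_image - logp_without_image
    if delta > tau:
        return "positive"   # the image makes the token more likely
    if delta < -tau:
        return "negative"   # the image makes the token less likely
    return "invariant"      # the image barely changes the token's probability


def weighted_nll(logps_with, logps_without, weights=WEIGHTS, tau=TAU):
    """Weighted mean negative log-likelihood over the target tokens."""
    total, norm = 0.0, 0.0
    for lw, lo in zip(logps_with, logps_without):
        w = weights[categorize(lw, lo, tau)]
        total += w * (-lw)  # the usual NLL term, scaled by the group weight
        norm += w
    return total / norm


# Toy per-token log-probabilities with and without the image.
logps_with = [math.log(0.9), math.log(0.5), math.log(0.1)]
logps_without = [math.log(0.3), math.log(0.5), math.log(0.4)]
labels = [categorize(a, b) for a, b in zip(logps_with, logps_without)]
loss = weighted_nll(logps_with, logps_without)
```

Giving every group the same weight reduces `weighted_nll` to the ordinary mean negative log-likelihood, so the reweighting changes only the training objective and adds no cost at inference, in line with the abstract's claim.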


Related papers