Fugu-MT Paper Translation (Abstract): Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

Paper abstract: Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

arxiv url: http://arxiv.org/abs/2605.01766v1
Date: Sun, 03 May 2026 07:58:02 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:49.928509
Title: Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
Title (reference translation): Mitigating Hallucinations in Multimodal LLMs via Relevance Propagation at Inference Time
Authors: Itai Allouche, Joseph Keshet
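Abstract summary: Multimodal large language models (MLLMs) have revolutionized the AI landscape. These models often suffer from hallucinations, generating outputs that diverge from the provided perceptual inputs. To promote multimodal grounding, we propose Learning Inference-time Modality Enhancement (LIME).
Reference score (site-computed attention score): 9.870369982132678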
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have revolutionized the landscape of AI, demonstrating impressive capabilities in tackling complex vision and audio-language tasks. However, a critical challenge remains: these models often suffer from hallucinations, generating outputs that diverge from the provided perceptual inputs. This tendency stems from an inherent imbalance in modality utilization during inference, where the dominance of textual tokens undermines the potential of perceptual inputs. As a result, the model frequently resorts to textual language priors at the expense of grounded evidence. To tackle this issue, we propose Learning Inference-time Modality Enhancement (LIME), a training-free framework designed to bolster multimodal grounding by explicitly enhancing modality usage during decoding. LIME leverages Layer-wise Relevance Propagation (LRP) to quantify token-level contributions and defines a relevance-based objective that promotes increased reliance on perceptual inputs. This objective is enforced through inference-time updates to the model's key-value representations, without modifying model parameters or requiring additional training data. We evaluate LIME across multiple multimodal benchmarks in both vision and audio domains, demonstrating consistent reductions in hallucinations and enhanced grounding while preserving generation quality. Further analysis shows that LIME increases modality contribution and produces more localized and semantically aligned relevance patterns.
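Abstract (reference translation): Multimodal large language models (MLLMs) have revolutionized the AI landscape, showing impressive capabilities on complex vision-language and audio-language tasks. However, an important challenge remains: these models often suffer from hallucinations, generating outputs that diverge from the provided perceptual inputs. This tendency stems from an inherent imbalance in modality utilization during inference, in which the dominance of text tokens undermines the potential of the perceptual inputs. As a result, the model often falls back on textual language priors at the expense of grounded evidence. To address this problem, we propose Learning Inference-time Modality Enhancement (LIME), a training-free framework that promotes multimodal grounding by explicitly increasing modality usage during decoding. LIME uses Layer-wise Relevance Propagation (LRP) to quantify token-level contributions and defines a relevance-based objective that encourages greater reliance on perceptual inputs. This objective is enforced by updating the model's key-value representations at inference time, without modifying model parameters or requiring additional training data. We evaluate LIME on multiple multimodal benchmarks in the vision and audio domains and demonstrate consistent reductions in hallucinations and improved grounding while preserving generation quality. Further analysis shows that LIME increases the modality contribution and produces more localized and semantically aligned relevance patterns.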
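
Illustrative sketch (not from the paper): the abstract describes an objective that rewards reliance on perceptual tokens and is enforced by editing cached key-value representations rather than model weights. The toy Python example below shows that general idea under strong assumptions. It uses a single attention query, takes the attention mass on perceptual tokens as a simple stand-in for an LRP relevance score, and applies plain gradient ascent on log(share) to the cached keys only. The names `enhance_keys` and `perceptual_share`, the step count, and the learning rate are hypothetical; the paper's actual objective and update rule are not given in the abstract.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product attention weights for one query over all cached keys.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def perceptual_share(weights, perceptual_idx):
    # ASSUMPTION: attention mass stands in for an LRP relevance score.
    return sum(weights[i] for i in perceptual_idx)

def enhance_keys(query, keys, perceptual_idx, steps=20, lr=0.5):
    """Gradient ascent on log(share) with respect to the cached keys.

    Only a copy of the cache is edited; the query (standing in for the
    frozen model) is never changed.
    """
    scale = math.sqrt(len(query))
    perceptual = set(perceptual_idx)
    keys = [list(k) for k in keys]  # work on a copy of the cache
    for _ in range(steps):
        w = attention_weights(query, keys)
        share = perceptual_share(w, perceptual)
        for j, key in enumerate(keys):
            # d log(share) / d score_j = w_j * (1[j is perceptual] - share) / share
            indicator = 1.0 if j in perceptual else 0.0
            g = w[j] * (indicator - share) / share
            # d score_j / d key_j = query / scale
            for d in range(len(key)):
                key[d] += lr * g * query[d] / scale
    return keys

# Tokens 0 and 1 play the role of perceptual (image or audio) tokens,
# tokens 2 and 3 the role of text tokens that initially dominate.
query = [1.0, 0.0]
keys = [[0.2, 0.1], [0.1, -0.3], [1.5, 0.0], [1.2, 0.4]]
perceptual_idx = [0, 1]

before = perceptual_share(attention_weights(query, keys), perceptual_idx)
new_keys = enhance_keys(query, keys, perceptual_idx)
after = perceptual_share(attention_weights(query, new_keys), perceptual_idx)
print(f"perceptual share: {before:.3f} -> {after:.3f}")
```

In this toy setting the share of attention on the perceptual tokens rises while the query and the original key list are left unchanged, mirroring the "no parameter updates" property the abstract describes. A real implementation would operate on a transformer's per-layer key-value cache and use an actual relevance propagation method.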

Related papers