Fugu-MT Paper Translation (Summary): Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

Paper summary: Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

arxiv url: http://arxiv.org/abs/2601.06224v2
Date: Tue, 13 Jan 2026 07:12:55 GMT
Status: Translation complete
Last updated in system: 2026-01-14 14:06:39.258959
Title: Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization
Title (reference translation): Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization
Authors: Miao Pan, Wangjie Gan, Jintao Chen, Wenqi Zhang, Bing Sun, Jianwei Yin, Xuhong Zhang
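Abstract summary: The paper systematically analyzes the root causes of hallucination in Multimodal Large Language Models (MLLMs). It identifies three factors: (1) over-reliance on chained visual reasoning, in which inaccurate initial descriptions anchor subsequent inference to incorrect premises; (2) insufficient exploration diversity during policy optimization, which leads to overconfident but erroneous outputs; and (3) destructive conflicts between training samples, in which NTK similarity causes false associations and unstable parameter updates. Experimental results show that the proposed method significantly reduces hallucination rates and effectively improves the inference accuracy of MLLMs.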
Reference score (independently computed attention metric): 38.469173375694076
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse tasks, their practical deployment is severely hindered by hallucination issues, which become particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucinations in MLLMs under RL training, identifying three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions or redundant information anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a comprehensive framework comprising three core modules. First, we enhance visual localization by introducing dedicated planning and captioning stages before the reasoning phase, employing a quality-based caption reward to ensure accurate initial anchoring. Second, to improve exploration, we categorize samples based on the mean and variance of their reward distributions, prioritizing samples with high variance to focus the model on diverse and informative data. Finally, to mitigate sample interference, we regulate NTK similarity by grouping sample pairs and applying an InfoNCE loss to push overly similar pairs apart and pull dissimilar ones closer, thereby guiding gradient interactions toward a balanced range. Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
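The second module in the abstract categorizes samples by the mean and variance of their reward distributions and prioritizes high-variance samples. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the category labels, and both thresholds are assumptions made for illustration, and each sample is assumed to have a list of scalar rewards from several rollouts.

```python
from statistics import fmean, pvariance


def categorize(rewards, mean_threshold=0.5, var_threshold=0.05):
    """Label one sample from the rewards of its rollouts.

    Thresholds and label names are hypothetical; the abstract only says
    that samples are categorized by reward mean and variance.
    """
    mean = fmean(rewards)
    var = pvariance(rewards)
    if var >= var_threshold:
        # Rollouts disagree, so the sample is treated as informative.
        return "high_variance"
    if mean >= mean_threshold:
        return "low_variance_high_mean"
    return "low_variance_low_mean"


def prioritize(reward_groups):
    """Return sample ids ordered from highest to lowest reward variance."""
    return sorted(
        reward_groups,
        key=lambda sample_id: pvariance(reward_groups[sample_id]),
        reverse=True,
    )


# Hypothetical rewards for three samples, four rollouts each.
reward_groups = {
    "a": [1, 1, 1, 1],  # always rewarded
    "b": [0, 1, 0, 1],  # rollouts disagree
    "c": [0, 0, 0, 0],  # never rewarded
}
labels = {sid: categorize(r) for sid, r in reward_groups.items()}
order = prioritize(reward_groups)
```

In this sketch, sample "b" is the only one whose rollouts disagree, so it is labeled high-variance and placed first in the ordering. How the paper actually weights or filters the categories is not stated in the abstract.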
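Abstract (reference translation): Multimodal Large Language Models (MLLMs) have achieved remarkable success across a variety of tasks, but their practical deployment is severely hindered by hallucination, which becomes particularly acute during Reinforcement Learning (RL) optimization. This paper systematically analyzes the root causes of hallucination in MLLMs under RL training and identifies three key factors: (1) over-reliance on chained visual reasoning, in which inaccurate initial descriptions or redundant information anchor subsequent inference to incorrect premises; (2) insufficient exploration diversity during policy optimization, which leads the model to generate overconfident but erroneous outputs; and (3) destructive conflicts between training samples, in which Neural Tangent Kernel (NTK) similarity causes false associations and unstable parameter updates. To address these challenges, we propose a comprehensive framework consisting of three core modules. First, we strengthen visual localization by introducing dedicated planning and captioning stages before the reasoning phase and using a quality-based caption reward to ensure accurate initial anchoring. Second, we categorize samples by the mean and variance of their reward distributions and prioritize high-variance samples, focusing the model on diverse and informative data. Finally, to mitigate interference between samples, we regulate NTK similarity by grouping sample pairs and applying an InfoNCE loss that pushes overly similar pairs apart and pulls dissimilar pairs closer, guiding gradient interactions toward a balanced range. Experimental results show that the proposed method significantly reduces hallucination rates and effectively improves the inference accuracy of MLLMs.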
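The third module regulates NTK similarity by grouping sample pairs and applying an InfoNCE loss. The sketch below shows one way such a regularizer could be computed, under loud assumptions: the cosine similarity of per-sample gradient vectors stands in for normalized NTK similarity, the `low` and `high` band limits and the `temperature` are hypothetical, and treating dissimilar pairs as positives and overly similar pairs as negatives is one reading of the abstract, not a confirmed detail of the paper. It only evaluates the loss value; a real training setup would compute it inside an autodiff framework.

```python
import math


def cosine(u, v):
    """Cosine similarity of two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_pairs(grads, low=0.2, high=0.8):
    """Group sample pairs whose similarity falls outside [low, high].

    The band limits are assumptions; pairs inside the band are left alone.
    """
    too_similar, too_dissimilar = [], []
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            s = cosine(grads[i], grads[j])
            if s > high:
                too_similar.append((i, j, s))
            elif s < low:
                too_dissimilar.append((i, j, s))
    return too_similar, too_dissimilar


def infonce_regularizer(too_similar, too_dissimilar, temperature=0.1):
    """InfoNCE-style loss over pair similarities.

    Dissimilar pairs play the role of positives (lowering the loss raises
    their similarity); overly similar pairs play the role of negatives
    (lowering the loss reduces their similarity).
    """
    if not too_similar or not too_dissimilar:
        return 0.0
    pos = sum(math.exp(s / temperature) for _, _, s in too_dissimilar)
    neg = sum(math.exp(s / temperature) for _, _, s in too_similar)
    return -math.log(pos / (pos + neg))


# Hypothetical per-sample gradients (flattened to short vectors).
grads = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
too_similar, too_dissimilar = group_pairs(grads)
loss = infonce_regularizer(too_similar, too_dissimilar)
```

With these toy gradients, samples 0 and 1 are nearly parallel and fall into the overly similar group, while both pairs involving sample 2 fall into the dissimilar group, so the loss is positive and would shrink as the similarities move toward the band.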


Related papers