Fugu-MT Paper Translation (Abstract): Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

Paper overview: Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

arxiv url: http://arxiv.org/abs/2603.27201v1
Date: Sat, 28 Mar 2026 08:56:19 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.845238
Title: Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
Authors: Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
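Abstract summary: Multimodal Chain-of-Thought (MCoT) models show impressive capability in complex visual reasoning tasks. Recent studies have found that they suffer from severe hallucination problems due to diminished visual attention during the generation process. This paper proposes a strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations.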
Reference score (independently computed attention measure): 40.739279930631334
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is already a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucinations? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated text is primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods to further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.

Related papers