Fugu-MT Paper Translation (Abstract): Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

Paper overview: Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs

arxiv url: http://arxiv.org/abs/2509.24491v1
Date: Mon, 29 Sep 2025 09:03:36 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.883125
Title: Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
Authors: Yuanshuai Li, Yuping Yan, Junfeng Tang, Yunxuan Li, Zeqi Zheng, Yaochu Jin
Abstract summary: Multimodal Large Language Models (MLLMs) have significantly improved performance on various tasks but continue to suffer from visual hallucinations. This paper proposes Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon the authors' Semantic Curriculum Preference Pairs dataset.
Reference score (independently computed attention level): 21.509992905027023
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have significantly improved performance on various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization (DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLM alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions show that SCPO outperforms baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on general benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.
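The abstract names two ingredients that can be illustrated generically: the standard DPO loss and an easy-to-hard ordering of preference pairs. The following is a minimal sketch of those two pieces only. It is not SCPO itself, since the abstract does not specify the symmetric, bidirectional objective or how the dynamic reference model is updated. The function names, the `difficulty` field, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy and under a frozen reference model.
    Returns -log(sigmoid(beta * margin)).
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)


def easy_to_hard(pairs):
    """Order preference pairs by ascending difficulty.

    'difficulty' is a hypothetical per-pair score; the abstract only
    says the pairs are sorted by difficulty, not how it is measured.
    """
    return sorted(pairs, key=lambda p: p["difficulty"])


if __name__ == "__main__":
    pairs = [
        {"id": "b", "difficulty": 0.9},
        {"id": "a", "difficulty": 0.2},
    ]
    print([p["id"] for p in easy_to_hard(pairs)])
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy favors the chosen response more strongly than the reference does.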

Related papers