Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
- URL: http://arxiv.org/abs/2509.24491v1
- Date: Mon, 29 Sep 2025 09:03:36 GMT
- Title: Mitigating Visual Hallucinations via Semantic Curriculum Preference Optimization in MLLMs
- Authors: Yuanshuai Li, Yuping Yan, Junfeng Tang, Yunxuan Li, Zeqi Zheng, Yaochu Jin,
- Abstract summary: Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations.<n>We propose Semantic Curriculum Preference Optimization ( SCPO), a novel framework for MLLM alignment.<n> SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty.
- Score: 21.509992905027023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have significantly improved the performance of various tasks, but continue to suffer from visual hallucinations, a critical issue where generated responses contradict visual evidence. While Direct Preference Optimization(DPO) is widely used for alignment, its application to MLLMs often fails to capture fine-grained semantic differences and encourages shortcut learning. To address these challenges, we propose Semantic Curriculum Preference Optimization (SCPO), a novel framework for MLLM alignment. SCPO employs a progressive, easy-to-hard curriculum built upon our Semantic Curriculum Preference Pairs dataset, which provides fine-grained semantic contrasts sorted by difficulty. This curriculum is trained with a dynamic reference model and a novel symmetric, bidirectional objective to facilitate simultaneous learning from both textual and visual preferences. To our knowledge, SCPO is the first framework to unify semantics, symmetry, and curriculum for MLLMs alignment, effectively mitigating visual hallucinations. Extensive experiments on LLaVA models across various scales and versions validate that SCPO demonstrates superior performance compared to baseline models on multiple hallucination benchmarks, reducing the hallucination rate by up to 62.9%. Moreover, evaluations on generalized benchmarks show that SCPO improves factuality while preserving general capabilities, with its performance remaining stable across general vision-language benchmarks.
Related papers
- MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification [16.895269678640595]
In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides.<n>Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement.<n>We propose the MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation.
arXiv Detail & Related papers (2026-02-24T13:17:35Z) - Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation [60.33386541343322]
We propose a Multimodal Large Language Models framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec)<n>Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness.
arXiv Detail & Related papers (2025-11-24T04:10:46Z) - Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization [69.05600758833471]
Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs)<n>We propose a Symmetric Multimodal Preference Optimization (SymMPO) which conducts symmetric preference learning with direct preference supervision (i.e., response pairs)<n>In addition to conventional ordinal preference learning, SymMPO introduces a preference margin consistency loss to quantitatively regulate the preference gap between symmetric preference pairs.
arXiv Detail & Related papers (2025-06-13T12:29:15Z) - Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs [74.74767980885758]
We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework.<n>CcDPO enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details.<n> Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains.
arXiv Detail & Related papers (2025-05-28T14:24:02Z) - AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization [26.03204301595711]
We propose an Adaptive Vision-enhanced Preference optimization (AdaViP) that addresses limitations through two key innovations.<n> vision-based preference pair construction integrates multiple visual foundation models to strategically remove key visual elements from the image.<n>AdaViP-7B achieves 93.7% and 96.4% reductions in response-level and mentioned-level hallucination respectively on the Object HalBench.
arXiv Detail & Related papers (2025-04-22T06:19:38Z) - Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [40.77611907215627]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies.<n>We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset.<n>We also introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
arXiv Detail & Related papers (2025-02-18T18:59:57Z) - Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective [3.2418962303343863]
This paper categorizes and reviews 34 Vision Large Language Models (VLLMs) from top conferences, journals, and highly cited Arxiv papers.<n>We first introduce the architecture of Large Language Models and parameter-efficient learning methods, followed by a discussion on vision encoders and a comprehensive taxonomy of modality encoders.
arXiv Detail & Related papers (2025-02-03T17:01:59Z) - CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.<n>We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.<n>We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.<n>This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data.<n>We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs)<n>Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO)<n>We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [48.455597568212944]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.<n>In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data.
arXiv Detail & Related papers (2024-10-10T17:59:22Z) - Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing.<n>Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image.<n>We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.