DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
- URL: http://arxiv.org/abs/2601.00623v1
- Date: Fri, 02 Jan 2026 09:41:54 GMT
- Title: DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
- Authors: Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He,
- Abstract summary: Multimodal Large Language Models (MLLMs) tend to overemphasize easily distinguishable preference pairs. We propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process.
- Score: 22.299736215070343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision-language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
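The reweighting step described in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical Python rendering of a DPO loss whose per-pair terms are scaled by pre-computed difficulty scores; the tensor names, the power-law weighting function, and the `gamma` sharpness parameter are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch of difficulty-aware reweighting of the DPO objective,
# assuming per-pair difficulty scores in [0, 1] have already been estimated
# offline (e.g., by off-the-shelf generative and contrastive VLM scorers,
# as the abstract describes). The exact weighting function used by DA-DPO
# is not specified here; a monotone mapping that down-weights easy pairs
# and emphasizes hard ones is assumed.
import torch
import torch.nn.functional as F

def dpo_loss_difficulty_weighted(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape [B]
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape [B]
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x), shape [B]
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x), shape [B]
    difficulty: torch.Tensor,             # estimated difficulty in [0, 1], shape [B]
    beta: float = 0.1,
    gamma: float = 1.0,                   # sharpness of the reweighting (assumed)
) -> torch.Tensor:
    # Standard DPO logits: implicit reward margin between chosen and rejected.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-pair DPO loss; easily distinguishable pairs tend to dominate this term.
    per_pair_loss = -F.logsigmoid(logits)

    # Difficulty-aware weights: emphasize hard pairs, down-weight easy ones.
    # Normalized so the average weight is 1, keeping the loss scale comparable.
    weights = difficulty.clamp(0.0, 1.0) ** gamma
    weights = weights / (weights.mean() + 1e-8)

    return (weights * per_pair_loss).mean()
```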
Related papers
- MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization [4.088161686930475]
We propose Modality-Decoupled Direct Preference Optimization (MoD-DPO) for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities. Experiments demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines.
arXiv Detail & Related papers (2026-03-03T17:50:24Z) - Small-Margin Preferences Still Matter-If You Train Them Right [24.058773077803895]
We show that pair difficulty interacts strongly with the optimization objective. We propose MixDPO, a simple yet effective difficulty-aware training strategy. MixDPO consistently improves alignment over DPO and a range of widely-used variants.
arXiv Detail & Related papers (2026-02-01T01:15:55Z) - Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation [60.33386541343322]
We propose a Multimodal Large Language Models framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness.
arXiv Detail & Related papers (2025-11-24T04:10:46Z) - Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation [44.04325848740683]
We introduce M2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal. On challenging WMT21-22 benchmarks, M2PO substantially outperforms existing preference optimization methods.
arXiv Detail & Related papers (2025-10-15T11:30:49Z) - M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following [4.119014132092875]
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following. M3PO is a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates.
arXiv Detail & Related papers (2025-08-17T18:07:55Z) - Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining [75.14823970163685]
We propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases. It can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
arXiv Detail & Related papers (2025-05-20T03:59:05Z) - CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z) - mDPO: Conditional Preference Optimization for Multimodal Large Language Models [52.607764280030196]
Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment.
Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement.
We propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference.
arXiv Detail & Related papers (2024-06-17T17:59:58Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better on various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.