Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
- URL: http://arxiv.org/abs/2506.04039v1
- Date: Wed, 04 Jun 2025 15:03:50 GMT
- Title: Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
- Authors: Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang,
- Abstract summary: Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment.
We propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment.
EMPO reduces hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.
- Score: 40.68121267969432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to modality misalignment and to the inherent hallucinations of their underlying Large Language Model (LLM) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on the LLM backbone and in hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves stronger modality alignment than existing human preference alignment methods. In addition, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.
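The abstract gives no implementation details, so as a rough, minimal sketch of the DPO-style objective that EMPO and the related preference-alignment methods below build on, the snippet here shows how a preference pair (a chosen vs. a rejected response to the same image and instruction) is typically turned into a training loss. Function and variable names, the beta value, and the toy log-probabilities are all assumptions for illustration, not code from the paper.

```python
# Minimal DPO-style preference loss sketch (illustrative only; not the authors' code).
import torch
import torch.nn.functional as F


def dpo_preference_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen response | image, instruction)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected response | image, instruction)
    ref_chosen_logps: torch.Tensor,       # same quantities under a frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Push the policy to prefer the chosen response over the rejected one,
    measured relative to the frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Toy usage with made-up sequence log-probabilities for a batch of two pairs.
loss = dpo_preference_loss(
    policy_chosen_logps=torch.tensor([-12.3, -15.1]),
    policy_rejected_logps=torch.tensor([-14.0, -15.8]),
    ref_chosen_logps=torch.tensor([-12.9, -15.4]),
    ref_rejected_logps=torch.tensor([-13.5, -15.5]),
)
print(loss.item())
```

Per the abstract, EMPO builds preference data along three aspects (image, instruction, and response), so presumably a loss of this form is applied with the contrast taken over each aspect rather than over responses alone.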
Related papers
- Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs [74.74767980885758]
We propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework.
CcDPO enhances per-image perception in multi-image settings by zooming into visual clues -- from sequential context to local details.
Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains.
arXiv Detail & Related papers (2025-05-28T14:24:02Z)
- Mitigating Hallucinated Translations in Large Language Models with Hallucination-focused Preference Optimization [1.9204566034368082]
Machine Translation systems are at a higher risk of generating hallucinations.
We propose a method that intrinsically learns to mitigate hallucinations during the model training phase.
Our approach reduces hallucinations by 89% on average across three unseen target languages.
arXiv Detail & Related papers (2025-01-28T20:58:43Z)
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.
We propose Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.
We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z)
- Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
We propose a new framework for generating synthetic data that uses a reward model as a proxy for human preference, enabling effective multimodal alignment with DPO training.
Experimental results indicate that integrating selected synthetic data, such as that from generative and reward models, can effectively reduce reliance on human-annotated data.
arXiv Detail & Related papers (2024-12-23T09:29:40Z)
- A Topic-level Self-Correctional Approach to Mitigate Hallucinations in MLLMs [34.71750379630014]
We introduce Topic-level Preference Overwriting (TPO), a self-correctional approach that guides the model itself to mitigate its own hallucinations at the topic level.
Results show TPO achieves state-of-the-art performance in trustworthiness, significantly reducing object hallucinations by 92% and overall hallucinations by 38%.
arXiv Detail & Related papers (2024-11-26T09:42:07Z)
- Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization [26.263592737768214]
Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications.
We introduce Hallucination-targeted Direct Preference Optimization (HDPO) to reduce hallucinations in MLLMs.
arXiv Detail & Related papers (2024-11-15T18:56:01Z)
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models [85.30735602813093]
Multi-Image Augmented Direct Preference Optimization (MIA-DPO) is a visual preference alignment approach that effectively handles multi-image inputs.
MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats.
arXiv Detail & Related papers (2024-10-23T07:56:48Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations but also improve model performance on standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization [45.53216822981202]
This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task.
When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and amplified the models' generalization capabilities.
arXiv Detail & Related papers (2023-11-28T14:54:37Z)
- Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMMs) are built across modalities, and misalignment between the two modalities can result in "hallucination".
We adapt Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment.
We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information.
Our approach achieves a remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z)
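The last entry states only that Factually Augmented RLHF supplements the reward model with additional factual information. As a toy illustration of that idea, the sketch below combines a base preference score with a crude factuality penalty computed against a provided ground-truth caption; the heuristic object check is a hypothetical stand-in for a learned reward model and is not the authors' implementation.

```python
# Toy "factually augmented" reward: penalize objects mentioned in the response
# that the ground-truth facts (e.g., a reference caption) do not support.
# Purely illustrative; a real system would use a learned reward model.
import re


def word_set(text: str) -> set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def factually_augmented_reward(response: str, facts: str, base_reward: float) -> float:
    # Hypothetical closed vocabulary of objects the toy checker knows about.
    known_objects = {"dog", "cat", "frisbee", "bench", "car", "umbrella"}
    unsupported = (known_objects & word_set(response)) - word_set(facts)
    # Each unsupported object claim costs a fixed, arbitrary penalty.
    return base_reward - 0.5 * len(unsupported)


# The response invents a frisbee that the reference caption does not mention.
facts = "a dog sitting on a bench in a park"
print(factually_augmented_reward("A dog plays with a frisbee near a bench.", facts, base_reward=1.0))
```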