Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
- URL: http://arxiv.org/abs/2512.00706v1
- Date: Sun, 30 Nov 2025 02:55:20 GMT
- Title: Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation
- Authors: Chengzhi Yu, Yifan Xu, Yifan Chen, Wenyi Zhang
- Abstract summary: We analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data. We propose training a hallucination classifier that gives binary annotations, guaranteeing clean chosen samples for the subsequent alignment. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%.
- Score: 14.556157904513602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large vision-language models (LVLMs) have risen as a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which thus calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucination into training samples, which may reinforce the model's hallucination patterns. To address this problem, we propose training a hallucination classifier that gives binary annotations, guaranteeing clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm that adopts a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks with comparison to 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.
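The abstract describes two components: a binary hallucination classifier that filters on-policy responses into clean chosen samples, and an iterative DPO loop with dynamic sample reweighting. Below is a minimal PyTorch sketch of one plausible reading; the classifier interface, the weighting rule, and all names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def build_preference_pairs(responses, is_hallucinated):
    """Split on-policy responses into chosen/rejected sets using binary
    labels from a hallucination classifier (hypothetical callable)."""
    chosen = [r for r in responses if not is_hallucinated(r)]
    rejected = [r for r in responses if is_hallucinated(r)]
    return chosen, rejected

def reweighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss with per-sample weights; the reweighting rule below
    (down-weighting confidently separated pairs) is an assumption, as the
    abstract does not specify the exact scheme."""
    # Implicit reward margin under the DPO parameterization.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    losses = -F.logsigmoid(logits)             # per-pair DPO loss
    weights = torch.sigmoid(-logits).detach()  # dynamic per-sample weights
    return (weights * losses).sum() / weights.sum()
```

In an iterative variant, each round would sample fresh responses from the current policy, annotate them with the classifier, and update with the reweighted loss before the next round.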
Related papers
- Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization [65.12217781259525]
Existing preference alignment methods focus on aligning model responses with human preferences. We propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment. EMPO reduces hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.
arXiv Detail & Related papers (2025-06-04T15:03:50Z) - Mitigating Image Captioning Hallucinations in Vision-Language Models [13.707454974844095]
Hallucinations in vision-language models hinder reliability and real-world applicability. We propose a novel test-time adaptation framework using reinforcement learning to mitigate hallucinations during inference. Our approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation.
arXiv Detail & Related papers (2025-05-06T10:55:21Z) - Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%. DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z) - CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de facto approach for aligning language models with human preferences. Recent work has shown that DPO's effectiveness relies on training data quality. We discover that the reference model's probability space naturally detects high-quality training samples; a sketch of this idea appears after this list.
arXiv Detail & Related papers (2025-01-25T07:21:50Z) - Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key [24.229983103296988]
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). We propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses. OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark.
arXiv Detail & Related papers (2025-01-16T17:48:03Z) - Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
We propose a new framework for generating synthetic data using a reward model as a proxy for human preference, enabling effective multimodal alignment with DPO training. Experiment results indicate that integrating selected synthetic data, such as that from generative and reward models, can effectively reduce reliance on human-annotated data.
arXiv Detail & Related papers (2024-12-23T09:29:40Z) - EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO). EACO aligns MLLMs with self-generated preference data at low cost, using only 5k images. EACO reduces overall hallucinations by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision-language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
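For the reference-model-guided sampling idea above ("Clear Preferences Leave Traces"), here is a minimal sketch of one plausible filter, assuming a simple log-probability margin criterion; the threshold and interface are illustrative, not the paper's method.

```python
import torch

def filter_by_reference_margin(ref_chosen_logps: torch.Tensor,
                               ref_rejected_logps: torch.Tensor,
                               threshold: float = 0.0) -> torch.Tensor:
    """Boolean mask keeping preference pairs where the frozen reference
    model already assigns a clear log-probability margin to the chosen
    response, i.e. likely high-quality training pairs."""
    margins = ref_chosen_logps - ref_rejected_logps
    return margins > threshold
```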