OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
- URL: http://arxiv.org/abs/2509.00723v1
- Date: Sun, 31 Aug 2025 07:19:32 GMT
- Title: OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
- Authors: Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu,
- Abstract summary: We propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in Omni-modal large language models (OLLMs)<n>By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination.
- Score: 32.43796002503023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Omni-modal large language models (OLLMs) have sparked a new wave of research, achieving impressive results in tasks such as audio-video understanding and real-time environment perception. However, hallucination issues still persist. Similar to the bimodal setting, the priors from the text modality tend to dominate, leading OLLMs to rely more heavily on textual cues while neglecting visual and audio information. In addition, fully multimodal scenarios introduce new challenges. Most existing models align visual or auditory modalities with text independently during training, while ignoring the intrinsic correlations between video and its corresponding audio. This oversight results in hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to enhance the model's understanding of audio-video interactions; and (2) constructing multimodal-preference sample pairs to strengthen the model's attention to visual and auditory information. By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination. Experiments conducted on two OLLMs demonstrate that OmniDPO not only effectively mitigates multimodal hallucinations but also significantly enhances the models' reasoning capabilities across modalities. All code and datasets will be released upon paper acceptance.
Related papers
- Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding [41.828387997311474]
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering.<n>They remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects.<n>We propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference.
arXiv Detail & Related papers (2025-10-21T06:11:24Z) - Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception [97.32606786622728]
We present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark.<n>We propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data.<n>Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception.
arXiv Detail & Related papers (2025-10-14T17:00:09Z) - Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization [55.543583937522804]
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks.<n>Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate.<n>In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations.
arXiv Detail & Related papers (2025-08-27T18:02:04Z) - Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs.<n>These models often hallucinate non-existent sound events, reducing their reliability in real-world applications.<n>We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z) - Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z) - AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models [27.430040932849018]
We introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual models.<n>Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities.<n>We demonstrate that simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations.
arXiv Detail & Related papers (2024-10-23T23:36:06Z) - OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.<n>Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts.<n>We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware
Direct Preference Optimization [45.53216822981202]
This paper introduces a novel solution, Hallucination-Aware Direct Preference Optimization (HA-DPO), which reframes the hallucination problem as a preference selection task.
When applied to three mainstream multimodal models, HA-DPO significantly reduced hallucination issues and amplified the models' generalization capabilities.
arXiv Detail & Related papers (2023-11-28T14:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.