Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2505.19474v1
- Date: Mon, 26 May 2025 03:53:00 GMT
- Title: Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
- Authors: Xinmiao Hu, Chun Wang, Ruihe An, ChenYu Shao, Xiaojun Ye, Sheng Zhou, Liangcheng Li
- Abstract summary: Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks. They often suffer from object hallucinations--generating descriptions of objects that are inconsistent with or entirely absent from the input. We propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention.
- Score: 12.31624700475112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, yet they often suffer from object hallucinations--generating descriptions of objects that are inconsistent with or entirely absent from the input. This issue is closely related to dataset biases, where frequent co-occurrences of objects lead to entangled semantic representations across modalities. As a result, models may erroneously activate object representations that are commonly associated with the input but not actually present. To address this, we propose a causality-driven disentanglement framework that mitigates hallucinations through causal intervention. Our approach includes a Causal-Driven Projector in the visual pathway and a Causal Intervention Module integrated into the final transformer layer of the language model. These components work together to reduce spurious correlations caused by biased training data. Experimental results show that our method significantly reduces hallucinations while maintaining strong performance on multiple multimodal benchmarks. Visualization analyses further confirm improved separability of object representations. The code is available at: https://github.com/IgniSavium/Causal-LLaVA
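To make the idea of a causal intervention that disentangles co-occurrence bias more concrete, below is a minimal, hypothetical sketch of a backdoor-adjustment-style projector: visual tokens are projected to the language-model width, an estimated confounder (a learned dictionary of co-occurrence prototypes) is marginalized in, and the fused representation is passed on. All module names, dimensions, and the learned confounder dictionary are illustrative assumptions; this is not the paper's released Causal-Driven Projector or Causal Intervention Module (see the linked GitHub repository for the actual code).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackdoorAdjustedProjector(nn.Module):
    """Illustrative visual projector that marginalizes over a learned dictionary
    of confounder prototypes (backdoor-adjustment style). Hypothetical sketch,
    not the paper's implementation."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, num_confounders: int = 64):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)
        # Prototypes standing in for frequently co-occurring "context" objects.
        self.confounders = nn.Parameter(0.02 * torch.randn(num_confounders, llm_dim))
        self.query = nn.Linear(llm_dim, llm_dim)
        self.fuse = nn.Linear(2 * llm_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vis_dim)
        h = self.proj(visual_feats)                                   # (B, P, D)
        # P(z_k | x): how strongly each confounder prototype is implied by each token.
        logits = torch.einsum("bpd,kd->bpk", self.query(h), self.confounders)
        weights = F.softmax(logits / h.size(-1) ** 0.5, dim=-1)       # (B, P, K)
        # E[z | x]: expected confounder effect per visual token.
        confound = torch.einsum("bpk,kd->bpd", weights, self.confounders)
        # Condition downstream generation on both the token and its estimated
        # confounder, approximating do(visual input) rather than plain conditioning.
        return self.fuse(torch.cat([h, confound], dim=-1))

# Quick shape check with LLaVA-like sizes (576 CLIP patches):
# tokens = BackdoorAdjustedProjector()(torch.randn(2, 576, 1024))  # -> (2, 576, 4096)
```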
Related papers
- PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs [23.69973859198496]
Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering. They often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information. We introduce MMed-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs.
arXiv Detail & Related papers (2025-06-22T05:11:46Z)
- Mitigating Object Hallucination via Robust Local Perception Search [11.570368427723961]
Local Perception Search (LPS) is a decoding method applied during inference that is both simple and training-free, yet effectively suppresses hallucinations. We show that LPS significantly reduces the incidence of hallucinations compared to the baseline, with particularly strong performance in noisy settings.
arXiv Detail & Related papers (2025-06-07T09:27:26Z)
- MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM [58.2298313720146]
Multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations.
arXiv Detail & Related papers (2025-05-30T05:54:36Z)
- Towards Long Context Hallucination Detection [49.195854802543714]
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. They are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. We propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations.
arXiv Detail & Related papers (2025-04-28T03:47:05Z)
- ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models [28.24397677839652]
Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). We propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model's middle layers (see the sketch after this list). VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.
arXiv Detail & Related papers (2025-03-17T12:30:40Z)
- Treble Counterfactual VLMs: A Causal Approach to Hallucination [6.3952983618258665]
Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. They often generate hallucinated outputs inconsistent with the visual context or prompt. Existing studies link hallucination to statistical biases, language priors, and biased feature learning, but lack a structured causal understanding.
arXiv Detail & Related papers (2025-03-08T11:13:05Z)
- EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models [54.234657224615354]
Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training.
arXiv Detail & Related papers (2025-01-06T00:39:31Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena. We present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. Motivated by this, we propose DeCo, a novel dynamic correction decoding method for MLLMs that adaptively selects the appropriate preceding layers and proportionally integrates their knowledge into the final layer to adjust the output logits (see the sketch after this list).
arXiv Detail & Related papers (2024-10-15T16:57:44Z)
- Lower Layer Matters: Alleviating Hallucination via Multi-Layer Fusion Contrastive Decoding with Truthfulness Refocused [44.37155553647802]
Large Language Models (LLMs) have demonstrated exceptional performance across various natural language processing tasks.
They occasionally yield content that is factually inaccurate or discordant with the expected output.
Recent works have investigated contrastive decoding between the original model and an amateur model with induced hallucination.
We introduce a novel contrastive decoding framework termed LOL (LOwer Layer Matters); a generic contrastive-decoding sketch appears after this list.
arXiv Detail & Related papers (2024-08-16T14:23:59Z)
- Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [134.91774666260338]
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes.
We propose a framework for cross-modal causal relational reasoning to address the task of event-level visual question answering.
arXiv Detail & Related papers (2022-07-26T04:25:54Z)
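For a few of the entries above, a rough sketch can make the decoding-time mechanisms easier to picture. On the ClearSight / Visual Amplification Fusion entry, "enhancing attention to visual signals in the middle layers" can be approximated as rescaling the attention that text queries place on the visual-token span and renormalizing. The layer range, the boost factor, and the assumption that per-head attention probabilities are accessible are all illustrative, not VAF's actual procedure.
```python
import torch

def amplify_visual_attention(attn_probs: torch.Tensor,
                             visual_span: tuple[int, int],
                             boost: float = 1.5) -> torch.Tensor:
    """attn_probs: (batch, heads, q_len, k_len) post-softmax attention.
    Upweights the columns covering the visual tokens, then renormalizes
    each row so it still sums to 1. Applied only in selected middle layers."""
    start, end = visual_span
    boosted = attn_probs.clone()
    boosted[..., start:end] = boosted[..., start:end] * boost
    return boosted / boosted.sum(dim=-1, keepdim=True)

# Example: 8 heads, 640-token sequence whose visual tokens sit at positions 5..580.
probs = torch.softmax(torch.randn(1, 8, 640, 640), dim=-1)
reweighted = amplify_visual_attention(probs, visual_span=(5, 581), boost=1.5)
```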
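On the DeCo entry, the core idea of integrating knowledge from preceding layers into the final logits can be sketched as mixing an intermediate layer's next-token logits with the last layer's. The layer-selection rule used here (pick the most confident intermediate layer) and the fixed mixing weight are placeholders; DeCo's actual dynamic selection and proportional integration differ.
```python
import torch
import torch.nn.functional as F

def corrected_logits(layer_hiddens: list, lm_head, alpha: float = 0.5) -> torch.Tensor:
    """layer_hiddens: per-layer hidden states at the current position, each
    (batch, d_model); lm_head: shared unembedding mapping d_model -> vocab.
    Mixes logits from one earlier layer into the final layer's logits."""
    final = lm_head(layer_hiddens[-1])
    # Candidate earlier layers: the upper half of the stack, excluding the last.
    cands = [lm_head(h) for h in layer_hiddens[len(layer_hiddens) // 2 : -1]]
    # Placeholder selection rule: the candidate with the highest peak probability.
    conf = torch.stack([F.softmax(c, dim=-1).max(dim=-1).values.mean() for c in cands])
    chosen = cands[int(conf.argmax())]
    return (1.0 - alpha) * final + alpha * chosen
```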
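On the LOL entry, the baseline it builds on, contrastive decoding between an expert model and a hallucination-prone amateur, can be sketched as scoring tokens by the gap between the two models' log-probabilities under a plausibility cutoff. The weights and the cutoff are generic; LOL's multi-layer fusion and truthfulness refocusing are not reproduced here.
```python
import math
import torch
import torch.nn.functional as F

def contrastive_next_token(expert_logits: torch.Tensor,
                           amateur_logits: torch.Tensor,
                           beta: float = 1.0,
                           tau: float = 0.1) -> torch.Tensor:
    """Generic contrastive decoding step. expert_logits / amateur_logits:
    (batch, vocab). Prefers tokens the expert favors far more than the
    hallucination-inducing amateur, restricted to the expert's plausible set."""
    expert_logp = F.log_softmax(expert_logits, dim=-1)
    amateur_logp = F.log_softmax(amateur_logits, dim=-1)
    scores = expert_logp - beta * amateur_logp
    # Adaptive plausibility constraint: drop tokens far below the expert's best.
    cutoff = expert_logp.max(dim=-1, keepdim=True).values + math.log(tau)
    scores = scores.masked_fill(expert_logp < cutoff, float("-inf"))
    return scores.argmax(dim=-1)  # greedy pick; sampling works the same way
```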