MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding
- URL: http://arxiv.org/abs/2510.02790v1
- Date: Fri, 03 Oct 2025 07:59:16 GMT
- Title: MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding
- Authors: Jingyuan Deng, Yujiu Yang
- Abstract summary: We propose image head Masked Contrastive Decoding (MaskCD) for large vision-language models (LVLMs). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs.
- Score: 53.068815533016355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among these problems, hallucination has attracted much attention: the phenomenon in which LVLMs generate content that contradicts their visual and textual inputs. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle to construct appropriate contrastive samples, and attention manipulation methods are highly sensitive and lack stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD .
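The abstract describes masking "image heads" to build a contrastive sample and then decoding against it. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. The head-selection rule (attention mass on image tokens above a threshold), the combination formula (the form commonly used in contrastive decoding), and all function names and parameters (`select_image_heads`, `mask_heads`, `contrastive_logits`, `alpha`, `beta`, `threshold`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def select_image_heads(attn, image_token_idx, threshold=0.5):
    """Hypothetical selection rule: a head counts as an "image head" when
    the attention mass it puts on image tokens reaches `threshold`.

    attn: (num_heads, seq_len) attention weights of the current query.
    Returns a boolean array of shape (num_heads,), True = image head.
    """
    mass = attn[:, image_token_idx].sum(axis=1)
    return mass >= threshold


def mask_heads(head_outputs, head_mask):
    """Zero the outputs of the masked heads.

    head_outputs: (num_heads, d_head); head_mask: bool (num_heads,).
    """
    return head_outputs * (~head_mask)[:, None]


def contrastive_logits(logits_full, logits_masked, alpha=1.0, beta=0.1):
    """Contrast next-token logits from the unmodified model with logits
    from the head-masked model (assumed formula, see lead-in).

    Tokens whose full-model probability is below beta * max probability
    are excluded by setting their score to -inf.
    """
    p_full = softmax(logits_full)
    keep = p_full >= beta * p_full.max()
    scores = (1.0 + alpha) * logits_full - alpha * logits_masked
    return np.where(keep, scores, -np.inf)


# Toy attention: head 0 attends mostly to image tokens (positions 1 and 2).
attn = np.array([[0.1, 0.7, 0.1, 0.1],
                 [0.4, 0.1, 0.1, 0.4]])
image_heads = select_image_heads(attn, image_token_idx=[1, 2])
masked_outputs = mask_heads(np.ones((2, 3)), image_heads)

# Toy logits over a 4-token vocabulary. Token 0 stays likely even with the
# image heads masked, so the contrast demotes it in favor of token 1.
logits_full = np.array([2.0, 1.0, 0.5, -3.0])
logits_masked = np.array([2.5, 0.0, 0.5, -3.0])
cd = contrastive_logits(logits_full, logits_masked)
next_token = int(np.argmax(cd))
```

In a real LVLM the two logit vectors would come from two forward passes of the same model, one of them with the selected heads zeroed inside the attention layers. The toy arrays above only stand in for those passes.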
Related papers
- Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding [17.902539922664563]
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD). We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal.
arXiv Detail & Related papers (2026-02-12T09:04:28Z)
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
- Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding [18.980167452015966]
We propose a simple approach called Layer Contrastive Decoding (LayerCD). LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels. We conduct extensive experiments on two benchmarks and show that LayerCD significantly outperforms the current state of the art.
arXiv Detail & Related papers (2025-09-29T17:59:16Z)
- CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models [75.88232735646018]
Large Vision-Language Models (LVLMs) process multimodal inputs consisting of text tokens and vision tokens extracted from images or videos. Existing methods attempt to prune redundant vision tokens, revealing substantial redundancy in visual representations. We propose CoViPAL, a layer-wise contextualized visual token pruning method that employs a Plug-and-Play Pruning Module (PPM) to predict and remove redundant vision tokens before they are processed by the LVLM.
arXiv Detail & Related papers (2025-08-24T07:47:00Z)
- MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation [0.11704154007740833]
Memory-Driven Sparse Attention Matrix (MDSAM) is a training-free approach that dynamically captures and refines the attention allocated to image tokens at each layer. MDSAM memorizes attention patterns and activates updates through alignment during decoding, enhancing focus on relevant image tokens while effectively reducing hallucinations.
arXiv Detail & Related papers (2025-06-21T09:49:16Z)
- ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM [12.091189146069198]
Multimodal Large Language Models (MLLMs) often suffer from hallucinations. They over-rely on partial cues and generate incorrect responses. Recent methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations.
arXiv Detail & Related papers (2025-06-17T17:58:11Z)
- Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z)
- Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models [39.9447198156097]
Mixture of Decoding (MoD) is a novel approach for hallucination mitigation. It adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks.
arXiv Detail & Related papers (2025-05-17T09:44:18Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [125.05295513481035]
We introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs.
The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations.
Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families.
arXiv Detail & Related papers (2023-11-28T16:26:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.