Related papers: Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding

URL: http://arxiv.org/abs/2602.11737v1
Date: Thu, 12 Feb 2026 09:04:28 GMT
Title: Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
Authors: Boqi Chen, Xudong Liu, Jianing Qiu,
Abstract summary: We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD). We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal.
Score: 17.902539922664563
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
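The auxiliary-view construction described in the abstract can be illustrated with a small sketch. It assumes a per-patch saliency grid is already available (for example, derived from the attention of a self-supervised Vision Transformer) and blanks out the highest-scoring patches. The function name `mask_salient_patches`, the `mask_ratio` and `fill` parameters, the top-k selection rule, and the use of a 2D grayscale image are all assumptions made for illustration; the abstract does not specify how the paper selects or fills the masked regions.

```python
def mask_salient_patches(image, saliency, patch, mask_ratio=0.25, fill=0.0):
    """Return a copy of `image` with the most salient patches overwritten.

    image    : 2D list of pixel values (grayscale, for simplicity)
    saliency : 2D list with one score per patch (assumed to come from ViT attention)
    patch    : side length of a square patch, in pixels
    All names and the top-k rule are hypothetical, not taken from the paper.
    """
    rows, cols = len(saliency), len(saliency[0])
    # Rank patches from most to least salient.
    ranked = sorted(
        ((saliency[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    # Always mask at least one patch.
    k = max(1, round(mask_ratio * rows * cols))
    out = [row[:] for row in image]  # leave the input untouched
    for _, r, c in ranked[:k]:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                out[y][x] = fill
    return out


# Toy usage: a 4x4 image split into 2x2 patches; the top-right patch is most salient.
image = [[1.0] * 4 for _ in range(4)]
saliency = [[0.1, 0.9], [0.2, 0.3]]
auxiliary_view = mask_salient_patches(image, saliency, patch=2)
```

Because the saliency grid depends only on the image, it could be computed once and reused across prompts, which is consistent with the "single cacheable forward pass" mentioned in the abstract.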

Related papers

Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that helps MLLMs learn more discriminative visual representations. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding [53.068815533016355]
We propose image head Masked Contrastive Decoding (MaskCD) for large vision-language models (LVLMs). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs.
arXiv Detail & Related papers (2025-10-03T07:59:16Z)
Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding [18.980167452015966]
We propose a simple approach called Layer Contrastive Decoding (LayerCD). LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels. We conduct extensive experiments on two benchmarks and show that LayerCD significantly outperforms the current state of the art.
arXiv Detail & Related papers (2025-09-29T17:59:16Z)
VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning [69.64660280965971]
VideoAnchor is a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining. We show consistent performance gains on benchmarks with InternVL2-8B and Q2.5VL-72B. Our code will be made public at https://github.com/feufhd/VideoAnchor.
arXiv Detail & Related papers (2025-09-29T17:54:04Z)
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection [49.26064449816502]
We propose a Gradient-based Influence-Aware Constrained Decoding (GACD) method to address text-visual bias and co-occurrence bias. GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
arXiv Detail & Related papers (2025-09-03T08:13:52Z)
SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding [5.976839106353883]
SECOND: Selective and Contrastive Decoding is a novel approach that enables Vision-Language Models to leverage multi-scale visual information in an object-centric manner. SECOND significantly reduces perceptual hallucinations and outperforms a wide range of benchmarks.
arXiv Detail & Related papers (2025-06-10T02:55:38Z)
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models [54.234657224615354]
Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training.
arXiv Detail & Related papers (2025-01-06T00:39:31Z)
ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models [11.75855265467876]
We introduce ConVis, a training-free contrastive decoding method. Our experiments on five popular benchmarks demonstrate that ConVis effectively reduces hallucinations across various MLLMs.
arXiv Detail & Related papers (2024-08-25T18:02:36Z)
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [125.05295513481035]
We introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families.
arXiv Detail & Related papers (2023-11-28T16:26:35Z)
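The contrast between output distributions from original and distorted visual inputs, which several entries above build on, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any paper's reference implementation: the function names, the `alpha` and `beta` parameters and their defaults, and the plausibility cutoff relative to the top original probability are chosen here for illustration, following the commonly cited form of contrastive decoding.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_distribution(logits_orig, logits_aux, alpha=1.0, beta=0.1):
    """Contrast next-token logits from the original and the auxiliary (distorted) view.

    alpha : contrast strength (assumed parameter name)
    beta  : plausibility cutoff relative to the top original probability (assumed)
    """
    p_orig = softmax(logits_orig)
    cutoff = beta * max(p_orig)
    # Amplify tokens whose score drops when visual evidence is removed.
    contrast = [
        (1 + alpha) * lo - alpha * la for lo, la in zip(logits_orig, logits_aux)
    ]
    # Keep only tokens that were plausible under the original view.
    masked = [
        c if p >= cutoff else float("-inf") for c, p in zip(contrast, p_orig)
    ]
    return softmax(masked)


# Toy vocabulary of three tokens: token 0 depends on the image,
# token 1 is driven by the language prior, token 2 is implausible.
logits_orig = [2.0, 1.9, -3.0]
logits_aux = [0.0, 1.9, -3.0]
p_contrast = contrastive_distribution(logits_orig, logits_aux)
```

In this toy case the token whose logit falls once the image is distorted gains probability mass relative to the token that the language prior alone already supports, which is the intended effect of the contrast.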

This list is automatically generated from the titles and abstracts of the papers on this site.