Context-Aware Decoding for Faithful Vision-Language Generation
- URL: http://arxiv.org/abs/2601.05939v1
- Date: Fri, 09 Jan 2026 16:50:57 GMT
- Title: Context-Aware Decoding for Faithful Vision-Language Generation
- Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
- Abstract summary: Hallucinations, i.e., generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs). We probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy.
- Score: 5.258492912374723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hallucinations, i.e., generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
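The abstract does not specify how the context embedding is injected; a minimal sketch of the general idea, assuming a simple convex blend of the current decoding state with the cached hidden state of the last input token (the mixing weight `alpha` and the injection site are assumptions, not the paper's specification), might look like this:

```python
import torch

def cei_step(hidden_state: torch.Tensor,
             context_embedding: torch.Tensor,
             alpha: float = 0.1) -> torch.Tensor:
    """Blend the cached context embedding (the hidden state of the last
    input token) into the current decoding hidden state.

    hidden_state:      (batch, dim) decoder state for the token being generated
    context_embedding: (batch, dim) grounding signal cached once from the prompt
    alpha:             mixing weight (hypothetical; the paper's dynamic variant
                       presumably adapts the injection strength per step)
    """
    return (1.0 - alpha) * hidden_state + alpha * context_embedding

# Toy usage: cache the final prompt token's hidden state once, then inject it
# at every generation step to keep decoding anchored to the visual context.
batch, dim = 1, 4096
context_embedding = torch.randn(batch, dim)  # stand-in for the real hidden state
step_state = torch.randn(batch, dim)         # stand-in for a decoding-step state
grounded_state = cei_step(step_state, context_embedding, alpha=0.1)
```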
Related papers
- Hallucination Begins Where Saliency Drops [18.189047289404325]
Hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token. We introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token (a toy gradient-saliency sketch appears after this list). Our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution.
arXiv Detail & Related papers (2026-01-28T05:50:52Z) - SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge. We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z) - CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z) - Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy (a minimal contrastive-rescoring sketch appears after this list). We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z) - Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models [3.9464481148889354]
We propose a novel decoding mechanism, Decoding with Inter-layer Consistency via Layer Aggregation (DCLA). Our approach constructs a dynamic semantic reference by aggregating representations from previous layers and corrects semantically deviated layers to enforce inter-layer consistency (a toy aggregation sketch appears after this list). Experiments on hallucination benchmarks such as MME and POPE demonstrate that DCLA effectively reduces hallucinations while enhancing the reliability and performance of LVLMs.
arXiv Detail & Related papers (2025-05-18T10:15:42Z) - Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observation that an MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens. Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors (a minimal reallocation sketch appears after this list).
arXiv Detail & Related papers (2025-03-11T11:52:37Z) - Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [65.4610281589017]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce Self-Correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models [15.401221354325672]
Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not present in the visual input. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to extract or decouple visual features. In this paper, we revisit hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling).
arXiv Detail & Related papers (2024-10-09T11:46:32Z) - MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier that detects hallucinations at the token level at negligible cost (a toy classifier sketch appears after this list). Based on a statistical analysis, we reveal key factors of hallucinations in Large Vision Language Models. We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z)
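The sketches below are minimal, hedged reconstructions of several techniques from the list above; none reproduces the respective authors' implementations. First, a toy version of gradient-aware token saliency: a tiny randomly initialized model stands in for an LVLM, and each preceding token's saliency toward the next-token prediction is taken as the gradient norm of the top logit with respect to that token's embedding (one common formulation; the paper's exact definition may differ).

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny model replaces an LVLM. Saliency of each
# preceding token is the gradient norm of the top next-token logit
# w.r.t. that token's embedding.
torch.manual_seed(0)
vocab, dim, seq = 100, 32, 8
embed = nn.Embedding(vocab, dim)
lm = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, vocab))

tokens = torch.randint(0, vocab, (seq,))
x = embed(tokens).detach().requires_grad_(True)  # (seq, dim) leaf tensor
logits = lm(x.mean(dim=0, keepdim=True))         # crude pooling, demo only
logits[0, logits.argmax()].backward()            # top next-token logit

saliency = x.grad.norm(dim=-1)          # per-token saliency scores
flagged = saliency < saliency.mean()    # low saliency: hallucination risk signal
print(saliency, flagged)
```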
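Next, a generic PMI-style contrastive rescoring in the spirit of C-PMI calibrated decoding: next-token candidates are scored by the gap between the image-conditioned and text-only distributions. The weight `lam` and the absence of candidate truncation are simplifications, not the paper's calibration.

```python
import torch
import torch.nn.functional as F

def cpmi_rescore(logits_with_image: torch.Tensor,
                 logits_text_only: torch.Tensor,
                 lam: float = 1.0) -> torch.Tensor:
    """Score next-token candidates by a PMI-style contrast:
    log p(y | image, text) - lam * log p(y | text)."""
    logp_grounded = F.log_softmax(logits_with_image, dim=-1)
    logp_prior = F.log_softmax(logits_text_only, dim=-1)
    return logp_grounded - lam * logp_prior

# Toy usage: tokens favored only by the language prior get demoted.
vocab = 10
logits_img = torch.randn(vocab)   # stand-in: forward pass with the image
logits_txt = torch.randn(vocab)   # stand-in: forward pass without the image
next_token = cpmi_rescore(logits_img, logits_txt, lam=0.5).argmax()
```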
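A toy version of inter-layer consistency aggregation in the spirit of DCLA: layer-wise hidden states for one token are aggregated into a reference, and deviated layers are pulled toward it. Mean aggregation, the similarity threshold `tau`, and the interpolation strength `beta` are all assumptions here.

```python
import torch
import torch.nn.functional as F

def dcla_correct(layer_states: torch.Tensor,
                 tau: float = 0.8,
                 beta: float = 0.5) -> torch.Tensor:
    """Pull semantically deviated layers toward an aggregated reference.

    layer_states: (num_layers, dim) hidden states of one token across layers
    tau:          cosine-similarity threshold marking a layer as deviated
    beta:         interpolation strength toward the reference
    """
    reference = layer_states.mean(dim=0, keepdim=True)          # dynamic reference
    sim = F.cosine_similarity(layer_states, reference, dim=-1)  # (num_layers,)
    deviated = sim < tau
    corrected = layer_states.clone()
    corrected[deviated] = (1 - beta) * layer_states[deviated] + beta * reference
    return corrected

# Toy usage on random layer states.
corrected = dcla_correct(torch.randn(32, 4096))
```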
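A minimal attention-reallocation sketch in the spirit of AttnReal: a fraction of the attention mass on historical output tokens is recycled onto visual tokens, then the distribution is renormalized. The `recycle_ratio` knob is hypothetical; the paper describes the method as controllable, but its exact parameterization is not reproduced here.

```python
import torch

def attn_reallocate(attn: torch.Tensor,
                    visual_mask: torch.Tensor,
                    output_mask: torch.Tensor,
                    recycle_ratio: float = 0.5) -> torch.Tensor:
    """Recycle a fraction of the attention mass on historical output tokens
    and redistribute it over visual tokens (proportionally to their current
    attention), then renormalize.

    attn: (num_keys,) attention weights of the current query, summing to 1
    """
    recycled = attn[output_mask].sum() * recycle_ratio
    out = attn.clone()
    out[output_mask] *= (1.0 - recycle_ratio)
    out[visual_mask] += recycled * attn[visual_mask] / attn[visual_mask].sum()
    return out / out.sum()

# Toy usage: keys 0-3 are visual tokens, keys 4-9 are prior output tokens.
attn = torch.softmax(torch.randn(10), dim=0)
visual = torch.tensor([True] * 4 + [False] * 6)
output = ~visual
rebalanced = attn_reallocate(attn, visual, output)
```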
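Finally, a MetaToken-style token-level hallucination detector as a lightweight binary classifier. The features here are synthetic stand-ins for quantities like token log-probability, entropy, and position; the paper derives its actual feature set from a statistical analysis of LVLM outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: in practice each row would hold real per-token
# statistics and each label would come from an annotated hallucination set.
rng = np.random.default_rng(0)
n_tokens = 1000
features = rng.normal(size=(n_tokens, 3))   # hypothetical 3-feature design
labels = rng.integers(0, 2, size=n_tokens)  # 1 = hallucinated token

clf = LogisticRegression().fit(features, labels)
p_halluc = clf.predict_proba(features[:5])[:, 1]  # per-token hallucination risk
```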