Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
- URL: http://arxiv.org/abs/2505.19678v1
- Date: Mon, 26 May 2025 08:36:10 GMT
- Title: Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
- Authors: Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Tao Liang, Guojun Ma, Shu-Tao Xia
- Abstract summary: Large Vision-Language Models (LVLMs) are susceptible to hallucinations. We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
- Score: 42.871396640891334
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods that focus solely on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens that remain maximally relevant to the given image, while simultaneously refining the image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
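The calibration at the heart of this idea can be shown with a minimal sketch, assuming a model whose forward pass accepts an optional `image_features` argument (a hypothetical interface): contrast image-conditioned next-token log-probabilities against text-only ones, so each candidate token is scored by its pointwise mutual information with the image given the generated context. This is an illustration only; the paper's full C-PMI method additionally refines image tokens via the bi-level optimization described above.

```python
import torch
import torch.nn.functional as F

def pmi_calibrated_next_token(model, input_ids, image_features, alpha=1.0):
    """Minimal PMI-calibrated decoding sketch (illustrative, not the paper's
    exact bi-level token-purification procedure)."""
    with torch.no_grad():
        # Next-token logits conditioned on both the image and the text so far.
        logits_vision = model(input_ids, image_features=image_features).logits[:, -1, :]
        # Next-token logits conditioned on the text alone (image dropped).
        logits_text = model(input_ids, image_features=None).logits[:, -1, :]

    log_p_vision = F.log_softmax(logits_vision, dim=-1)
    log_p_text = F.log_softmax(logits_text, dim=-1)

    # Pointwise mutual information of each candidate token with the image,
    # given the textual context: log p(y | x, v) - log p(y | x).
    pmi = log_p_vision - log_p_text

    # Calibrated score: prefer tokens that both fit the context and are
    # grounded in the image; alpha controls the calibration strength.
    calibrated = log_p_vision + alpha * pmi
    return calibrated.argmax(dim=-1)
```

In greedy decoding this score would replace the plain argmax over log p(y | x, v); per the abstract, the full method alternates such a text-token step with a refinement of the image tokens most relevant to the generated response.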
Related papers
- Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observation that the MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens. Based on this observation, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces the MLLM's reliance on language priors.
arXiv Detail & Related papers (2025-03-11T11:52:37Z)
- PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training [56.172959986096316]
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs). HalFscore is a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations.
arXiv Detail & Related papers (2025-03-09T07:07:03Z)
- Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce Self-Correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. However, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding [14.701135083174918]
Large Vision-Language Models (LVLMs) generate detailed and coherent responses from visual inputs. However, they are prone to generating hallucinations due to an over-reliance on language priors. We propose a novel method, Summary-Guided Decoding (SumGD).
arXiv Detail & Related papers (2024-10-17T08:24:27Z)
- Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models [30.26685485474035]
Large Vision-Language Models (LVLMs) have rapidly advanced in recent years. The prevalent issue known as the 'hallucination' problem has emerged as a significant bottleneck. We propose a simple yet effective method named Self-Introspective Decoding (SID).
arXiv Detail & Related papers (2024-08-04T13:50:17Z)
- CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models [51.70129969269271]
We introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE).
Our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs.
arXiv Detail & Related papers (2024-06-04T03:04:21Z)
- IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding [37.16880672402059]
Over-reliance on linguistic priors has been identified as a key factor leading to hallucinations.
We propose to alleviate this problem by introducing a novel image-biased decoding technique.
Our method derives the next-token probability distribution by contrasting predictions from a conventional LVLM with those of an image-biased LVLM (a minimal sketch of this contrastive idea follows this entry).
arXiv Detail & Related papers (2024-02-28T16:57:22Z)
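Several of the entries above (notably CODE and IBD) share a contrastive-decoding flavor. Below is a minimal sketch of that generic combination, assuming the two logit tensors are already available from a conventional pass and an image-biased pass; how the biased pass is obtained is not specified here, and the exact formula differs across the cited papers.

```python
import torch

def contrastive_logits(logits_standard, logits_biased, gamma=0.5):
    # Generic contrastive combination (not any cited paper's exact formula):
    # amplify what the image-biased prediction adds over the conventional one.
    return (1.0 + gamma) * logits_biased - gamma * logits_standard

# Toy usage with random logits standing in for two LVLM forward passes
# (hypothetical shapes: batch=1, vocab=32000).
logits_std = torch.randn(1, 32000)   # conventional LVLM pass
logits_img = torch.randn(1, 32000)   # image-biased pass (assumed given)
next_token = contrastive_logits(logits_std, logits_img).argmax(dim=-1)
```

Here gamma controls how strongly the difference between the two predictions is amplified; gamma = 0 recovers the image-biased distribution unchanged.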
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.