Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
- URL: http://arxiv.org/abs/2402.15300v2
- Date: Tue, 23 Apr 2024 09:32:25 GMT
- Title: Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding
- Authors: Ailin Deng, Zhirui Chen, Bryan Hooi
- Abstract summary: Large Vision-Language Models (LVLMs) are susceptible to object hallucinations.
Current approaches often rely on the model's token likelihoods or other internal information.
We introduce our CLIP-Guided Decoding approach to reduce object hallucination at decoding time.
- Score: 36.81476620057058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) are susceptible to object hallucinations, an issue in which their generated text contains non-existent objects, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, instruction tuning on additional datasets, or incorporating complex external tools. We first perform an empirical analysis of sentence-level LVLM hallucination, finding that CLIP similarity to the image is a stronger and more robust indicator of hallucination than token likelihoods. Motivated by this, we introduce CLIP-Guided Decoding (CGD), a straightforward but effective training-free approach that reduces object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process by enhancing the visual grounding of the generated text in the image. Experiments demonstrate that CGD effectively mitigates object hallucination across multiple LVLM families while preserving the utility of text generation. Code is available at https://github.com/d-ailin/CLIP-Guided-Decoding.
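The abstract describes CGD only at a high level: candidate continuations are scored by CLIP image-text similarity so that decoding favors visually grounded text. A minimal sketch of that re-ranking idea, assuming a sentence-level candidate set already sampled from the LVLM and an off-the-shelf Hugging Face CLIP checkpoint (this is not the authors' released implementation):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_scores(image: Image.Image, texts: list[str]) -> torch.Tensor:
    """Image-text similarity scores (scaled cosine similarities) for each candidate."""
    inputs = processor(text=texts, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
    return out.logits_per_image.squeeze(0)  # shape: (num_texts,)

def pick_grounded_candidate(image: Image.Image, candidates: list[str]) -> str:
    """Among candidate continuations sampled from the LVLM, keep the one
    that CLIP judges to be best grounded in the image."""
    scores = clip_scores(image, candidates)
    return candidates[int(scores.argmax())]
```

In a full decoding loop, such a CLIP score would presumably be combined with the model's own likelihoods rather than used alone; the paper's repository has the exact formulation.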
Related papers
- CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base [29.477973983931083]
We propose CutPaste&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs.
At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations.
We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs.
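The abstract does not spell out the scaling factor, so the following is only a hedged illustration of the stated issue: raw image-text similarities are suboptimal even for ground-truth pairs, so a calibration value estimated from known matched pairs can rescale scores before thresholding. Function names and the threshold are placeholders, not from the paper.

```python
def refine_similarity(raw_sim: float, reference_sim: float) -> float:
    """Rescale a raw image-text similarity by a reference value measured on
    known ground-truth pairs, so well-aligned pairs map close to 1.0."""
    return min(raw_sim / max(reference_sim, 1e-6), 1.0)

def is_hallucinated(raw_sim: float, reference_sim: float, threshold: float = 0.5) -> bool:
    """Flag an entity/attribute claim whose refined similarity falls below a threshold."""
    return refine_similarity(raw_sim, reference_sim) < threshold
```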
arXiv Detail & Related papers (2025-02-18T07:06:36Z) - Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input.
We introduce Self-Correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
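Only as a hedged sketch of the stated idea (feedback from a text-to-image model), not the DeGF algorithm itself: one could re-synthesize an image from the draft response and measure its agreement with the actual input. The checkpoints below and the way such a score would be folded back into decoding are assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

# Any text-to-image checkpoint and image encoder would do; these are illustrative.
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def generative_feedback_score(input_image, draft_response: str) -> float:
    """Cosine similarity between the input image and an image re-synthesized
    from the draft response; a low value suggests weak visual grounding."""
    synthesized = t2i(draft_response, num_inference_steps=25).images[0]
    return float((image_embedding(input_image) @ image_embedding(synthesized).T).item())
```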
arXiv Detail & Related papers (2025-02-10T03:43:55Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
However, LVLMs still suffer from hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in multimodal task reasoning.
We propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD).
arXiv Detail & Related papers (2024-11-24T13:42:02Z) - MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose DeCo, a novel dynamic correction decoding method for MLLMs.
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks.
These models still suffer from multimodal hallucination, i.e., generating objects or content that conflicts with the given images.
We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z) - Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning? [29.237078890377514]
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content.
Using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image.
This paper proposes a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics.
arXiv Detail & Related papers (2024-06-18T14:33:56Z) - Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments on a broad set of standard benchmarks, we show that POVID not only reduces hallucinations but also improves overall model performance, outperforming prior approaches.
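As a hedged illustration of the data-construction step described above, the snippet below builds preference pairs in which the ground-truth answer is the preferred response and an AI-corrupted answer is the dispreferred one. The dataclass and the `corrupt_fn` hook are placeholders; the paper's actual two-stage pipeline is more involved.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    image_path: str
    prompt: str
    chosen: str    # ground-truth answer, used as the preferred response
    rejected: str  # AI-generated hallucinatory variant, used as the dispreferred response

def build_pair(image_path: str, prompt: str, ground_truth: str,
               corrupt_fn: Callable[[str], str]) -> PreferencePair:
    """corrupt_fn is any callable (e.g., a prompted LLM) that rewrites the
    answer to inject plausible but non-existent objects or attributes."""
    return PreferencePair(image_path, prompt,
                          chosen=ground_truth,
                          rejected=corrupt_fn(ground_truth))
```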
arXiv Detail & Related papers (2024-02-18T00:56:16Z) - Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding [125.05295513481035]
We introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs.
The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations.
Our experiments show that VCD, without additional training or the use of external tools, significantly mitigates the object hallucination issue across different LVLM families.
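The contrastive step VCD describes can be sketched as follows: next-token logits conditioned on the original image are contrasted with logits conditioned on a distorted (noised) copy, down-weighting tokens the distorted view also prefers. The noising function below is a simplification, and VCD's additional adaptive plausibility constraint is omitted.

```python
import torch

def vcd_logits(logits_original: torch.Tensor,
               logits_distorted: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Contrastive next-token logits: (1 + alpha) * l_orig - alpha * l_distorted."""
    return (1 + alpha) * logits_original - alpha * logits_distorted

def distort_image(pixel_values: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Simple Gaussian-noise distortion of the visual input (a stand-in for the
    diffusion-style noising used in the paper)."""
    noise = torch.randn_like(pixel_values)
    return (1 - t) ** 0.5 * pixel_values + t ** 0.5 * noise
```

At each decoding step, the LVLM would be run twice (on the original and the distorted image) and the adjusted logits sampled from instead of the raw ones.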
arXiv Detail & Related papers (2023-11-28T16:26:35Z)