Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models
- URL: http://arxiv.org/abs/2502.16842v1
- Date: Mon, 24 Feb 2025 05:00:52 GMT
- Title: Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models
- Authors: Yaqi Sun, Kyohei Atarashi, Koh Takeuchi, Hisashi Kashima
- Abstract summary: Large Vision-Language Models (LVLMs) integrate image encoders with Large Language Models (LLMs) to process multi-modal inputs and perform complex visual tasks. They often generate hallucinations by describing non-existent objects or attributes, compromising their reliability. This study analyzes hallucination patterns in image captioning, showing that not all tokens in the generation process are influenced by image input.
- Score: 24.241691571850403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) integrate image encoders with Large Language Models (LLMs) to process multi-modal inputs and perform complex visual tasks. However, they often generate hallucinations by describing non-existent objects or attributes, compromising their reliability. This study analyzes hallucination patterns in image captioning, showing that not all tokens in the generation process are influenced by image input and that image dependency can serve as a useful signal for hallucination detection. To address this, we develop an automated pipeline to identify hallucinated objects and train a token-level classifier using hidden representations from parallel inference passes, with and without image input. Leveraging this classifier, we introduce a decoding strategy that effectively controls hallucination rates in image captioning at inference time.
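As a point of reference, the sketch below shows one way such a token-level classifier might be structured, assuming the hidden states of the two parallel passes (with and without the image) are concatenated and scored by a small MLP. The class name, layer sizes, and toy usage are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumptions, not the paper's implementation): a token-level
# classifier over the hidden states of two parallel decoding passes, one with
# and one without the image, producing a per-token hallucination probability.
import torch
import torch.nn as nn


class TokenHallucinationClassifier(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Takes the concatenated hidden states of the two passes for one token.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h_with_image: torch.Tensor, h_without_image: torch.Tensor) -> torch.Tensor:
        # Returns a per-token hallucination probability in [0, 1].
        features = torch.cat([h_with_image, h_without_image], dim=-1)
        return torch.sigmoid(self.mlp(features)).squeeze(-1)


# Toy usage: a batch of 4 tokens with 4096-dim hidden states from each pass.
if __name__ == "__main__":
    clf = TokenHallucinationClassifier(hidden_dim=4096)
    h_img = torch.randn(4, 4096)     # hidden states computed with the image
    h_no_img = torch.randn(4, 4096)  # hidden states computed without the image
    p_hallucination = clf(h_img, h_no_img)
    # At decoding time, tokens with a high predicted probability could be
    # penalized or re-sampled to control the caption's hallucination rate.
    print(p_hallucination)
```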
Related papers
- Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations [82.42811602081692]
This paper introduces a subsequence association framework to systematically trace and understand hallucinations.
The key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones.
We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts.
arXiv Detail & Related papers (2025-04-17T06:34:45Z) - PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training [56.172959986096316]
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs).
HalFscore is a novel metric built upon the language graph and is designed to evaluate both the accuracy and completeness of dense captions at a granular level.
PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations.
arXiv Detail & Related papers (2025-03-09T07:07:03Z) - Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow [32.039946174953236]
Large vision-language models show tremendous potential in understanding visual information through human languages.
They are prone to object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image.
We propose Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing hallucination noise.
arXiv Detail & Related papers (2025-02-28T05:56:23Z) - Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce Self-Correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z) - Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning.
They often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination.
Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage.
We propose a novel hallucination-mitigation method from the visual encoding perspective: Visual Layer Fusion Contrastive Decoding (VaLiD).
arXiv Detail & Related papers (2024-11-24T13:42:02Z) - From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models [15.401221354325672]
Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not present in the visual input.
Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to extract or decouple visual features.
In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling).
arXiv Detail & Related papers (2024-10-09T11:46:32Z) - Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization [123.54980913741828]
Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data.
They invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images.
Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information (see the generic decoding sketch after this list).
However, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations.
arXiv Detail & Related papers (2024-05-24T08:46:31Z) - Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [40.930238150365795]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback. We generate a small-scale hallucination annotation dataset using proprietary models. Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for training a hallucination-mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z) - Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding [36.81476620057058]
Large Vision-Language Models (LVLMs) are susceptible to object hallucinations.
Current approaches often rely on the model's token likelihoods or other internal information.
We introduce our CLIP-Guided Decoding approach to reduce object hallucination at decoding time.
arXiv Detail & Related papers (2024-02-23T12:57:16Z) - Mitigating Open-Vocabulary Caption Hallucinations [33.960405731583656]
We propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting.
Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations.
To mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa.
arXiv Detail & Related papers (2023-12-06T17:28:03Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces AutoHall, a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
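Several of the related papers above mitigate hallucinations by adjusting token scores at decoding time, notably the visual contrastive decoding line of work. The sketch below illustrates a generic contrastive decoding step in that spirit; the score formula, the `alpha` weight, and the use of a no-image pass as the contrast branch are assumptions for illustration and do not reproduce any single cited method.

```python
# Generic visual-contrastive-decoding step (illustrative only): favor tokens
# whose likelihood depends on the image by contrasting logits computed with
# and without the visual input.
import torch


def contrastive_next_token(logits_with_image: torch.Tensor,
                           logits_without_image: torch.Tensor,
                           alpha: float = 1.0) -> int:
    # Contrastive score: (1 + alpha) * l_img - alpha * l_no_img.
    scores = (1.0 + alpha) * logits_with_image - alpha * logits_without_image
    return int(torch.argmax(scores).item())


# Toy usage over a vocabulary of 8 tokens.
if __name__ == "__main__":
    l_img = torch.randn(8)     # next-token logits with the image
    l_no_img = torch.randn(8)  # next-token logits without the image
    print(contrastive_next_token(l_img, l_no_img, alpha=1.0))
```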
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.