Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2410.03577v1
- Date: Fri, 4 Oct 2024 16:30:54 GMT
- Title: Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
- Authors: Xin Zou, Yizhou Wang, Yibo Yan, Sirui Huang, Kening Zheng, Junkai Chen, Chang Tang, Xuming Hu,
- Abstract summary: Multimodal Large Language Models (MLLMs) are susceptible to hallucinations, especially assertively fabricating content not present in the visual inputs.
We introduce Memory-space Visual Retracing (MemVR), a novel hallucination mitigation paradigm that without the need for external knowledge retrieval or additional fine-tuning.
In particular, we treat visual prompts as supplementary evidence to be reinjected into MLLMs via Feed Forward Network (FFN) as key-value memory, when the model is uncertain or even amnesic about question-relevant visual memories.
- Score: 25.386858937068478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are susceptible to hallucinations, especially assertively fabricating content not present in the visual inputs. To address the aforementioned challenge, we follow a common cognitive process - when one's initial memory of critical on-sight details fades, it is intuitive to look at them a second time to seek a factual and accurate answer. Therefore, we introduce Memory-space Visual Retracing (MemVR), a novel hallucination mitigation paradigm that without the need for external knowledge retrieval or additional fine-tuning. In particular, we treat visual prompts as supplementary evidence to be reinjected into MLLMs via Feed Forward Network (FFN) as key-value memory, when the model is uncertain or even amnesic about question-relevant visual memories. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination issues across various MLLMs and excels in general benchmarks without incurring added time overhead, thus emphasizing its potential for widespread applicability.
Related papers
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations.<n>In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z) - Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input.<n>We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z) - Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs)
We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context.
We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z) - Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning [151.4060202671114]
multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.
This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.
Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
arXiv Detail & Related papers (2024-12-15T09:10:46Z) - CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z) - Reducing Hallucinations in Vision-Language Models via Latent Space Steering [34.1755878632361]
Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications.
We introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features.
arXiv Detail & Related papers (2024-10-21T08:42:30Z) - A Survey of Hallucination in Large Visual Language Models [48.794850395309076]
The existence of hallucinations has limited the potential and practical effectiveness of LVLM in various fields.
The structure of LVLMs and main causes of hallucination generation are introduced.
The available hallucination evaluation benchmarks for LVLMs are presented.
arXiv Detail & Related papers (2024-10-20T10:58:58Z) - MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.<n>We present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers.<n>Motivated by this, we propose a novel dynamic correction decoding method for MLLMs DeCo, which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models [15.401221354325672]
Hallucinations in large vision models (LVLMs) are a significant challenge, i.e., generating objects that are not presented in the visual input.
Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to extract or decouple visual features.
In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling)
arXiv Detail & Related papers (2024-10-09T11:46:32Z) - Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning [24.270713960060142]
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension.
They still suffer from hallucination problems referring to generating inconsistent outputs with the image content.
We propose a training-free framework, textbfMVP, that aims to reduce hallucinations by making the most of the innate capabilities of the LVLMs.
arXiv Detail & Related papers (2024-08-30T09:40:10Z) - Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models [13.48296910438554]
Hallucination issues persistently plagued current multimodal large language models (MLLMs)
We introduce Reefknot, a benchmark specifically targeting relation hallucinations, consisting of over 20,000 samples derived from real-world scenarios.
Our comparative evaluation across three distinct tasks revealed a substantial shortcoming in the capabilities of current MLLMs to mitigate relation hallucinations.
arXiv Detail & Related papers (2024-08-18T10:07:02Z) - Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z) - Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models [33.19894606649144]
Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations.
We propose a framework called MMHalball to evaluate LVLMs' behaviors when encountering generated hallucinations.
We propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input.
arXiv Detail & Related papers (2024-06-30T03:04:11Z) - Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs [52.497823009176074]
Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations.
We introduce Visual Description Grounded Decoding (VDGD), a training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs.
arXiv Detail & Related papers (2024-05-24T16:21:59Z) - Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback [48.065569871444275]
We propose detecting and mitigating hallucinations in Large Vision Language Models (LVLMs) via fine-grained AI feedback.
We generate a small-size hallucination annotation dataset by proprietary models.
Then, we propose a detect-then-rewrite pipeline to automatically construct preference dataset for training hallucination mitigating model.
arXiv Detail & Related papers (2024-04-22T14:46:10Z) - OPERA: Alleviating Hallucination in Multi-Modal Large Language Models
via Over-Trust Penalty and Retrospection-Allocation [124.9008419182485]
We present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy.
Our approach begins with an interesting observation that, most hallucinations are closely tied to the knowledge aggregation patterns in the self-attention matrix.
Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue.
arXiv Detail & Related papers (2023-11-29T18:57:07Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - AutoHall: Automated Hallucination Dataset Generation for Large Language Models [56.92068213969036]
This paper introduces a method for automatically constructing model-specific hallucination datasets based on existing fact-checking datasets called AutoHall.
We also propose a zero-resource and black-box hallucination detection method based on self-contradiction.
arXiv Detail & Related papers (2023-09-30T05:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.