Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
- URL: http://arxiv.org/abs/2412.11124v2
- Date: Sat, 21 Dec 2024 05:26:14 GMT
- Title: Combating Multimodal LLM Hallucination via Bottom-Up Holistic Reasoning
- Authors: Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, Tat-Seng Chua
- Abstract summary: Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks.
This paper introduces a novel bottom-up reasoning framework to address hallucinations in MLLMs.
Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge.
- Score: 151.4060202671114
- Abstract: Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations, i.e., misleading outputs that do not align with the input data. While efforts have been made to combat MLLM hallucinations, several pivotal challenges remain unsolved. First, while current approaches focus heavily on addressing errors at the perception level, another important type of error at the cognition level, which requires factual commonsense, can be overlooked. In addition, existing methods may fall short in finding a more effective way to represent visual input, which remains a key bottleneck that triggers visual hallucinations. Moreover, MLLMs are frequently misled by faulty textual inputs, which causes hallucinations, yet this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements on multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.
Related papers
- CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z)
- A Survey of Hallucination in Large Visual Language Models [48.794850395309076]
The existence of hallucinations has limited the potential and practical effectiveness of LVLMs in various fields.
The structure of LVLMs and main causes of hallucination generation are introduced.
The available hallucination evaluation benchmarks for LVLMs are presented.
arXiv Detail & Related papers (2024-10-20T10:58:58Z)
- From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models [15.401221354325672]
Hallucinations in large vision-language models (LVLMs) are a significant challenge, i.e., generating objects that are not present in the visual input.
Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to extract or decouple visual features.
In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling).
arXiv Detail & Related papers (2024-10-09T11:46:32Z)
- MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier to detect hallucinations at the token level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works.
We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z)
- Hallucination of Multimodal Large Language Models: A Survey [40.73148186369018]
Multimodal large language models (MLLMs) have demonstrated significant advancements and remarkable abilities in multimodal tasks.
Despite these promising developments, MLLMs often generate outputs that are inconsistent with the visual content.
This survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.
arXiv Detail & Related papers (2024-04-29T17:59:41Z)
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions [40.79317187623401]
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP).
LLMs are prone to hallucination, generating plausible yet nonfactual content.
This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval systems.
arXiv Detail & Related papers (2023-11-09T09:25:37Z)
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z)
- Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)