OPERA: Alleviating Hallucination in Multi-Modal Large Language Models
via Over-Trust Penalty and Retrospection-Allocation
- URL: http://arxiv.org/abs/2311.17911v3
- Date: Tue, 12 Mar 2024 05:59:46 GMT
- Title: OPERA: Alleviating Hallucination in Multi-Modal Large Language Models
via Over-Trust Penalty and Retrospection-Allocation
- Authors: Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi
Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
- Abstract summary: We present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy.
Our approach begins with an interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns in the self-attention matrix.
Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue.
- Score: 124.9008419182485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucination, a pervasive challenge for multi-modal large language
models (MLLMs), has significantly impeded their use in real-world applications that demand
precise judgment. Existing methods mitigate this issue either by training on
specially designed data or by inference with external knowledge from other
sources, incurring inevitable additional costs. In this paper, we present
OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a
Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate
the hallucination issue without additional data, knowledge, or training. Our
approach begins with an interesting observation that most hallucinations are
closely tied to the knowledge aggregation patterns manifested in the
self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a
few summary tokens rather than all the previous tokens. Such a partial over-trust
inclination leads the model to neglect image tokens and to describe the image
content with hallucinations. Based on this observation, OPERA introduces a
penalty term on the model logits during the beam-search decoding to mitigate
the over-trust issue, along with a rollback strategy that retrospects the
presence of summary tokens among the previously generated tokens and re-allocates
the token selection if necessary. With extensive experiments, OPERA shows
significant hallucination-mitigating performance on different MLLMs and
metrics, proving its effectiveness and generality. Our code is available at:
https://github.com/shikiw/OPERA.
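To make the decoding mechanism concrete, below is a minimal PyTorch sketch of how a column-wise over-trust penalty and a retrospection trigger might be computed from a local window of self-attention weights. This is not the authors' implementation (see the linked repository for that); the tensor shapes, the column-product scoring, and the scale/threshold hyperparameters are illustrative assumptions.

```python
import torch


def column_scores(attn_window: torch.Tensor) -> torch.Tensor:
    """Per-column knowledge-aggregation scores over a local attention window.

    attn_window: (num_beams, w, w) causal (lower-triangular) self-attention
    weights over the last w generated tokens. A column whose entries stay
    large in every later row marks a candidate "summary token" that the
    model over-trusts instead of looking back at the image tokens.
    """
    w = attn_window.size(-1)
    tril = torch.tril(torch.ones(w, w, device=attn_window.device))
    # Fill the unused upper triangle with 1 so it does not affect the product.
    masked = attn_window * tril + (1.0 - tril)
    # Column-wise product: large only if all later rows attend strongly here.
    return masked.prod(dim=-2)                                    # (num_beams, w)


def over_trust_penalty(attn_window: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Scalar penalty per beam, subtracted from candidate log-probabilities."""
    return scale * column_scores(attn_window).max(dim=-1).values  # (num_beams,)


def retrospection_trigger(attn_window: torch.Tensor, threshold: float = 0.2):
    """Rollback check: returns (trigger_mask, summary_position) per beam.

    If one column dominates the recent window, the beam would be rolled back
    to that position and the token selection re-allocated there. The
    threshold value here is an assumption, not the paper's setting.
    """
    scores = column_scores(attn_window)
    max_vals, max_idx = scores.max(dim=-1)
    return max_vals > threshold, max_idx
```

In beam search, such a penalty would be applied to each beam's candidate scores at every decoding step, and a triggered retrospection would roll the sequence back to the dominant summary-token position and force a different token choice there; the exact scoring function, window size, and thresholds are defined in the paper and the released code.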
Related papers
- Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied to Rotary Position Embedding (RoPE), a widely adopted positional dependency modeling design.
We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
arXiv Detail & Related papers (2024-10-21T11:54:53Z)
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose a novel dynamic correction decoding method for MLLMs (DeCo).
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z)
- Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models [25.386858937068478]
Multimodal Large Language Models (MLLMs) are susceptible to hallucinations, especially the assertive fabrication of content not present in the visual inputs.
We introduce Memory-space Visual Retracing (MemVR), a novel hallucination mitigation paradigm that requires no external knowledge retrieval or additional fine-tuning.
In particular, we treat visual prompts as supplementary evidence to be reinjected into MLLMs via Feed Forward Network (FFN) as key-value memory, when the model is uncertain or even amnesic about question-relevant visual memories.
arXiv Detail & Related papers (2024-10-04T16:30:54Z)
- Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning [24.270713960060142]
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension.
They still suffer from hallucination problems, i.e., generating outputs that are inconsistent with the image content.
We propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of LVLMs.
arXiv Detail & Related papers (2024-08-30T09:40:10Z)
- Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models [30.26685485474035]
Large Vision-Language Models (LVLMs) have rapidly advanced in recent years.
The prevalent issue known as the 'hallucination' problem has emerged as a significant bottleneck.
We propose a simple yet effective method named Self-Introspective Decoding (SID).
arXiv Detail & Related papers (2024-08-04T13:50:17Z)
- Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z)
- MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works.
We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2024-05-29T15:28:42Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs.
arXiv Detail & Related papers (2023-11-22T08:39:17Z)