OPERA: Alleviating Hallucination in Multi-Modal Large Language Models
via Over-Trust Penalty and Retrospection-Allocation
- URL: http://arxiv.org/abs/2311.17911v3
- Date: Tue, 12 Mar 2024 05:59:46 GMT
- Title: OPERA: Alleviating Hallucination in Multi-Modal Large Language Models
via Over-Trust Penalty and Retrospection-Allocation
- Authors: Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi
Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
- Abstract summary: We present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy.
Our approach begins with the observation that most hallucinations are closely tied to the knowledge aggregation patterns in the self-attention matrix.
Based on this observation, OPERA introduces a penalty term on the model logits during beam-search decoding to mitigate the over-trust issue.
- Score: 124.9008419182485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucination, a pervasive challenge for multi-modal large language
models (MLLMs), has significantly impeded their real-world usage in scenarios
that demand precise judgment. Existing methods mitigate this issue either by
training with specifically designed data or by inference with external
knowledge from other sources, both of which incur inevitable additional costs.
In this paper, we present OPERA, a novel MLLM decoding method grounded in an
Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly
free lunch that alleviates the hallucination issue without additional data,
knowledge, or training. Our approach begins with the observation that most
hallucinations are closely tied to knowledge aggregation patterns manifested in
the self-attention matrix: MLLMs tend to generate new tokens by focusing on a
few summary tokens rather than on all previous tokens. This partial over-trust
inclination leads the model to neglect the image tokens and to describe the
image content with hallucinations. Based on this observation, OPERA introduces
a penalty term on the model logits during beam-search decoding to mitigate the
over-trust issue, along with a rollback strategy that retrospects the presence
of summary tokens among the previously generated tokens and re-allocates the
token selection if necessary. In extensive experiments, OPERA shows significant
hallucination-mitigating performance across different MLLMs and metrics,
demonstrating its effectiveness and generality. Our code is available at:
https://github.com/shikiw/OPERA.
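The abstract describes the mechanism only in prose; the following is a minimal sketch of the over-trust penalty under stated assumptions (the window size, scaling factor, and column-product scoring are illustrative simplifications, not the official implementation, which lives in the repository above):

```python
import torch

def over_trust_penalty(attn: torch.Tensor, window: int = 8,
                       scale: float = 50.0, alpha: float = 1.0) -> torch.Tensor:
    """Score the columnar 'knowledge aggregation' pattern in a local window.

    attn: (seq_len, seq_len) causal self-attention weights, averaged over heads.
    The score grows when the most recent tokens concentrate their attention on
    a single earlier 'summary' token, the pattern the abstract ties to
    hallucination. All hyperparameters here are illustrative.
    """
    w = attn[-window:, -window:]
    col_scores = []
    for j in range(window - 1):            # the last column has no later rows
        col = w[j + 1:, j] * scale         # attention later tokens pay to token j
        col_scores.append(col.prod())      # a columnar pattern yields a large product
    return alpha * torch.stack(col_scores).max()

# Beam-search usage (sketch): rank each candidate by its log-probability minus
# the penalty, so beams that over-trust a summary token are demoted:
#   score = log_prob - over_trust_penalty(attn)
# When the penalty keeps firing, the retrospection-allocation step rolls decoding
# back to the summary token and re-allocates selection to a different candidate.
attn = torch.rand(16, 16).tril()
attn = attn / attn.sum(-1, keepdim=True)   # row-normalized, like softmax output
print(over_trust_penalty(attn))
```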
Related papers
- DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer [6.438650382682887]
We introduce DOPRA, a novel approach designed to mitigate hallucinations in multi-modal large language models (MLLMs).
DOPRA employs a strategy of weighted overlay penalties and redistribution in specific layers, such as the 12th layer, during the decoding process.
Overall, DOPRA represents a significant step forward in improving the output quality of MLLMs.
arXiv Detail & Related papers (2024-07-21T11:54:49Z)
- MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification [1.3654846342364308]
We introduce MetaToken, a lightweight binary classifier that detects hallucinations at the token level at negligible cost.
Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works.
We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach (a minimal sketch of the idea follows this entry).
arXiv Detail & Related papers (2024-05-29T15:28:42Z)
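MetaToken's summary names neither the classifier nor its features; below is a minimal sketch assuming a logistic-regression meta-classifier over simple per-token decoding statistics (the feature set and values are hypothetical, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-token features logged during LVLM decoding:
# [token log-probability, entropy of the next-token distribution,
#  attention mass on image tokens].
X_train = np.array([
    [-0.2, 0.5, 0.60],   # grounded token
    [-3.1, 2.4, 0.05],   # hallucinated token
    [-0.4, 0.7, 0.55],   # grounded token
    [-2.8, 2.1, 0.08],   # hallucinated token
])
y_train = np.array([0, 1, 0, 1])  # 1 = hallucinated

clf = LogisticRegression().fit(X_train, y_train)

# At inference time, flag tokens with a high predicted hallucination probability.
X_new = np.array([[-2.5, 1.9, 0.10]])
print(clf.predict_proba(X_new)[:, 1])  # high value -> likely hallucinated
```

Because the classifier sees only a handful of scalar features per token, scoring is effectively free next to the LVLM forward pass, which matches the "negligible cost" claim.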
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference.
Our approach is inspired by two intriguing phenomena we have observed.
Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks (a sketch of the idea follows this entry).
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
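A minimal sketch of the withdrawal idea, assuming visual tokens occupy a known prefix of the sequence and are dropped after a fixed layer (the layer index, function name, and toy layers are illustrative; a real MLLM must also adjust position ids, attention masks, and KV caches):

```python
import torch
import torch.nn as nn

def forward_with_vtw(layers: nn.ModuleList, hidden: torch.Tensor,
                     num_visual: int, withdraw_at: int = 16) -> torch.Tensor:
    """Run transformer layers, withdrawing the visual-token prefix at a set depth.

    hidden: (batch, seq_len, dim) with visual tokens at positions [0, num_visual).
    After `withdraw_at` layers, only text tokens are processed, cutting the
    quadratic attention cost for the remaining depth.
    """
    for i, layer in enumerate(layers):
        if i == withdraw_at:
            hidden = hidden[:, num_visual:, :]   # withdraw visual tokens
        hidden = layer(hidden)
    return hidden

# Toy usage: 24 identity "layers", 32 visual tokens + 8 text tokens.
layers = nn.ModuleList([nn.Identity() for _ in range(24)])
out = forward_with_vtw(layers, torch.randn(1, 40, 64), num_visual=32)
print(out.shape)  # torch.Size([1, 8, 64])
```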
- Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding [25.489832294197797]
This paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference.
Our method is inspired by the observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules (a sketch of the decoding rule follows this entry).
arXiv Detail & Related papers (2024-03-27T16:04:47Z)
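A minimal sketch of the decoding rule, assuming two forward passes per step, one with the standard instruction and one with a disturbance instruction appended (the combination weight and the commented model calls are illustrative):

```python
import torch

def icd_logits(logits_standard: torch.Tensor,
               logits_disturbed: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Contrastive combination of the two next-token distributions.

    Tokens whose probability rises mainly under the disturbance instruction are
    treated as hallucination-prone and down-weighted.
    """
    return (1 + alpha) * logits_standard - alpha * logits_disturbed

# Per decoding step (sketch):
#   l_std  = model(image, instruction)                      # standard prompt
#   l_dist = model(image, instruction + disturbance_text)   # disturbed prompt
#   next_token = icd_logits(l_std, l_dist).argmax(dim=-1)
```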
- Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective [55.41815486466186]
Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, wherein they create content that is not present in the visual inputs.
In this paper, we explore a new angle on this issue: overly detailed training data hinders the model's ability to terminate generation in a timely manner.
We find that the model assesses the completeness of the entire sequence by comparing the generated text with the image.
arXiv Detail & Related papers (2024-02-22T13:33:13Z)
- Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning (a sketch of the contrastive objective follows this entry).
arXiv Detail & Related papers (2023-12-12T04:05:15Z)
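The summary only names the representation-learning angle; below is a minimal sketch assuming an InfoNCE-style objective in which hallucinated captions serve as additional hard negatives for image-text alignment (the function name, temperature, and batch construction are illustrative):

```python
import torch
import torch.nn.functional as F

def hacl_loss(img: torch.Tensor, pos_txt: torch.Tensor,
              hall_txt: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE with hallucinated captions as extra hard negatives.

    img:      (B, D) image embeddings
    pos_txt:  (B, D) embeddings of the matching, grounded captions
    hall_txt: (B, D) embeddings of hallucinated captions for the same images
    """
    img, pos_txt, hall_txt = (F.normalize(x, dim=-1) for x in (img, pos_txt, hall_txt))
    logits_pos = img @ pos_txt.t() / tau                        # (B, B), diag = positives
    logits_neg = (img * hall_txt).sum(-1, keepdim=True) / tau   # (B, 1) hard negatives
    logits = torch.cat([logits_pos, logits_neg], dim=1)
    labels = torch.arange(img.size(0))                          # positive = diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings.
B, D = 4, 16
print(hacl_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)))
```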
- Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus [99.33091772494751]
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields.
However, LLMs are prone to hallucinating untruthful or nonsensical outputs that fail to meet user expectations.
We propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs (a sketch of the scoring idea follows this entry).
arXiv Detail & Related papers (2023-11-22T08:39:17Z)
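A minimal sketch of a reference-free, uncertainty-based score, assuming per-token log-probabilities from the model's own decoding and a per-token weighting that emphasizes informative keywords (this weighting is a stand-in for the paper's "stronger focus" mechanism, which is more involved):

```python
def uncertainty_score(token_logprobs, keyword_weights):
    """Weighted mean negative log-probability over a generated answer.

    token_logprobs:  per-token log-probabilities from the LLM's own decoding
    keyword_weights: per-token weights, larger for informative keywords
                     (illustrative; the paper derives its focus differently)
    Higher score = lower confidence = more likely hallucinated.
    """
    num = sum(-lp * w for lp, w in zip(token_logprobs, keyword_weights))
    return num / sum(keyword_weights)

# Toy usage: the entity token is both unlikely and heavily weighted.
logprobs = [-0.1, -0.2, -3.5, -0.1]   # "The", "capital", "Atlantis", "."
weights  = [0.1, 0.5, 1.0, 0.1]
print(uncertainty_score(logprobs, weights))  # high -> flag as possible hallucination
```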
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
Hallucinations inherent in machine-generated data remain under-explored.
We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm.
Our method successfully mitigates hallucinations by 44.6% in relative terms while maintaining competitive performance compared to LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.