Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing
- URL: http://arxiv.org/abs/2505.21547v1
- Date: Sat, 24 May 2025 22:36:15 GMT
- Title: Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing
- Authors: Weixing Wang, Zifeng Ding, Jindong Gu, Rui Cao, Christoph Meinel, Gerard de Melo, Haojin Yang
- Abstract summary: Large Vision-Language Models (LVLMs) unify multimodal representations by encoding visual inputs into a finite set of tokens. We find that these models still hallucinate non-existent objects. We propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation.
- Score: 39.969451863788464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at https://github.com/weixingW/CGC-VTD/tree/main
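As a rough illustration of the pipeline the abstract describes, the sketch below counts which discrete image tokens co-occur inside the same segmented regions and then edits the latent image embeddings to suppress visually absent tokens from dominant clusters. The function names, the projection-based editing rule, and the scaling factor `alpha` are assumptions for illustration; the paper clusters the co-occurrence graph with a GNN trained by contrastive learning, which is not reproduced here.

```python
# Illustrative sketch only -- not the authors' released code. Assumes a discrete
# image tokenizer (e.g., a VQ codebook) and segmentation masks are available.
import itertools
from collections import defaultdict

import torch


def build_cooccurrence(token_grids, seg_masks):
    """Count how often two codebook ids fall inside the same segmented region.

    token_grids: list of LongTensor [H, W] holding discrete image-token ids.
    seg_masks:   list of LongTensor [H, W] holding region ids from a segmentation dataset.
    """
    counts = defaultdict(int)
    for tokens, mask in zip(token_grids, seg_masks):
        for region in mask.unique().tolist():
            ids = tokens[mask == region].unique().tolist()
            for a, b in itertools.combinations(sorted(ids), 2):
                counts[(a, b)] += 1  # edge weight in the co-occurrence graph
    return counts


def edit_latent_embeddings(img_embeds, codebook, present_ids, cluster_of, alpha=0.5):
    """One plausible form of latent editing (an assumption, not the paper's exact rule):
    subtract from each input image embedding part of its projection onto the mean
    direction of visually absent tokens that share a cluster with the present tokens.

    img_embeds:  FloatTensor [N, D] image-token embeddings fed to the LLM.
    codebook:    FloatTensor [V, D] embedding table of the discrete tokenizer.
    present_ids: set of token ids actually produced for this image.
    cluster_of:  dict token id -> cluster id (any clustering of the co-occurrence graph).
    """
    dominant = {cluster_of[t] for t in present_ids if t in cluster_of}
    absent = [t for t, c in cluster_of.items() if c in dominant and t not in present_ids]
    if not absent:
        return img_embeds
    direction = codebook[absent].mean(dim=0)
    direction = direction / direction.norm().clamp_min(1e-8)
    proj = (img_embeds @ direction).unsqueeze(-1) * direction  # component along "absent" direction
    return img_embeds - alpha * proj
```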
Related papers
- SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision [59.61988843996952]
Style-Aware Visual Early Revision (SAVER) is a novel mechanism that dynamically adjusts LVLMs' final outputs based on the token-level visual attention patterns. We show that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.
arXiv Detail & Related papers (2025-08-05T07:41:25Z) - Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding [33.33247964758369]
We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. We present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens.
arXiv Detail & Related papers (2025-05-22T13:19:57Z) - TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection [6.006482486396196]
We propose Temporal Attention Real-time Accumulative Connection (TARAC) to mitigate hallucinations caused by the decay of attention on image tokens. We validate TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations.
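A minimal sketch of what temporal accumulation of attention on image tokens could look like during decoding; the blending rule and the `decay`/`weight` hyperparameters are assumptions, not TARAC's published update.

```python
# Illustrative sketch: accumulate attention mass on image tokens across decoding
# steps and blend it back into the current step. The update rule and the
# decay/weight hyperparameters are assumptions, not TARAC's published formula.
import torch


def accumulate_and_reapply(attn_row, image_slice, state, decay=0.9, weight=0.1):
    """attn_row:    FloatTensor [num_heads, seq_len] attention of the newest query token.
    image_slice: slice selecting the image-token positions within seq_len.
    state:       running accumulator [num_heads, num_image_tokens] or None.
    Returns the adjusted attention row and the updated accumulator.
    """
    img_attn = attn_row[:, image_slice]
    state = img_attn if state is None else decay * state + img_attn
    adjusted = attn_row.clone()
    adjusted[:, image_slice] = adjusted[:, image_slice] + weight * state
    # Re-normalize so each head's attention still sums to one.
    adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
    return adjusted, state
```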
arXiv Detail & Related papers (2025-04-05T07:57:11Z) - Hallucinatory Image Tokens: A Training-free EAZY Approach on Detecting and Mitigating Object Hallucinations in LVLMs [15.479587108655393]
Large Vision-Language Models (LVLMs) still face challenges with object hallucination. Our work shifts the focus to the image input source, investigating how specific image tokens contribute to hallucinations. We introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens.
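A hedged sketch of the general "zero out suspect image tokens" idea: attribute a hallucinated word to individual image tokens via leave-one-out scoring, then zero the most influential ones. The `score_fn` callable and the top-k selection are illustrative assumptions rather than EAZY's actual detection procedure.

```python
# Illustrative sketch: attribute a hallucinated word to individual image tokens
# by zeroing one token embedding at a time, then drop the most influential ones.
# `score_fn` stands in for a forward pass that returns the hallucinated word's
# log-probability given (possibly edited) image embeddings -- an assumption, not
# EAZY's actual detection rule.
import torch


def zero_out_hallucinatory_tokens(img_embeds, score_fn, top_k=3):
    """img_embeds: FloatTensor [N, D]; score_fn: callable mapping [N, D] -> float."""
    base = score_fn(img_embeds)
    influence = torch.zeros(img_embeds.shape[0])
    for i in range(img_embeds.shape[0]):
        edited = img_embeds.clone()
        edited[i] = 0.0
        influence[i] = base - score_fn(edited)  # drop in score when token i is removed
    drop = influence.topk(min(top_k, len(influence))).indices
    cleaned = img_embeds.clone()
    cleaned[drop] = 0.0
    return cleaned, drop
```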
arXiv Detail & Related papers (2025-03-10T18:53:39Z) - Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models [24.241691571850403]
Large Vision-Language Models (LVLMs) integrate image encoders with Large Language Models (LLMs) to process multi-modal inputs and perform complex visual tasks. They often generate hallucinations by describing non-existent objects or attributes, compromising their reliability. This study analyzes hallucination patterns in image captioning, showing that not all tokens in the generation process are influenced by image input.
arXiv Detail & Related papers (2025-02-24T05:00:52Z) - PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model [0.0]
Hallucinations often arise from the progressive weakening of attention weights to visual tokens. PAINT (Paying Attention to INformed Tokens) is a plug-and-play framework that intervenes in the self-attention mechanism of Large Vision-Language Models.
arXiv Detail & Related papers (2025-01-21T15:22:31Z) - Mitigating Object Hallucination via Concentric Causal Attention [71.27325347912823]
We show that object hallucination is closely tied to Rotary Position Embedding (RoPE), a widely adopted positional dependency modeling design.
We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy.
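One way such a positional alignment might be sketched, assuming the "concentric" idea amounts to re-ordering the visual-token grid in rings so that, under RoPE's long-range decay, centre tokens sit closest to the text that follows; this is an illustrative guess at the construction, not the paper's definition.

```python
# Illustrative sketch: order an H x W grid of visual tokens in concentric rings
# (outermost ring first, centre last). The ring ordering is an assumption about
# the general idea, not the paper's exact construction.
def concentric_order(h, w):
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cells = [(r, c) for r in range(h) for c in range(w)]
    # Chebyshev distance from the centre; larger distance = outer ring = earlier.
    return sorted(cells, key=lambda rc: (-max(abs(rc[0] - cy), abs(rc[1] - cx)), rc[0], rc[1]))


# Example: permute a flattened [H*W, D] feature tensor into concentric order.
# order = concentric_order(24, 24)
# features = features[[r * 24 + c for r, c in order]]
```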
arXiv Detail & Related papers (2024-10-21T11:54:53Z) - KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data.
Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z) - Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
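A small sketch of the general idea, assuming one token is pooled per segmented region instead of per fixed-size patch; the mean pooling and the external segmentation mask are illustrative choices, not the paper's tokenizer.

```python
# Illustrative sketch: pool patch features within each segmented region to get one
# token per "subobject" instead of one per fixed-size patch.
import torch


def subobject_tokens(patch_feats, seg_mask):
    """patch_feats: FloatTensor [H, W, D]; seg_mask: LongTensor [H, W] of region ids."""
    tokens = [patch_feats[seg_mask == region].mean(dim=0)
              for region in seg_mask.unique().tolist()]
    return torch.stack(tokens)  # [num_subobjects, D]
```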
arXiv Detail & Related papers (2024-02-22T06:47:44Z) - OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation [124.9008419182485]
We present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy.
Our approach begins with an interesting observation that, most hallucinations are closely tied to the knowledge aggregation patterns in the self-attention matrix.
Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue.
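A hedged sketch of an over-trust-style penalty: detect a columnar "knowledge aggregation" pattern in the recent self-attention rows and subtract a penalty from the beam score. The window size, the column-product statistic, and the scale are assumptions; OPERA's retrospection-allocation rollback is omitted.

```python
# Illustrative sketch of an "over-trust" style penalty: if the last few generated
# tokens all concentrate attention on one earlier position (a columnar aggregation
# pattern), subtract a penalty from the candidate beam score.
import torch


def over_trust_penalty(attn, window=8, scale=1.0):
    """attn: FloatTensor [num_heads, seq_len, seq_len] causal self-attention maps.
    Returns a scalar penalty to subtract from a beam's cumulative log-probability.
    """
    recent = attn[:, -window:, :]                   # rows for the last few generated tokens
    col_strength = recent.mean(dim=0).prod(dim=0)   # per-column product across the window
    return scale * col_strength.max()


# During beam search, a candidate's score could be adjusted as:
# adjusted_score = log_prob_sum - over_trust_penalty(attention_maps)
```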
arXiv Detail & Related papers (2023-11-29T18:57:07Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches, such as CLIP, and represent both visual and textual inputs with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.