Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction
- URL: http://arxiv.org/abs/2512.18813v1
- Date: Sun, 21 Dec 2025 17:05:42 GMT
- Title: Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction
- Authors: Guangtao Lyu, Xinyi Cheng, Chenghao Xu, Qi Liu, Muli Yang, Fen Fang, Huilin Chen, Jiexi Yan, Xu Yang, Cheng Deng
- Abstract summary: Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs. We devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability.
- Score: 59.801614364841775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.
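To make the SAD finding and the VDC correction above concrete, here is a minimal decoding-step sketch. It is not the paper's implementation: it assumes the attention-derived and FFN-derived contributions to the next-token prediction can be projected into logit space, and the top-k support test, the fallback rule, and the toy random inputs are illustrative assumptions only.

```python
# Minimal VDC-style decoding check (illustrative sketch, not the paper's code).
# Assumption: attn_logits and ffn_logits are the attention- and FFN-derived
# contributions to the next-token prediction, already projected to vocab space.
import numpy as np

def is_supported(token_id, attn_logits, ffn_logits, top_k=10):
    """Treat a token as 'supported' if it ranks in the top-k of either the
    attention-derived (visual perception) or FFN-derived (internal knowledge)
    logit distribution -- a stand-in for the paper's validation test."""
    return (token_id in np.argsort(-attn_logits)[:top_k]
            or token_id in np.argsort(-ffn_logits)[:top_k])

def vdc_step(final_logits, attn_logits, ffn_logits, top_k=10):
    """Pick the next token; if the greedy candidate is unsupported (a
    'subdominant' token in the paper's terminology), fall back to the
    highest-scoring candidate that attention or FFN evidence validates."""
    candidate = int(np.argmax(final_logits))
    if is_supported(candidate, attn_logits, ffn_logits, top_k):
        return candidate
    for token_id in np.argsort(-final_logits):  # search by final-logit score
        if is_supported(int(token_id), attn_logits, ffn_logits, top_k):
            return int(token_id)
    return candidate  # nothing validated; keep the original prediction

# Toy usage with random vectors standing in for real model internals.
rng = np.random.default_rng(0)
vocab = 32
print(vdc_step(rng.normal(size=vocab), rng.normal(size=vocab), rng.normal(size=vocab)))
```

How the paper actually scores attention and FFN support is not specified in this abstract, so the top-k membership test here is only a placeholder for that validation step.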
Related papers
- Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation [51.743225614196774]
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning. They remain vulnerable to hallucination, where generated content deviates from visual evidence. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding. We propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs.
arXiv Detail & Related papers (2026-02-27T14:18:51Z)
- HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing [6.021803204524807]
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal understanding capabilities. LVLMs are prone to object hallucination, where models describe non-existent objects or attribute incorrect factual information. We propose Hallucination Insensitivity Model Editing (HIME), a layer-adaptive weight editing approach that selectively modifies latent features to suppress hallucinations.
arXiv Detail & Related papers (2026-02-21T04:16:17Z)
- Hallucination Begins Where Saliency Drops [18.189047289404325]
Hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token. We introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token. Our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution.
arXiv Detail & Related papers (2026-01-28T05:50:52Z)
- SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models [4.677212795400693]
The Bag-of-Patches behavior of vision encoders under weak structural supervision acts as a contributing factor to object hallucinations. We introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD). By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias (a generic contrastive-decoding sketch follows this list).
arXiv Detail & Related papers (2026-01-07T01:27:58Z)
- FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering [14.550872089352943]
FaithSCAN is a lightweight network that detects hallucinations by exploiting rich internal signals of vision-language models. We extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding.
arXiv Detail & Related papers (2026-01-01T09:19:39Z)
- SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge. We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z)
- IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models [20.036659182106806]
We show that Large Vision-Language Models (LVLMs) exhibit a long-term bias where hallucinations increase as the sequence length grows. We propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy that generates more image-focused sequences.
arXiv Detail & Related papers (2025-08-05T14:05:15Z)
- Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [65.4610281589017]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
- The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering [42.09744951074433]
We investigate the internal dynamics of hallucination by examining the ranking of token logits throughout the generation process. We propose VISTA, a training-free inference-time intervention framework that reduces hallucination while promoting genuine information.
arXiv Detail & Related papers (2025-02-05T21:34:02Z)
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z)
- Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models [30.26685485474035]
Large Vision-Language Models (LVLMs) have rapidly advanced in recent years. The prevalent issue known as the 'hallucination' problem has emerged as a significant bottleneck. We propose a simple yet effective method named Self-Introspective Decoding (SID).
arXiv Detail & Related papers (2024-08-04T13:50:17Z)
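As noted in the SDCD entry above, several of the listed methods are decoding-time interventions. The sketch below shows only the generic contrastive-decoding combination commonly used in such work; the structure-disrupted view itself and the exact penalty SDCD uses are not described in the abstract above, so this is an assumption-laden illustration rather than SDCD's method.

```python
# Generic contrastive-decoding sketch: tokens that stay confident under a
# perturbed, structure-less view of the image are penalized relative to the
# original view. (Illustrative only; not taken from the SDCD paper.)
import numpy as np

def contrastive_logits(logits_original, logits_perturbed, alpha=1.0):
    """Common contrastive combination: amplify the original logits and subtract
    those from the perturbed input, so texture-driven tokens lose their edge."""
    return (1.0 + alpha) * logits_original - alpha * logits_perturbed

# Toy usage with random logits standing in for the two forward passes.
rng = np.random.default_rng(1)
vocab = 32
orig, pert = rng.normal(size=vocab), rng.normal(size=vocab)
print(int(np.argmax(contrastive_logits(orig, pert))))
```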