When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
- URL: http://arxiv.org/abs/2506.05551v1
- Date: Thu, 05 Jun 2025 19:53:19 GMT
- Title: When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
- Authors: Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, Nicu Sebe
- Abstract summary: Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. When confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content. We propose a training-free semantic hallucination mitigation framework comprising two key components.
- Score: 72.15848305976706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM that attend more strongly to scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
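As a rough illustration of the two components described in the abstract, the sketch below shows one way the layer-selection idea could be wired up: score each Transformer layer by how much attention mass it places on the visual tokens covering a candidate text region (a coarse stand-in for ZoomText's output), then blend that layer's hidden states with the final layer's before decoding. This is a minimal numpy sketch under stated assumptions, not the authors' implementation; the function names, the attention-mass heuristic, and the blending weight `alpha` are all illustrative.

```python
import numpy as np


def attention_on_text_region(layer_attn, text_token_mask):
    """Per-layer fraction of attention mass placed on visual tokens
    inside a candidate text region.

    layer_attn:      (num_layers, num_query_tokens, num_key_tokens)
                     attention weights, assumed averaged over heads.
    text_token_mask: (num_key_tokens,) boolean mask marking visual
                     tokens that cover the candidate text region.
    """
    mass_on_text = layer_attn[:, :, text_token_mask].sum(axis=-1)  # (L, Q)
    return mass_on_text.mean(axis=-1)                              # (L,)


def grounded_layer_correction(hidden_states, layer_attn, text_token_mask,
                              final_layer=-1, alpha=0.5):
    """Pick the layer that attends most strongly to the text region and
    blend its hidden states with the final layer's before decoding.

    hidden_states: (num_layers, num_query_tokens, hidden_dim)
    alpha:         illustrative fixed blending weight (assumption).
    """
    focus = attention_on_text_region(layer_attn, text_token_mask)
    grounded_layer = int(np.argmax(focus))
    blended = ((1.0 - alpha) * hidden_states[final_layer]
               + alpha * hidden_states[grounded_layer])
    return blended, grounded_layer


# Toy usage with random tensors standing in for real model outputs.
L, Q, K, D = 32, 8, 256, 64
rng = np.random.default_rng(0)
attn = rng.random((L, Q, K))
attn /= attn.sum(axis=-1, keepdims=True)        # normalize like attention
hidden = rng.standard_normal((L, Q, D))
mask = np.zeros(K, dtype=bool)
mask[100:120] = True                            # tokens over the text crop
state, layer = grounded_layer_correction(hidden, attn, mask)
print("most text-focused layer:", layer, "blended state shape:", state.shape)
```

In the paper's actual method the correction is applied adaptively, so hallucinated outputs on non-semantic samples are corrected while semantically meaningful text is preserved; the fixed `alpha` above is only a placeholder for that adaptive behavior.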
Related papers
- A Survey of Multimodal Hallucination Evaluation and Detection [52.03164192840023]
Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information. These models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-Image (T2I) generation tasks.
arXiv Detail & Related papers (2025-07-25T07:22:42Z) - HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation [2.2006360539727923]
HalluSegBench is the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1,340 counterfactual instance pairs spanning 281 unique object classes. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones.
arXiv Detail & Related papers (2025-06-26T17:59:12Z) - Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images [6.48620624181578]
We introduce SHE (Sequence Hallucination Eradication), a lightweight framework that detects hallucinations and mitigates them. We also propose a new metric (BEACH) to quantify behavioral hallucination severity.
arXiv Detail & Related papers (2025-06-08T15:08:52Z) - HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination". This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z) - Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks. These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images. We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z) - Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis [14.033320167387194]
A major challenge in the real-world application of LVLMs is hallucination, where the model generates non-existent visual elements, eroding user trust. We hypothesize that hidden factors, such as objects, contexts, and semantic foreground-background structures, induce hallucination. By analyzing the causality between images, text prompts, and network saliency, we systematically explore interventions to block these factors.
arXiv Detail & Related papers (2024-12-04T01:23:57Z) - Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? [53.89380284760555]
Large vision-language models (LVLMs) produce captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption.
Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination.
arXiv Detail & Related papers (2024-06-20T16:56:11Z) - AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models [91.78328878860003]
Large vision-language models (LVLMs) are prone to hallucinations.
Existing benchmarks often rely on hand-crafted corner cases whose failure patterns may not generalize well.
We develop AutoHallusion, the first automated benchmark generation approach.
arXiv Detail & Related papers (2024-06-16T11:44:43Z) - Hallucination Augmented Contrastive Learning for Multimodal Large Language Model [53.65682783591723]
Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks.
However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information.
In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning.
arXiv Detail & Related papers (2023-12-12T04:05:15Z)