Related papers: SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision

SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision

URL: http://arxiv.org/abs/2508.03177v1
Date: Tue, 05 Aug 2025 07:41:25 GMT
Title: SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision
Authors: Zhaoxu Li, Chenqi Kong, Yi Yu, Qiangqiang Wu, Xinghao Jiang, Ngai-Man Cheung, Bihan Wen, Alex Kot, Xudong Jiang,
Abstract summary: Style-Aware Visual Early Revision SAVER is a novel mechanism that dynamically adjusts LVLMs' final outputs based on the token-level visual attention patterns.<n>We show that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.
Score: 59.61988843996952
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) recently achieve significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision SAVER, a novel mechanism that dynamically adjusts LVLMs' final outputs based on the token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.

Related papers

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.<n>Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language models benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input.<n>We introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z)
Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.<n>These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.<n>We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding [38.23310445372371]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal task reasoning.<n>They often generate responses that appear plausible yet do not accurately reflect the visual content, a phenomenon known as hallucination.<n>Recent approaches have introduced training-free methods to mitigate hallucinations by adjusting the decoding strategy during the inference stage.<n>We propose a novel hallucination-mitigation method from the visual encoding perspective: textbfVisutextbfal textbfLayer Fustextbfion Contrastive textbfD
arXiv Detail & Related papers (2024-11-24T13:42:02Z)
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models [15.401221354325672]
Hallucinations in large vision models (LVLMs) are a significant challenge, i.e., generating objects that are not presented in the visual input. Recent studies often attribute hallucinations to a lack of understanding of visual input, yet ignore a more fundamental issue: the model's inability to extract or decouple visual features. In this paper, we revisit the hallucinations in LVLMs from an architectural perspective, investigating whether the primary cause lies in the visual encoder (feature extraction) or the modal alignment module (feature decoupling)
arXiv Detail & Related papers (2024-10-09T11:46:32Z)
HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding [36.360171373963716]
Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks. These models still suffer from multimodal hallucination, which means the generation of objects or content that violates the images. We propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD) to address this issue.
arXiv Detail & Related papers (2024-09-30T15:52:05Z)
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization [123.54980913741828]
Large Visual Language Models (LVLMs) have demonstrated exceptional abilities in understanding multimodal data.<n>They invariably suffer from hallucinations, leading to a disconnect between the generated text and the corresponding images.<n>Almost all current visual contrastive decoding methods attempt to mitigate these hallucinations by introducing visual uncertainty information.<n>However, they struggle to precisely induce the hallucinatory tokens, which severely limits their effectiveness in mitigating hallucinations.
arXiv Detail & Related papers (2024-05-24T08:46:31Z)
Visually Dehallucinative Instruction Generation [0.8192907805418583]
This paper presents a novel and scalable method for generating visually dehallucinative instructions, dubbed CAP2QA, that constrains the scope to only image contents. It shows that our proposed method significantly reduces visual hallucination while consistently improving visual recognition ability and expressiveness.
arXiv Detail & Related papers (2024-02-13T10:25:45Z)
Mitigating Hallucination in Visual Language Models with Visual Supervision [33.05550629039951]
Large vision-language models (LVLMs) suffer from hallucination a lot. Key problem lies in its weak ability to comprehend detailed content in a multi-modal context. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs.
arXiv Detail & Related papers (2023-11-27T09:30:02Z)
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training [66.0036211069513]
Large-scale vision-language pre-trained models are prone to hallucinate non-existent visual objects when generating text. We show that models achieving better scores on standard metrics could hallucinate objects more frequently. Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination.
arXiv Detail & Related papers (2022-10-14T10:27:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.