DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
- URL: http://arxiv.org/abs/2512.03992v1
- Date: Wed, 03 Dec 2025 17:22:29 GMT
- Title: DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
- Authors: Zexin Lin, Hawen Wan, Yebin Zhong, Xiaoqiang,
- Abstract summary: We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth.
- Score: 0.7874708385247353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
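The evaluation loop described in the abstract can be pictured as: corrupt one frame in a sequence with a physics-inspired degradation, run multi-turn question answering, filter pseudo-ground-truth by uncertainty (the UIR idea), and score how long an induced error persists. The sketch below illustrates that flow under assumed interfaces; `corrupt_frame`, `TurnResult`, the confidence threshold, and the persistence metric are hypothetical stand-ins for illustration, not the authors' released implementation.

```python
"""Minimal sketch of a DIQ-H-style evaluation loop (assumed interfaces)."""
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class TurnResult:
    answer: str
    confidence: float  # estimated confidence used for uncertainty filtering


def corrupt_frame(frame: np.ndarray, kind: str, severity: float) -> np.ndarray:
    """Toy physics-inspired corruptions standing in for the benchmark's models."""
    if kind == "motion_blur":
        k = max(1, int(5 * severity))
        kernel = np.ones(k) / k
        return np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, frame
        )
    if kind == "sensor_noise":
        return frame + np.random.normal(0.0, 25.0 * severity, frame.shape)
    if kind == "compression":
        q = max(1, int(64 * severity))
        return (frame // q) * q  # crude quantization as a stand-in for JPEG artifacts
    return frame


def filter_pseudo_labels(turns: List[TurnResult], tau: float = 0.8) -> List[TurnResult]:
    """Uncertainty-guided filtering: keep only high-confidence pseudo-ground-truth."""
    return [t for t in turns if t.confidence >= tau]


def hallucination_persistence(answers: List[str], ground_truth: str,
                              corruption_turn: int) -> float:
    """Fraction of turns after the corrupted frame whose answers remain wrong."""
    post = answers[corruption_turn + 1:]
    if not post:
        return 0.0
    return sum(a != ground_truth for a in post) / len(post)


if __name__ == "__main__":
    # Tiny synthetic run: a 5-frame "video" with one corrupted frame.
    frames = [np.full((8, 8), 128.0) for _ in range(5)]
    frames[1] = corrupt_frame(frames[1], "sensor_noise", severity=0.9)
    print("corrupted-frame std:", float(frames[1].std()))

    # Pretend VLM answers: the corruption at turn 1 induces an error that persists briefly.
    answers = ["red car", "blue car", "blue car", "red car", "red car"]
    print("persistence:", hallucination_persistence(answers, "red car", corruption_turn=1))

    # Pseudo-label filtering in the UIR spirit: drop low-confidence annotations.
    turns = [TurnResult("red car", 0.95), TurnResult("blue car", 0.40)]
    print("kept pseudo-labels:", [t.answer for t in filter_pseudo_labels(turns)])
```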
Related papers
- Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets. (A toy sketch of the embedding-drift measurement appears after this list.)
arXiv Detail & Related papers (2026-02-06T12:24:26Z) - All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations [102.94052335735326]
All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model. Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes. We introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time.
arXiv Detail & Related papers (2026-01-02T02:20:57Z) - Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models [0.0]
Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention.
arXiv Detail & Related papers (2025-12-08T13:58:46Z) - On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations [52.1029745126386]
In vision-language-action (VLA) models, robustness to real-world perturbations is critical for deployment. We propose RobustVLA to defend against perturbations in VLA inputs and outputs. Experiments on LIBERO demonstrate that RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone.
arXiv Detail & Related papers (2025-09-26T14:42:23Z) - DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models [45.126261544696185]
Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to perturbations poses a significant threat to their reliability in real-world applications. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs.
arXiv Detail & Related papers (2025-06-04T13:26:33Z) - Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [77.96693360763925]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features: (1) knowledge required: demanding integration of external knowledge beyond the video's explicit narrative; (2) short-form definitive answers: answers are crafted as unambiguous and definitively correct in a short format with minimal scoring variance.
arXiv Detail & Related papers (2025-03-24T17:46:09Z) - Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones. Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency. We present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
arXiv Detail & Related papers (2025-03-19T03:41:56Z) - Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study [44.170933007736984]
Vision-Language Models (VLMs) are powerful yet computationally intensive for widespread practical deployments. Current acceleration evaluations primarily target minimal overall performance degradation, overlooking a crucial question: does the accelerated model still give the same answers to the same questions as it did before acceleration? This is vital for stability-centered industrial applications where consistently correct answers for specific, known situations are paramount, such as in AI-based disease diagnosis.
arXiv Detail & Related papers (2025-03-09T22:16:48Z) - Temporal Inconsistency Guidance for Super-resolution Video Quality Assessment [63.811519474030234]
We propose a perception-oriented approach to quantify frame-wise temporal inconsistency. Inspired by the human visual system, we develop an Inconsistency Guided Temporal Module. Our method significantly outperforms state-of-the-art VQA approaches.
arXiv Detail & Related papers (2024-12-25T15:43:41Z) - DifFIQA: Face Image Quality Assessment Using Denoising Diffusion
Probabilistic Models [1.217503190366097]
Face image quality assessment (FIQA) techniques aim to mitigate the performance degradations that face recognition systems suffer on low-quality images. We present a powerful new FIQA approach, named DifFIQA, which relies on denoising diffusion probabilistic models (DDPMs). Because the diffusion-based perturbations are computationally expensive, we also distill the knowledge encoded in DifFIQA into a regression-based quality predictor, called DifFIQA(R).
arXiv Detail & Related papers (2023-05-09T21:03:13Z) - Intrinsic Temporal Regularization for High-resolution Human Video
Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z)