SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
- URL: http://arxiv.org/abs/2512.04643v1
- Date: Thu, 04 Dec 2025 10:17:20 GMT
- Title: SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
- Authors: Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
- Abstract summary: We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
- Score: 30.820850789099932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. Therefore, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
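Since the code has not yet been released, the snippet below is only a minimal sketch of the recipe the abstract describes: score each candidate token under the original video and under temporal and spatial negatives, gauge the token's hallucination tendency from how little each negative shifts the next-token distribution, and contrast against the negatives in proportion to that tendency. The KL-based gating, the `alpha_max` and `tau` parameters, and all function names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def season_style_step(logits_orig: torch.Tensor,
                      logits_tneg: torch.Tensor,
                      logits_sneg: torch.Tensor,
                      alpha_max: float = 1.0,
                      tau: float = 1.0) -> torch.Tensor:
    """One decoding step of adaptive contrastive decoding (sketch only).

    logits_orig: next-token logits given the unmodified video        [vocab]
    logits_tneg: logits given a temporal negative (e.g., shuffled frames)
    logits_sneg: logits given a spatial negative (e.g., masked regions)
    """
    p   = F.softmax(logits_orig / tau, dim=-1)
    q_t = F.softmax(logits_tneg / tau, dim=-1)
    q_s = F.softmax(logits_sneg / tau, dim=-1)

    # Self-diagnosis (assumed heuristic): if a negative barely changes the
    # next-token distribution, the token likely leans on language priors
    # rather than visual evidence, so contrast harder against that negative.
    kl_t = torch.sum(p * (p.clamp_min(1e-9).log() - q_t.clamp_min(1e-9).log()))
    kl_s = torch.sum(p * (p.clamp_min(1e-9).log() - q_s.clamp_min(1e-9).log()))
    a_t = alpha_max * torch.exp(-kl_t)  # small shift -> strong temporal contrast
    a_s = alpha_max * torch.exp(-kl_s)  # small shift -> strong spatial contrast

    # Standard contrastive-decoding combination in logit space.
    contrastive = (1 + a_t + a_s) * logits_orig \
                  - a_t * logits_tneg - a_s * logits_sneg
    return contrastive.argmax(dim=-1)
```

Contrastive-decoding methods in the literature also apply an adaptive plausibility cutoff that masks tokens the original distribution already deems unlikely before the subtraction; it is omitted here for brevity.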
Related papers
- Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models [44.84227796501077]
We introduce OmniVCHall, a benchmark designed to evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%.
arXiv Detail & Related papers (2026-01-31T06:50:43Z)
- Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding [23.767895980891264]
We propose a decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features. Our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
arXiv Detail & Related papers (2026-01-30T05:16:12Z)
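To make the idea of "disrupted" negatives concrete, here is a hedged sketch of one way such features could be built: permute a subset of per-frame features so that temporal consistency breaks while per-frame content survives. The `swap_ratio` parameter and the permutation scheme are assumptions for illustration, not the paper's actual procedure.

```python
import torch

def make_temporal_negative(frame_feats: torch.Tensor,
                           swap_ratio: float = 0.5) -> torch.Tensor:
    """Build a temporally disrupted negative from per-frame features.

    frame_feats: [T, D] visual features in their original temporal order.
    Returns a copy in which a random subset of frames is permuted, breaking
    temporal consistency while leaving per-frame content intact.
    """
    T = frame_feats.size(0)
    n_swap = min(T, max(2, int(T * swap_ratio)))  # how many frames to disturb
    idx = torch.randperm(T)[:n_swap]              # which frames
    shuffled = idx[torch.randperm(n_swap)]        # new order among them
    neg = frame_feats.clone()
    neg[idx] = frame_feats[shuffled]
    return neg
```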
- SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse [22.663181163109176]
We propose SmartSight, a pioneering step to address the issue of hallucinations in Video Large Language Models. SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%.
arXiv Detail & Related papers (2025-12-21T10:25:02Z)
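The candidate-sampling idea reduces to a best-of-N skeleton like the one below; `generate` and `score` are hypothetical callables standing in for the VideoLLM's stochastic sampler and SmartSight's attention-based hallucination signal, which the abstract does not spell out.

```python
from typing import Callable

def pick_least_hallucinated(
    generate: Callable[[], str],    # samples one candidate response
    score: Callable[[str], float],  # lower = more grounded in the video
    n_candidates: int = 8,
) -> str:
    """Best-of-N selection: sample several responses, keep the most grounded.

    `generate` wraps the model's sampling call; `score` stands in for a
    hallucination signal (SmartSight derives its own from temporal attention
    patterns, which this skeleton does not reproduce).
    """
    candidates = [generate() for _ in range(n_candidates)]
    return min(candidates, key=score)
```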
- Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding [41.828387997311474]
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. They remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. We propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference.
arXiv Detail & Related papers (2025-10-21T06:11:24Z)
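A minimal token-level ensemble step might look like the sketch below, assuming the models share a tokenizer. The entropy-based weighting is an illustrative stand-in; ATED's actual adaptive weighting is more involved.

```python
import torch
import torch.nn.functional as F

def ensemble_next_token(per_model_logits: list[torch.Tensor],
                        weights: torch.Tensor | None = None) -> torch.Tensor:
    """One token-level ensemble step over several LVLMs (illustrative).

    per_model_logits: one [vocab]-sized logit tensor per model for the same
    decoding step; this sketch assumes a shared vocabulary across models.
    """
    probs = torch.stack([F.softmax(l, dim=-1) for l in per_model_logits])
    if weights is None:
        # Assumed heuristic: weight confident (low-entropy) models higher.
        ent = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # [n_models]
        weights = F.softmax(-ent, dim=0)
    mixed = (weights.unsqueeze(-1) * probs).sum(dim=0)            # [vocab]
    return mixed.argmax(dim=-1)
```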
- Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding [103.74753205276336]
We propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Agent detects hallucinations by applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning.
arXiv Detail & Related papers (2025-09-15T12:39:19Z)
- MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models [56.49314029765706]
We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos.
arXiv Detail & Related papers (2025-09-10T12:34:07Z)
- ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding [61.526407756322264]
We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to semantic aggregation hallucinations (SAH) on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
arXiv Detail & Related papers (2025-08-29T10:25:03Z)
- SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with the input or with established world knowledge. We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z)
- EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucinations toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
- AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models [91.78328878860003]
Large vision-language models (LVLMs) are prone to hallucinations. Existing benchmarks often rely on hand-crafted corner cases whose failure patterns may not generalize well. We develop AutoHallusion, the first automated benchmark generation approach.
arXiv Detail & Related papers (2024-06-16T11:44:43Z)