SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse
- URL: http://arxiv.org/abs/2512.18671v1
- Date: Sun, 21 Dec 2025 10:25:02 GMT
- Title: SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse
- Authors: Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, Min Yang
- Abstract summary: We propose SmartSight, a pioneering step toward addressing hallucinations in Video Large Language Models. SmartSight generates multiple candidate responses to uncover low-hallucination outputs that are often obscured by standard greedy decoding. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%.
- Score: 22.663181163109176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.
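The selection procedure described in the abstract, sampling several candidate responses and keeping the one whose attention collapses least onto trivial temporal regions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entropy-based collapse proxy, the function names, and the toy attention profiles are all assumptions for demonstration.

```python
import math

def temporal_attention_collapse(frame_attention):
    """Illustrative TAC proxy: 1 minus the normalized entropy of the
    attention mass a response places on each video frame. 0 means the
    attention is spread uniformly; 1 means it collapses onto one frame."""
    total = sum(frame_attention)
    probs = [a / total for a in frame_attention]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(frame_attention))

def select_least_hallucinated(candidates):
    """Keep the sampled response whose attention collapses the least.
    candidates: list of (response_text, per_frame_attention)."""
    return min(candidates, key=lambda c: temporal_attention_collapse(c[1]))[0]

# Toy example: response A fixates on a single frame, B attends broadly.
candidates = [
    ("A: a cat sits still", [0.95, 0.02, 0.02, 0.01]),
    ("B: a dog fetches, then drops the ball", [0.30, 0.25, 0.25, 0.20]),
]
print(select_least_hallucinated(candidates))  # selects response B
```

The actual method scores real cross-attention maps inside the model and adds early termination at the Visual Attention Vanishing point; this sketch only shows the candidate-scoring idea.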
Related papers
- Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding [23.767895980891264]
We propose a decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the temporal consistency and semantic associations of video features. Our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
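The generic recipe shared by the contrastive-decoding papers in this list can be sketched as below: run the model once on the clean video and once on a deliberately corrupted ("negative") version, then penalize tokens the corrupted pass also supports, since those likely come from language priors rather than visual evidence. The exact negative construction and the weight `alpha` here are illustrative assumptions, not any one paper's formulation.

```python
def contrastive_logits(clean_logits, negative_logits, alpha=1.0):
    """Generic contrastive decoding: amplify what only the clean video
    supports and penalize what the corrupted video also supports."""
    return [(1 + alpha) * c - alpha * n
            for c, n in zip(clean_logits, negative_logits)]

# Toy vocabulary ["cat", "dog", "car"]: the corrupted-video pass still
# scores "car" highly, suggesting a language prior rather than evidence.
clean = [2.0, 1.0, 1.5]
negative = [0.5, 0.5, 2.5]
print(contrastive_logits(clean, negative))  # [3.5, 1.5, 0.5]
```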
arXiv Detail & Related papers (2026-01-30T05:16:12Z)
- SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding [30.820850789099932]
We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
arXiv Detail & Related papers (2025-12-04T10:17:20Z)
- PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning [87.35309934860938]
Hallucinations in multi-modal large language models (MLLMs) are strongly associated with insufficient attention allocated to visual tokens. We propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information.
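One plausible reading of attention-guided KV cache pruning is sketched below: keep only the visual cache entries that receive the most attention so that later decoding steps concentrate on the most informative visual tokens. This is a hypothetical simplification; the entry layout, the per-entry attention scores, and the keep-all-text policy are assumptions, not PruneHal's actual algorithm.

```python
def prune_visual_kv(kv_entries, attention, keep_ratio=0.5):
    """Hypothetical sketch of attention-guided KV cache pruning.
    kv_entries: list of (token_id, is_visual); attention: one score
    per entry. Drops the least-attended visual entries, keeps text."""
    visual_idx = [i for i, (_, is_vis) in enumerate(kv_entries) if is_vis]
    k = max(1, int(len(visual_idx) * keep_ratio))
    kept = set(sorted(visual_idx, key=lambda i: attention[i],
                      reverse=True)[:k])
    return [entry for i, entry in enumerate(kv_entries)
            if not entry[1] or i in kept]

cache = [("t1", False), ("v1", True), ("v2", True),
         ("v3", True), ("v4", True), ("t2", False)]
attn = [0.0, 0.9, 0.1, 0.8, 0.2, 0.0]
print(prune_visual_kv(cache, attn))  # keeps t1, v1, v3, t2
```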
arXiv Detail & Related papers (2025-10-22T02:41:07Z)
- Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding [103.74753205276336]
We propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Agent detects hallucinations by applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning.
arXiv Detail & Related papers (2025-09-15T12:39:19Z)
- ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding [61.526407756322264]
We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to semantic aggregation hallucinations (SAH) on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
arXiv Detail & Related papers (2025-08-29T10:25:03Z)
- Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering [83.63437999696954]
Hallucination in multimodal large language models (MLLMs) persists as a significant and under-addressed challenge in the video domain. We propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules.
arXiv Detail & Related papers (2025-05-19T08:12:06Z)
- Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation [49.885797244626694]
Hallucination causes large multimodal models (LMMs) to provide responses that appear correct but are actually incorrect. This paper studies the hallucination problem of LMMs in the video modality, which is dynamic and more challenging than static modalities like images and text.
arXiv Detail & Related papers (2025-03-25T13:12:17Z)
- EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z)
- VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.