ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
- URL: http://arxiv.org/abs/2508.21496v2
- Date: Tue, 02 Sep 2025 17:14:38 GMT
- Title: ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
- Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu,
- Abstract summary: We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to SAH on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
- Score: 61.526407756322264
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination: producing content inconsistent with or unrelated to the video input. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they oversimplify the overall picture. Sometimes, models generate incorrect outputs even though the frame-level semantics are correct. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to the increased semantic complexity across multiple events, it is essential to isolate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
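The DPO strategy mentioned in the abstract trains the model to prefer a faithful response over an adversarial one from each data pair. The paper does not publish its training code here, but the standard DPO objective it builds on can be sketched as follows; the function name and scalar log-probability interface are illustrative assumptions, not the authors' implementation:

```python
import math

def dpo_loss(pi_chosen_lp, pi_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO objective (a sketch, not the paper's code).

    Each argument is the summed log-probability of a response: the
    faithful ("chosen") vs. adversarial ("rejected") caption, under
    the trainable policy (pi_*) and a frozen reference model (ref_*).
    """
    # Log-ratio of policy vs. reference on each response.
    chosen = pi_chosen_lp - ref_chosen_lp
    rejected = pi_rejected_lp - ref_rejected_lp
    # Implicit reward margin; beta controls deviation from the reference.
    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)): small when the policy clearly prefers
    # the faithful response over the adversarial one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A larger preference margin for the faithful response yields a smaller loss, which is what pushes the model to distinguish semantics within and across events when trained on the 8K adversarial pairs.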
Related papers
- Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding [23.767895980891264]
We propose a decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the temporal consistency and semantic associations of video features. Our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
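Contrastive decoding methods like the one summarized above share a common core: run the model once on the real video features and once on deliberately corrupted ("negative") features, then subtract the two sets of logits so that tokens supported only by the corrupted pass are suppressed. A minimal sketch of that logit arithmetic, assuming plain Python lists of per-token logits (the function name is hypothetical):

```python
import math

def contrastive_decode(pos_logits, neg_logits, alpha=1.0):
    """Sketch of contrastive decoding over next-token logits.

    pos_logits: logits from the unmodified video features.
    neg_logits: logits from corrupted features (e.g. shuffled frames).
    alpha: strength of the contrastive penalty.
    """
    # Tokens favored by the corrupted pass are pushed down.
    adjusted = [p - alpha * n for p, n in zip(pos_logits, neg_logits)]
    # Numerically stable softmax over the adjusted logits.
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]
```

In practice the corrupted forward pass doubles inference cost, which is the usual trade-off these training-free decoding methods accept in exchange for requiring no fine-tuning.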
arXiv Detail & Related papers (2026-01-30T05:16:12Z) - SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding [30.820850789099932]
We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
arXiv Detail & Related papers (2025-12-04T10:17:20Z) - Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding [103.74753205276336]
We propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination. Dr.V comprises two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Agent detects hallucinations by applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning.
arXiv Detail & Related papers (2025-09-15T12:39:19Z) - MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models [56.49314029765706]
We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a question-answering framework with binary and multi-choice formats incorporating target and trap instances. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos.
arXiv Detail & Related papers (2025-09-10T12:34:07Z) - Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation [49.885797244626694]
Hallucination in large multimodal models (LMMs) refers to producing responses that appear correct but are actually incorrect. This paper aims to study the hallucination problem of LMMs in the video modality, which is dynamic and more challenging compared to static modalities like images and text.
arXiv Detail & Related papers (2025-03-25T13:12:17Z) - Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence [69.86946427928511]
We investigate the internal mechanisms driving hallucination in large vision-language models (LVLMs). We introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. We propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads.
arXiv Detail & Related papers (2024-12-18T15:29:30Z) - VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding [1.1834200163382398]
We introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. We propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference.
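The saliency-reweighting idea behind DINO-HEAL can be illustrated in a few lines: scale each patch feature by its normalized saliency score so that visually salient regions dominate the aggregated representation. This is a minimal sketch under the assumption of list-based features and precomputed per-patch saliency scores, not the paper's actual implementation (which derives saliency from DINOv2 attention maps):

```python
def reweight_features(patch_features, saliency):
    """Hypothetical sketch of saliency-based feature reweighting.

    patch_features: one feature vector (list of floats) per patch.
    saliency: one non-negative saliency score per patch, e.g. taken
              from a self-supervised ViT's attention maps.
    """
    total = sum(saliency)
    # Normalize saliency to a convex combination over patches.
    weights = [s / total for s in saliency]
    # Scale each patch feature by its weight; salient patches
    # contribute more to any subsequent pooling/aggregation.
    return [[w * f for f in feat]
            for w, feat in zip(weights, patch_features)]
```

Because the reweighting happens purely at inference time on existing features, it requires no retraining, which is what "training-free" means in these abstracts.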
arXiv Detail & Related papers (2024-12-04T22:03:19Z) - EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.