VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models
- URL: http://arxiv.org/abs/2601.10010v1
- Date: Thu, 15 Jan 2026 02:40:41 GMT
- Title: VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models
- Authors: Zefan Zhang, Kehua Zhu, Shijie Jiang, Hongyuan Lu, Shengkai Sun, Tian Bai
- Abstract summary: Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos. We introduce a novel benchmark for evaluating video event relation hallucination, named VERHallu.
- Score: 8.155587933125673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating video event relation hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure-language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show that it effectively mitigates event relation hallucination without affecting inference speed.
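The abstract names the Key-Frame Propagating (KFP) strategy but does not specify its mechanism. A minimal sketch of the general idea it describes — shifting part of the attention mass concentrated on key-frame tokens onto neighbouring frame tokens inside an attention layer, so surrounding subevents are not overlooked — might look like the following. The function name, the `spread` parameter, and the neighbour-redistribution rule are all illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def propagate_key_frame_attention(attn, key_frames, spread=0.5):
    """Hypothetical sketch of key-frame attention reallocation.

    attn:       (num_queries, num_frames) row-stochastic attention weights.
    key_frames: indices of frames the model attends to as key events.
    spread:     fraction of each key frame's attention mass redistributed
                uniformly over its two temporal neighbours (assumption).
    """
    attn = attn.copy()
    n = attn.shape[1]
    for k in key_frames:
        moved = attn[:, k] * spread          # mass taken from the key frame
        attn[:, k] -= moved
        neighbours = [i for i in (k - 1, k + 1) if 0 <= i < n]
        for i in neighbours:
            attn[:, i] += moved / len(neighbours)
    return attn  # rows still sum to 1: mass is moved, not created
```

Because the rows remain normalized, such a reallocation changes only how frame evidence is weighted, which is consistent with the paper's claim that inference speed is unaffected.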
Related papers
- GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking [45.90413025033315]
Video reasoning requires understanding the causal relationships between events in a video. Existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries. We propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs.
arXiv Detail & Related papers (2026-02-19T17:09:30Z) - SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding [30.820850789099932]
We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
arXiv Detail & Related papers (2025-12-04T10:17:20Z) - NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models [8.6767620170781]
Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context.
arXiv Detail & Related papers (2025-11-09T17:41:11Z) - ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding [61.526407756322264]
We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to SAH on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
arXiv Detail & Related papers (2025-08-29T10:25:03Z) - What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models? [95.46087552542998]
This paper introduces the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark. It aims to generate the most misleading distractors that can trigger hallucination in Large Vision-Language Models. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs.
arXiv Detail & Related papers (2025-08-03T03:11:48Z) - EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z) - Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs [54.50483041708911]
Hallu-PI is the first benchmark designed to evaluate hallucination in MLLMs within Perturbed Inputs.
Hallu-PI consists of seven perturbed scenarios, containing 1,260 perturbed images from 11 object types.
Our research reveals a severe bias in MLLMs' ability to handle different types of hallucinations.
arXiv Detail & Related papers (2024-08-02T16:07:15Z) - Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models [65.32990889402927]
We coin this phenomenon "knowledge overshadowing".
We show that the hallucination rate grows with both the imbalance ratio and the length of dominant condition description.
We propose to utilize overshadowing conditions as a signal to catch hallucination before it is produced.
arXiv Detail & Related papers (2024-07-10T20:37:42Z) - Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models [69.79709804046325]
We introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination.
R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension.
We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object.
arXiv Detail & Related papers (2024-06-24T08:42:42Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.