NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
- URL: http://arxiv.org/abs/2511.06475v1
- Date: Sun, 09 Nov 2025 17:41:11 GMT
- Title: NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
- Authors: Kyuho Lee, Euntae Kim, Jinwoo Choi, Buru Chang
- Abstract summary: Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context.
- Score: 8.6767620170781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks - Existence, Temporal, and Narrative - yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies under sampling with fewer frames, amplifying errors when event continuity is weak. We establish NOAH as the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.
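The abstract describes building composite videos by inserting clips from other sources into target videos while controlling insertion position. A minimal sketch of that idea, assuming frames are represented as a simple sequence (the function name, fractional-position parameter, and returned ground-truth span are illustrative, not NOAH's actual API):

```python
def make_composite(target_frames, distractor_frames, position):
    """Insert distractor_frames into target_frames at a controlled point.

    position is a fraction in [0, 1]: 0.0 = start, 0.5 = middle, 1.0 = end.
    Returns the composite sequence and the inserted span, which a benchmark
    could use as ground truth when checking for hallucinated or omitted events.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    cut = round(position * len(target_frames))
    composite = target_frames[:cut] + distractor_frames + target_frames[cut:]
    span = (cut, cut + len(distractor_frames))  # where the inserted event lies
    return composite, span

# Example: a 6-frame target with a 2-frame distractor inserted mid-video.
target = [f"t{i}" for i in range(6)]
distractor = ["d0", "d1"]
composite, span = make_composite(target, distractor, position=0.5)
```

Varying `position` (and, separately, how semantically similar the distractor is to the target) would yield the controlled conditions the benchmark evaluates.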
Related papers
- VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models [8.155587933125673]
Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos. We introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu.
arXiv Detail & Related papers (2026-01-15T02:40:41Z) - Codified Foreshadowing-Payoff Text Generation [67.01182739162142]
Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. We introduce Codified Foreshadowing-Payoff Generation, a novel framework that reframes narrative quality through the lens of payoff realization.
arXiv Detail & Related papers (2026-01-11T19:05:37Z) - Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - ARGUS: Hallucination and Omission Evaluation in Video-LLMs [86.73977434293973]
ARGUS is a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies dual metrics.
arXiv Detail & Related papers (2025-06-09T02:42:13Z) - On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. We conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z) - EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing the VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z) - NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative [19.79736018383692]
Existing video captioning benchmarks and models lack causal-temporal narrative. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. We propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics.
arXiv Detail & Related papers (2024-06-10T17:34:24Z) - Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z) - Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z) - Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z) - Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.