VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
- URL: http://arxiv.org/abs/2505.01481v4
- Date: Sun, 26 Oct 2025 04:54:22 GMT
- Authors: Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Fuxiao Liu, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber
- Abstract summary: Real visual understanding is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data. We propose negative-control tests using videos that depict physically impossible or logically inconsistent events.
- Score: 70.00000053847738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations between vision and language? Real visual understanding, especially of physics and common sense, is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data, so high benchmark scores may not reflect real reasoning ability. To address this, we propose negative-control tests using videos that depict physically impossible or logically inconsistent events. We introduce VideoHallu, a synthetic dataset of physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling. It includes expert-annotated question-answer pairs across four categories of violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1) show that, despite strong results on benchmarks such as MVBench and MMVU, they often miss these violations, exposing gaps in visual reasoning. Reinforcement learning fine-tuning on VideoHallu improves recognition of such violations without reducing standard benchmark performance. Our data is available at https://github.com/zli12321/VideoHallu.git.
Related papers
- UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics [19.634532810889507]
This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. The dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second.
arXiv Detail & Related papers (2026-02-24T17:33:12Z) - SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding [30.820850789099932]
We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
arXiv Detail & Related papers (2025-12-04T10:17:20Z) - Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding [35.20942192333083]
We introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. Instead of describing the whole video at once, each loop requires the model to describe a video segment with precise timestamps. To reduce the risk of hallucinations, the Factual-Aware Evaluator scores each perception result, providing a reliable anti-hallucination reward.
arXiv Detail & Related papers (2025-11-23T14:14:14Z) - GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? [76.67205289006795]
GLIMPSE consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context. Human evaluators achieve 94.82% accuracy on GLIMPSE, but current LVLMs face significant challenges.
arXiv Detail & Related papers (2025-07-13T04:44:57Z) - Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? [56.06537213958482]
We present Video-Holmes, a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information.
arXiv Detail & Related papers (2025-05-27T16:05:01Z) - RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video [19.373906873461703]
RTV-Bench is a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs.
arXiv Detail & Related papers (2025-05-04T10:55:21Z) - MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z) - VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation [66.58048825989239]
VideoPhy-2 is an action-centric dataset for evaluating physical commonsense in generated videos. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance.
arXiv Detail & Related papers (2025-03-09T22:49:12Z) - VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding [1.1834200163382398]
We introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. We propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference.
arXiv Detail & Related papers (2024-12-04T22:03:19Z) - VidHal: Benchmarking Temporal Hallucinations in Vision LLMs [9.392258475822915]
Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. We introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in temporal dynamics. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video.
arXiv Detail & Related papers (2024-11-25T06:17:23Z) - ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models [13.04745908368858]
We introduce ViBe: a large-scale dataset of hallucinated videos from open-source T2V models. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings.
arXiv Detail & Related papers (2024-11-16T19:23:12Z) - VideoQA in the Era of LLMs: An Empirical Study [108.37456450182054]
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-intuitive tasks. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments.
arXiv Detail & Related papers (2024-08-08T05:14:07Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs).
VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z) - MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z) - Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z) - FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs [57.59518049930211]
We propose the first adversarial attack tailored for video-based large language models (LLMs).
Our attack can effectively induce video-based LLMs to generate incorrect answers when videos are added with imperceptible adversarial perturbations.
Our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate.
arXiv Detail & Related papers (2024-03-20T11:05:07Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)