Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
- URL: http://arxiv.org/abs/2511.18463v2
- Date: Tue, 25 Nov 2025 11:57:42 GMT
- Title: Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
- Authors: Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Jiankang Wang, Hongtao Xie
- Abstract summary: We introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. Instead of describing the video at once, each loop requires the model to describe a video segment with precise timestamps. To address the risk of hallucinations, the Factual-Aware Evaluator evaluates each perception result as a reliable anti-hallucination reward.
- Score: 35.20942192333083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. This paradigm describes the video once and then conducts reasoning, which runs the risk of insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address the insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, to mitigate the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves state-of-the-art performance at both the 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released on: https://github.com/BoweiPu/VideoPLR.
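The Perception Loop Reasoning paradigm described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's released implementation: the `LoopStep` fields, the `model` callable, and the `"continue"`/`"answer"` action labels are all assumed names for the describe-analyze-decide cycle the abstract outlines.

```python
from dataclasses import dataclass

@dataclass
class LoopStep:
    start: float          # segment start time in seconds (precise timestamp)
    end: float            # segment end time in seconds
    description: str      # perception: what the model reports seeing in the segment
    analysis: str         # reasoning over the evidence described so far
    action: str           # "continue" to inspect another segment, or "answer"

def perception_loop(model, video, question, max_loops=8):
    """Alternate perception and reasoning until the model decides to answer.

    `model` is a hypothetical callable that, given the video, the question,
    and the evidence gathered so far, returns the next LoopStep.
    """
    evidence = []
    for _ in range(max_loops):
        step = model(video, question, evidence)
        evidence.append(step)          # timestamped evidence accumulates across loops
        if step.action == "answer":    # model judges the evidence sufficient
            break
    return evidence
```

In the full method, each step's `description` would additionally be scored by the Factual-Aware Evaluator, and that score used as an anti-hallucination reward during training; the sketch above only captures the control flow of the loop.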
Related papers
- Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models [44.84227796501077]
We introduce OmniVCHall, a benchmark designed to evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%.
arXiv Detail & Related papers (2026-01-31T06:50:43Z) - SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding [30.820850789099932]
We propose a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks.
arXiv Detail & Related papers (2025-12-04T10:17:20Z) - MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models [56.49314029765706]
We introduce MESH, a benchmark designed to systematically evaluate hallucinations in LVMs. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos.
arXiv Detail & Related papers (2025-09-10T12:34:07Z) - ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding [61.526407756322264]
We introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination. We find that models are more prone to semantic aggregation hallucinations (SAH) on rapidly changing semantics. We also achieve improvements on both ELV-Halluc and Video-MME.
arXiv Detail & Related papers (2025-08-29T10:25:03Z) - Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [78.78822033285938]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z) - Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation [49.885797244626694]
Hallucination in large multimodal models (LMMs) produces responses that appear correct but are actually incorrect. This paper studies the hallucination problem of LMMs in the video modality, which is dynamic and more challenging than static modalities like images and text.
arXiv Detail & Related papers (2025-03-25T13:12:17Z) - EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the field of video comprehension. We propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z) - VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models [59.05674402770661]
This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types, intrinsic and extrinsic, offering further subcategories for detailed analysis.
arXiv Detail & Related papers (2024-06-24T06:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.