When Thinking Drifts: Evidential Grounding for Robust Video Reasoning
- URL: http://arxiv.org/abs/2510.06077v1
- Date: Tue, 07 Oct 2025 16:03:33 GMT
- Title: When Thinking Drifts: Evidential Grounding for Robust Video Reasoning
- Authors: Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
- Abstract summary: The Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks. CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues. Visual Evidence Reward (VER) is a reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence.
- Score: 68.75730050161219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video reasoning, the task of enabling machines to infer from dynamic visual content through multi-step logic, is crucial for advanced AI. While the Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, its application to video understanding remains underexplored. This paper presents a systematic analysis revealing that CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues, and leading to hallucinated visual details and overridden correct intuitions - a phenomenon we term "visual thinking drift". We explain this drift through a Bayesian lens, positing that CoT traces often diverge from actual visual evidence, instead amplifying internal biases or language priors, causing models to storytell rather than engage in grounded reasoning. To counteract this, we introduce Visual Evidence Reward (VER), a novel reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence. Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence - for large multimodal models that not only "think before answering", but also "see while thinking".
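As a rough illustration of the kind of reward shaping the abstract describes (not the authors' implementation, which is not specified here), the sketch below combines answer correctness with a grounding score over the reasoning trace. The function names `visual_evidence_reward` and `grounding_score`, the weighting `alpha`, and the lexical-overlap proxy for evidence verification are all assumptions made for illustration; a real verifier would likely use a learned judge or retrieval against video frames.

```python
# Hypothetical sketch of an evidence-grounded reward, in the spirit of VER as
# described in the abstract. All names and the overlap heuristic are
# illustrative assumptions, not the paper's method.

from dataclasses import dataclass
from typing import List


@dataclass
class RolloutSample:
    reasoning_trace: str          # model-generated chain of thought
    predicted_answer: str
    gold_answer: str
    evidence_captions: List[str]  # reference descriptions of what the video shows


def grounding_score(trace: str, evidence: List[str]) -> float:
    """Toy proxy: fraction of evidence phrases echoed in the reasoning trace."""
    if not evidence:
        return 0.0
    hits = sum(1 for phrase in evidence if phrase.lower() in trace.lower())
    return hits / len(evidence)


def visual_evidence_reward(sample: RolloutSample, alpha: float = 0.5) -> float:
    """Blend answer correctness with evidence grounding (both in [0, 1])."""
    correct = float(sample.predicted_answer.strip().lower()
                    == sample.gold_answer.strip().lower())
    return (1.0 - alpha) * correct + alpha * grounding_score(
        sample.reasoning_trace, sample.evidence_captions)
```

Under this shape of reward, a trace that reaches the right answer while ignoring the video earns less than one that both answers correctly and cites the visual evidence, which is the behavior the paper's framework is designed to encourage.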
Related papers
- Process-of-Thought Reasoning for Videos [33.74677144833003]
Process-of-Thought (PoT) Reasoning for Videos is a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence.
arXiv Detail & Related papers (2026-02-07T20:25:46Z) - Rethinking Chain-of-Thought Reasoning for Videos [19.579424881079447]
Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing. Recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. Motivated by empirical observations, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning.
arXiv Detail & Related papers (2025-12-10T13:05:55Z) - Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z) - Video-CoM: Interactive Video Reasoning via Chain of Manipulations [78.64256470920166]
We introduce Interactive Video Reasoning, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models.
arXiv Detail & Related papers (2025-11-28T18:59:57Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models [50.42183477287337]
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning. We introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT). We show that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm.
arXiv Detail & Related papers (2025-07-14T03:21:13Z) - ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z) - Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z) - DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning. We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T13:48:11Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - Visual Abductive Reasoning [85.17040703205608]
Abductive reasoning seeks the likeliest possible explanation for partial observations.
We propose a new task and dataset, Visual Abductive Reasoning (VAR), for examining the abductive reasoning ability of machine intelligence in everyday visual situations.
arXiv Detail & Related papers (2022-03-26T10:17:03Z)