Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
- URL: http://arxiv.org/abs/2511.23478v1
- Date: Fri, 28 Nov 2025 18:59:58 GMT
- Title: Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
- Authors: Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan
- Abstract summary: Reasoning over dynamic visual content remains a central challenge for multimodal large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
- Score: 56.851611990473174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open-sourced.
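The abstract introduces TAC, VAS, and the Temporal Alignment Reward only at a conceptual level, so the sketch below is a rough illustration of how such diagnostics and a temporal reward could be computed, not the released Video-R2 implementation. The exact-match treatment of TAC, the attention-share treatment of VAS, and the temporal-IoU form of TAR are all assumptions made for this example, and every field name is hypothetical.

```python
# Illustrative sketch only: TAC, VAS, and TAR are defined informally in the
# abstract, so the concrete formulas below are assumptions, not the paper's code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    reasoning_answer: str   # answer implied by the reasoning trace (parsed out)
    final_answer: str       # answer the model actually emits
    attn_on_video: float    # attention mass reasoning tokens place on video tokens
    attn_on_text: float     # attention mass reasoning tokens place on text tokens


def think_answer_consistency(traces: List[Trace]) -> float:
    """TAC (assumed form): fraction of samples whose reasoning and final answer agree."""
    agree = sum(t.reasoning_answer.strip().lower() == t.final_answer.strip().lower()
                for t in traces)
    return agree / max(len(traces), 1)


def video_attention_score(traces: List[Trace]) -> float:
    """VAS (assumed form): mean share of attention mass on video versus text tokens."""
    shares = [t.attn_on_video / max(t.attn_on_video + t.attn_on_text, 1e-8)
              for t in traces]
    return sum(shares) / max(len(shares), 1)


def temporal_alignment_reward(pred: Tuple[float, float],
                              gold: Tuple[float, float]) -> float:
    """TAR (assumed form): temporal IoU between the time span cited in the reasoning
    and the annotated evidence span, usable as one reward term during GRPO."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0.0 else 0.0
```

In a GRPO-style setup, a reward of this kind would typically be computed per sampled rollout and normalized within each group to form advantages; how Video-R2 actually combines TAR with answer correctness is specified in the paper, not here.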
Related papers
- Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models [14.21980212001207]
Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning.
arXiv Detail & Related papers (2026-03-03T11:24:55Z)
- Process-of-Thought Reasoning for Videos [33.74677144833003]
Process-of-Thought (PoT) Reasoning for Videos is a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence; a rough control-flow sketch of such a loop appears after this list.
arXiv Detail & Related papers (2026-02-07T20:25:46Z)
- CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation [6.356820150960838]
We introduce two complementary approaches inspired by test-time scaling to stabilize vision-language models. CASHEW is an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence.
arXiv Detail & Related papers (2026-01-12T21:24:45Z)
- AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning [3.949628618389608]
AURA is a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains: causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling. We propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity.
arXiv Detail & Related papers (2025-08-10T20:06:42Z)
- VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos [89.39873803375498]
VideoMathQA is a benchmark designed to evaluate whether models can perform temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities.
arXiv Detail & Related papers (2025-06-05T17:59:58Z)
- VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models [29.706347050700867]
We introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). VCRBench tests whether Large Video Language Models (LVLMs) can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. We propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into two sub-tasks: video recognition and causal reasoning.
arXiv Detail & Related papers (2025-05-13T11:35:58Z)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos in order to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
- On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. We conduct a study on prediction consistency, a key indicator of the robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z)
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
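As flagged in the Process-of-Thought entry above, the following is a purely hypothetical sketch of an interleaved loop over (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis. Every function and type here is a caller-supplied placeholder; nothing is taken from the PoT authors' code.

```python
# Hypothetical control-flow sketch of a PoT-style reasoning loop.
# The three stages are injected as callables, so this fixes only the
# interleaving structure, not any particular model or prompt.
from typing import Callable, List, Optional, Tuple

Frame = object   # stand-in for a decoded video frame (hypothetical)
State = dict     # running hypothesis / working memory (hypothetical)


def pot_style_reasoning(
    frames: List[Frame],
    select_evidence: Callable[[List[Frame], State], List[int]],
    update_state: Callable[[State, List[Frame]], State],
    synthesize_answer: Callable[[State], Optional[str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], State]:
    """Iterate evidence selection -> state update -> constrained synthesis
    until an answer is committed or the step budget runs out."""
    state: State = {"steps": []}
    for _ in range(max_steps):
        # (i) pick the frame indices most relevant to the current hypothesis
        idx = select_evidence(frames, state)
        # (ii) fold only the selected evidence into the working state, and
        # record which frames were used so each step stays traceable
        state = update_state(state, [frames[i] for i in idx])
        state.setdefault("steps", []).append(idx)
        # (iii) try to commit to an answer consistent with the current state
        answer = synthesize_answer(state)
        if answer is not None:
            return answer, state
    return None, state
```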