VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
- URL: http://arxiv.org/abs/2510.10518v1
- Date: Sun, 12 Oct 2025 09:29:50 GMT
- Title: VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
- Authors: Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu
- Abstract summary: Multimodal reward models (RMs) have substantially improved post-training for visual generative models. VideoReward Thinker (VR-Thinker) is a thinking-with-image framework that equips the RM with visual reasoning operations and a visual memory window. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks.
- Score: 49.610569478718226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) Rejection Sampling Fine-Tuning, which selects samples whose per-dimension and overall judgments are all correct and trains on these high-quality traces to further enhance reasoning; and (iii) Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
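To make the mechanism concrete, here is a minimal Python sketch of the thinking-with-image loop described above. The model interface (generate_step, the select_frame operation, final_judgment) and the class names are assumptions for illustration, not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical sketch of a thinking-with-image reward-model loop.
# VisualMemory, generate_step, select_frame, and final_judgment are
# illustrative assumptions, not the paper's actual API.

@dataclass
class VisualMemory:
    """Fixed-size window of frames the model can currently attend to."""
    max_frames: int = 8
    frames: deque = field(default_factory=deque)

    def add(self, frame):
        self.frames.append(frame)
        while len(self.frames) > self.max_frames:
            self.frames.popleft()  # evict the oldest frame to stay within the context budget


def run_reward_reasoning(model, video_a, video_b, prompt, max_steps=16):
    """Interleave textual reasoning with frame-selection operations, then emit a judgment."""
    memory = VisualMemory(max_frames=8)
    trace = [prompt]

    for _ in range(max_steps):
        step = model.generate_step(trace, memory.frames)  # assumed model interface
        trace.append(step.text)

        if step.op == "select_frame":
            # Actively fetch the requested frames and update the visual memory window.
            for idx in step.frame_indices:
                memory.add(video_a.frames[idx])
                memory.add(video_b.frames[idx])
        elif step.op == "final_judgment":
            return step.preference  # e.g. "A", "B", or per-dimension scores

    return None  # no judgment produced within the step budget
```

The point of the sketch is the design choice the abstract describes: frames enter the context only on demand and older frames are evicted, so the reward model can inspect fine-grained details of long videos without exceeding its context budget.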
Related papers
- R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation [24.755888254171342]
Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. We propose R3G, a modular Reasoning-Retrieval-Reranking framework. It produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images.
arXiv Detail & Related papers (2026-01-25T12:12:12Z) - From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning [12.548754243700657]
Multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. We show that visual perception is the key bottleneck in such tasks, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Experiments on Qwen-2.5-VL-7B achieve a 5.56% improvement over the base model, with consistent gains across both in-domain and out-of-domain settings.
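As a loose illustration of how reward functions targeting image understanding, thinking steps, and answer accuracy could be combined into a single RL training signal (the paper's six reward functions and their weights are not reproduced here; every name below is an assumption):

```python
# Illustrative sketch only: combine several reward components into one scalar
# for RL fine-tuning. The components, attributes, and weights are hypothetical.

def combined_reward(response, reference_answer, weights=None):
    weights = weights or {"answer": 1.0, "thinking": 0.3, "grounding": 0.3}

    # Answer accuracy: exact match against the reference (simplified).
    answer_reward = 1.0 if response.final_answer == reference_answer else 0.0

    # Thinking steps: reward the presence of a non-trivial reasoning trace.
    thinking_reward = min(len(response.reasoning_steps), 5) / 5.0

    # Image understanding / grounding: fraction of steps that cite visual evidence.
    grounded = [s for s in response.reasoning_steps if s.references_image]
    grounding_reward = len(grounded) / max(len(response.reasoning_steps), 1)

    return (weights["answer"] * answer_reward
            + weights["thinking"] * thinking_reward
            + weights["grounding"] * grounding_reward)
```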
arXiv Detail & Related papers (2026-01-01T05:19:28Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
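Since GRPO appears both in this entry and in the VR-Thinker pipeline above, here is a minimal, generic sketch of the group-relative advantage that standard GRPO computes; it omits the clipped policy-gradient objective and KL regularization and is not Pretext-GRPO itself.

```python
import statistics

def group_relative_advantages(rewards):
    """Standard GRPO-style advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group (all rollouts for the same
    prompt). Sketch only; the policy update itself is omitted."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one prompt, rewarded 1.0 when the judgment is correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```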
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm [73.4888880112019]
"Thinking with Video" paradigm bridges visual and textual reasoning in a unified temporal framework.<n>Sora-2 is established as a capable reasoner on vision-centric tasks.<n>On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
arXiv Detail & Related papers (2025-11-06T17:25:23Z) - VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation [64.82775032985485]
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. We propose EVisRAG, an end-to-end framework that learns to reason over multiple images with evidence guidance to address this issue.
arXiv Detail & Related papers (2025-10-10T13:34:23Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks [41.90092896728809]
We present VidBridge-R1, the first versatile video reasoning model that effectively bridges the "Reason-Then-Respond" paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model.
arXiv Detail & Related papers (2025-06-10T03:57:53Z) - ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [33.70837005629285]
We propose video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. We develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs.
arXiv Detail & Related papers (2025-02-17T13:07:40Z)