VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
- URL: http://arxiv.org/abs/2510.10518v1
- Date: Sun, 12 Oct 2025 09:29:50 GMT
- Title: VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
- Authors: Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu
- Abstract summary: Multimodal reward models (RMs) have substantially improved post-training for visual generative models. VideoReward Thinker (VR-Thinker) is a thinking-with-image framework that equips the RM with visual reasoning operations and a visual memory window. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks.
- Score: 49.610569478718226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) Rejection Sampling Fine-Tuning, which selects samples whose per-dimension and overall judgments are all correct and trains on these high-quality traces to further enhance reasoning; and (iii) Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
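To make the mechanism concrete, here is a minimal Python sketch of the thinking-with-image loop described above. The model interface (generate_step, the select_frame operation, final_judgment) and the class names are assumptions for illustration, not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical sketch of a thinking-with-image reward-model loop.
# VisualMemory, generate_step, select_frame, and final_judgment are
# illustrative assumptions, not the paper's actual API.

@dataclass
class VisualMemory:
    """Fixed-size window of frames the model can currently attend to."""
    max_frames: int = 8
    frames: deque = field(default_factory=deque)

    def add(self, frame):
        self.frames.append(frame)
        while len(self.frames) > self.max_frames:
            self.frames.popleft()  # evict the oldest frame to stay within the context budget


def run_reward_reasoning(model, video_a, video_b, prompt, max_steps=16):
    """Interleave textual reasoning with frame-selection operations, then emit a judgment."""
    memory = VisualMemory(max_frames=8)
    trace = [prompt]

    for _ in range(max_steps):
        step = model.generate_step(trace, memory.frames)  # assumed model interface
        trace.append(step.text)

        if step.op == "select_frame":
            # Actively fetch the requested frames and update the visual memory window.
            for idx in step.frame_indices:
                memory.add(video_a.frames[idx])
                memory.add(video_b.frames[idx])
        elif step.op == "final_judgment":
            return step.preference  # e.g. "A", "B", or per-dimension scores

    return None  # no judgment produced within the step budget
```

The point of the sketch is the design choice the abstract describes: frames enter the context only on demand and older frames are evicted, so the reward model can inspect fine-grained details of long videos without exceeding its context budget.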
Related papers
- R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation [24.755888254171342]
Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. We propose R3G, a modular Reasoning-Retrieval-Reranking framework. It produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images.
arXiv Detail & Related papers (2026-01-25T12:12:12Z) - From Sight to Insight: Improving Visual Reasoning Capabilities of Multimodal Models via Reinforcement Learning [12.548754243700657]
Multimodal large language models (MLLMs) generate reasoning that lacks integration of visual information. We show that visual perception is the key bottleneck in such tasks, yielding gains of 26.7% for Claude 3.5 and 23.6% for Claude 3.7. We design and evaluate six reward functions targeting different reasoning aspects, including image understanding, thinking steps, and answer accuracy. Experiments on Qwen-2.5-VL-7B achieve a 5.56% improvement over the base model, with consistent gains across both in-domain and out-of-domain settings.
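As a loose illustration of how reward functions targeting image understanding, thinking steps, and answer accuracy could be combined into a single RL training signal (the paper's six reward functions and their weights are not reproduced here; every name below is an assumption):

```python
# Illustrative sketch only: combine several reward components into one scalar
# for RL fine-tuning. The components, attributes, and weights are hypothetical.

def combined_reward(response, reference_answer, weights=None):
    weights = weights or {"answer": 1.0, "thinking": 0.3, "grounding": 0.3}

    # Answer accuracy: exact match against the reference (simplified).
    answer_reward = 1.0 if response.final_answer == reference_answer else 0.0

    # Thinking steps: reward the presence of a non-trivial reasoning trace.
    thinking_reward = min(len(response.reasoning_steps), 5) / 5.0

    # Image understanding / grounding: fraction of steps that cite visual evidence.
    grounded = [s for s in response.reasoning_steps if s.references_image]
    grounding_reward = len(grounded) / max(len(response.reasoning_steps), 1)

    return (weights["answer"] * answer_reward
            + weights["thinking"] * thinking_reward
            + weights["grounding"] * grounding_reward)
```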
arXiv Detail & Related papers (2026-01-01T05:19:28Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
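Since GRPO appears both in this entry and in the VR-Thinker pipeline above, here is a minimal, generic sketch of the group-relative advantage that standard GRPO computes; it omits the clipped policy-gradient objective and KL regularization and is not Pretext-GRPO itself.

```python
import statistics

def group_relative_advantages(rewards):
    """Standard GRPO-style advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group (all rollouts for the same
    prompt). Sketch only; the policy update itself is omitted."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one prompt, rewarded 1.0 when the judgment is correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```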
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm [73.4888880112019]
"Thinking with Video" paradigm bridges visual and textual reasoning in a unified temporal framework.<n>Sora-2 is established as a capable reasoner on vision-centric tasks.<n>On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
arXiv Detail & Related papers (2025-11-06T17:25:23Z) - VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation [64.82775032985485]
Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. We propose EVisRAG, an end-to-end framework that learns to reason over multiple images with evidence guidance to address this issue.
arXiv Detail & Related papers (2025-10-10T13:34:23Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone. Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks [41.90092896728809]
We present VidBridge-R1, the first versatile video reasoning model that effectively bridges the "Reason-Then-Respond" paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model.
arXiv Detail & Related papers (2025-06-10T03:57:53Z) - ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [33.70837005629285]
We propose video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. We develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs.
arXiv Detail & Related papers (2025-02-17T13:07:40Z)