Related papers: What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

URL: http://arxiv.org/abs/2602.08346v1
Date: Mon, 09 Feb 2026 07:31:14 GMT
Title: What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning
Authors: Yujin Zhou, Pengcheng Wen, Jiale Chen, Boqin Yin, Han Zhu, Jiaming Ji, Juntao Dai, Chi-Min Chan, Sirui Han
Abstract summary: This work introduces the first comprehensive benchmark specifically designed for evaluating Process Reward Models (PRMs) under the thinking with images paradigm. We define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions.
Score: 18.918845075485
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.

Related papers

Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns [79.42805969325036]
Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks. PRMs are required to identify errors under various reasoning patterns during the reasoning process. Existing benchmarks mainly focus on evaluating PRMs with stepwise correctness. We introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns.
arXiv Detail & Related papers (2025-05-29T14:26:53Z)
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities [101.77467538102924]
Recent advancements in Large Reasoning Models (LRMs) have demonstrated remarkable performance in specialized reasoning tasks. We show that acquiring deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs. We demonstrate that adaptive reasoning, employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking, can effectively alleviate these drawbacks.
arXiv Detail & Related papers (2025-03-23T08:18:51Z)
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models [40.87249469370042]
Vision-language reward models (VLRMs) have become increasingly pivotal in the reasoning process. Existing benchmarks for VLRMs typically assess only a single aspect of their capabilities. We propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions.
arXiv Detail & Related papers (2025-03-10T15:52:57Z)
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning [32.850036320802474]
We introduce Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework designed to tackle OOD issues. By utilizing a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps as a warmup. Our experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets.
arXiv Detail & Related papers (2025-02-20T08:40:09Z)
The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs). We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [28.74956741932006]
We introduce PRMBench, a process-level benchmark to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions.
arXiv Detail & Related papers (2025-01-06T16:31:45Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.