VRPRM: Process Reward Modeling via Visual Reasoning
- URL: http://arxiv.org/abs/2508.03556v1
- Date: Tue, 05 Aug 2025 15:25:24 GMT
- Title: VRPRM: Process Reward Modeling via Visual Reasoning
- Authors: Xinquan Chen, Bangwei Liu, Xuhong Wang,
- Abstract summary: We propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K.
- Score: 1.4076905229310113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Process Reward Models (PRMs) are widely used in the post-training of Large Language Models (LLMs) because they can evaluate the reasoning steps of generated content at a fine granularity. However, most PRMs lack long-term reasoning and deep thinking capabilities. Although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, CoT-PRM data is too expensive to annotate for these models to perform stably across tasks. To address these challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that, using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM surpasses the non-thinking PRM with a total data volume of 400K and achieves a relative performance improvement of up to 118% over the base model in the Best-of-N (BoN) experiment. This result confirms that the proposed combined training strategy can achieve higher-quality reasoning capabilities at a lower data annotation cost, providing a new paradigm for PRM training with more efficient data utilization.
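The BoN setting mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the aggregation rule (minimum step score by default), the `toy_scorer` stand-in for a PRM, and all function names are assumptions introduced here for illustration. The sketch shows how per-step scores are collapsed into one score per candidate, how the best candidate is picked, and how a relative improvement such as the reported 118% is computed.

```python
from math import prod
from typing import Callable, List, Sequence


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step PRM scores into a single solution-level score.

    The choice of rule is an assumption; the abstract does not state one.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return prod(step_scores)
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown aggregation rule: {how}")


def best_of_n(
    candidates: Sequence[List[str]],
    score_steps: Callable[[List[str]], List[float]],
    how: str = "min",
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        range(len(candidates)),
        key=lambda i: aggregate(score_steps(candidates[i]), how),
    )


def relative_improvement(base: float, new: float) -> float:
    """Relative gain over a baseline; 1.18 corresponds to +118%."""
    return (new - base) / base


def toy_scorer(steps: List[str]) -> List[float]:
    """Hypothetical stand-in for a PRM: penalise steps tagged 'wrong'."""
    return [0.1 if "wrong" in step else 0.9 for step in steps]


if __name__ == "__main__":
    candidates = [
        ["step a", "wrong step b", "step c"],
        ["step a", "step b", "step c"],
        ["wrong step a"],
    ]
    print("selected candidate:", best_of_n(candidates, toy_scorer))
    print("relative improvement:", relative_improvement(10.0, 21.8))
```

Under the minimum rule a single low-scoring step sinks the whole candidate, which matches the intuition that one flawed step invalidates a reasoning chain; the product and mean rules are included only as common alternatives.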
Related papers
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
- Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z)
- Process Reward Models That Think [86.88809596842428]
Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
arXiv Detail & Related papers (2025-04-23T15:44:54Z)
- Efficient Process Reward Model Training via Active Learning [27.846449143217704]
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs). We propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training. A capable yet costly reasoning model then labels this data. A subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
arXiv Detail & Related papers (2025-04-14T14:53:56Z)
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models [33.547353090281284]
We propose a novel reward model approach called the Hierarchical Reward Model. It evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. It excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection.
arXiv Detail & Related papers (2025-03-16T15:18:40Z)
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- Free Process Rewards without Process Labels [55.14044050782222]
We show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. We show that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
arXiv Detail & Related papers (2024-12-02T21:20:02Z)
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.