VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
- URL: http://arxiv.org/abs/2502.06737v2
- Date: Thu, 26 Jun 2025 20:39:38 GMT
- Title: VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
- Authors: Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
- Abstract summary: We introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
- Score: 21.460891616139534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro Law category, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority-voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
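For readers unfamiliar with the aggregation strategies being compared, here is a minimal sketch of weighted majority voting against the plain majority-voting baseline. How per-step PRM scores collapse into a single trajectory weight (min, product, mean, or last step) varies across PRM papers, so the `min` used here is an assumption, not VersaPRM's documented choice.

```python
from collections import defaultdict

def weighted_majority_vote(candidates, step_scores):
    """Aggregate N sampled solutions into one final answer.

    candidates:  list of final answers, one per sampled chain of thought.
    step_scores: list of per-step PRM scores, one list per chain of thought.
    Each answer's weight is the sum of its trajectories' scores; the
    min-over-steps reduction below (worst step decides) is an assumption.
    """
    weights = defaultdict(float)
    for answer, scores in zip(candidates, step_scores):
        weights[answer] += min(scores)
    return max(weights, key=weights.get)

def majority_vote(candidates):
    """Plain majority voting, the baseline in the comparison above."""
    return max(set(candidates), key=candidates.count)
```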
Related papers
- Rethinking Reward Models for Multi-Domain Test-Time Scaling [91.76069784586149]
Prior work generally assumes that process reward models (PRMs) outperform outcome reward models (ORMs) that assess only the final answer. We present the first unified evaluation of four reward model variants across 14 diverse domains. We attribute the PRMs' weaker-than-expected results to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T04:21:14Z) - ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling [38.779046730647856]
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). We shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains.
arXiv Detail & Related papers (2025-09-29T08:40:46Z) - Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning [10.227089771963943]
We propose an uncertainty-driven framework for automated process reward data construction. We introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote. Experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework.
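The abstract does not define the two aggregation rules, so the following is only a hypothetical reading of a reward-frequency vote: score each distinct answer by a convex mix of how often it appears and the average reward of its trajectories. Both `alpha` and the combination rule are assumptions, not the paper's definitions.

```python
from collections import Counter, defaultdict

def reward_frequency_vote(answers, rewards, alpha=0.5):
    """Hypothetical reward-frequency vote: blend answer frequency with
    the mean reward of the trajectories producing that answer."""
    freq = Counter(answers)
    total_reward = defaultdict(float)
    for ans, r in zip(answers, rewards):
        total_reward[ans] += r
    n = len(answers)

    def score(ans):
        # alpha weights frequency; (1 - alpha) weights mean reward (assumed)
        return alpha * freq[ans] / n + (1 - alpha) * total_reward[ans] / freq[ans]

    return max(freq, key=score)
```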
arXiv Detail & Related papers (2025-08-03T14:14:13Z) - Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning [75.31797502976802]
We evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks. We find that most models that succeed in math fail to transfer their gains to other domains. Our results suggest a need to rethink standard post-training recipes.
arXiv Detail & Related papers (2025-07-01T05:23:05Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM for evaluating trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
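One plausible way to combine the two supervision signals the abstract names is a convex mix of averaged step rewards and a whole-trajectory reward; the mixing weight alpha and the averaging over steps are assumptions, and ReasonFlux-PRM's exact formulation may differ.

```latex
R(\tau) \;=\; \alpha \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{step}}(s_t)
\;+\; (1-\alpha)\, r_{\text{traj}}(\tau),
\qquad \alpha \in [0,1]
```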
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning [33.574626079343936]
We introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs. In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights. In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset.
arXiv Detail & Related papers (2025-05-26T17:20:17Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision [27.571090189791303]
We propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. We then generate over 700k step-level annotations without human labeling.
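A common way to produce step-level annotations without human labeling, consistent with what the abstract describes, is Monte Carlo rollout labeling: complete the solution several times from each partial prefix and check whether any completion reaches the gold answer. The sketch below assumes a hypothetical `policy.rollout` sampling call and hard 0/1 labels; the paper's exact procedure may differ.

```python
def label_steps(solution_steps, policy, gold_answer, k=8):
    """Label each reasoning step by completing the solution from that
    prefix k times; a step is marked correct if any rollout reaches the
    gold answer. `policy.rollout(prefix)` is a hypothetical call that
    samples a completion and returns its final answer."""
    labels = []
    for t in range(1, len(solution_steps) + 1):
        prefix = solution_steps[:t]
        hits = sum(policy.rollout(prefix) == gold_answer for _ in range(k))
        labels.append(1 if hits > 0 else 0)  # soft variant: hits / k
    return labels
```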
arXiv Detail & Related papers (2025-05-19T17:55:08Z) - Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets [6.001837672951086]
We introduce a novel Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search.
We then adapt Generative Flow Networks (GFlowNets) to operate at the reasoning step level.
Empirical evaluation shows strong improvements in both accuracy and solution diversity on challenging mathematical benchmarks.
arXiv Detail & Related papers (2025-04-28T16:56:41Z) - Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning [66.43194385702297]
Large Language Models (LLMs) have shown strong reasoning capabilities, particularly when enhanced through Reinforcement Learning (RL).
We propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora, including both synthetic and real-world question-answer pairs, into RL training to improve generalization across diverse reasoning tasks.
arXiv Detail & Related papers (2025-04-15T21:37:13Z) - GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning [35.429904556288996]
We introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification.
Experimental results show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset.
arXiv Detail & Related papers (2025-04-01T15:21:05Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.
Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.
We propose Reasoning-Driven Process Reward Modeling (R-PRM).
R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks.
By utilizing data subsets during the evaluation process, we address the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z) - Entropy-Regularized Process Reward Model [30.279394036823092]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning. We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDPs). Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
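Under the KL-regularized MDP view the abstract points to, the step reward is typically a soft, entropy-regularized aggregation over sampled outcome returns rather than a plain Monte Carlo mean; a sketch with assumed notation, where eta is the regularization strength:

```latex
r_{\eta}(s_t) \;=\; \frac{1}{\eta}\,\log\, \mathbb{E}_{\tau \sim \pi(\cdot \mid s_t)}\!\left[ e^{\eta\, R(\tau)} \right]
```

As eta approaches 0 this recovers the ordinary expected return, while larger eta weights the best rollouts more heavily.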
arXiv Detail & Related papers (2024-12-15T01:09:23Z) - On Domain-Adaptive Post-Training for Multimodal Large Language Models [72.67107077850939]
This paper systematically investigates domain adaptation of MLLMs via post-training. We focus on data synthesis, training pipeline, and task evaluation. We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z) - ControlMath: Controllable Data Generation Promotes Math Generalist Models [38.0858432336873]
We propose ControlMath, an iterative method involving an equation-generator module and two LLM-based agents.
The module creates diverse equations, which the Problem-Crafter agent then transforms into math word problems.
We collect ControlMathQA, which contains 190k math word problems.
arXiv Detail & Related papers (2024-09-20T03:58:26Z) - Task Oriented In-Domain Data Augmentation [38.525017729123114]
Large Language Models (LLMs) have shown superior performance in various applications and fields.
To achieve better performance on specialized domains such as law and advertising, LLMs are often continually pre-trained on in-domain data.
We propose TRAIT, a task-oriented in-domain data augmentation framework.
arXiv Detail & Related papers (2024-06-24T14:58:11Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
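In the spirit of the abstract (formulate reasoning as search, score partial paths with a reward model), the following is a minimal beam-search sketch; `policy.propose_steps` and `reward_model.score` are placeholders, not MindStar's actual API.

```python
def beam_search_reasoning(question, policy, reward_model, width=4, depth=8):
    """Inference-time search over reasoning paths: expand each beam with
    candidate next steps, score partial paths, and keep the top `width`."""
    beams = [([], 0.0)]  # (steps so far, path score)
    for _ in range(depth):
        expanded = []
        for steps, _ in beams:
            for step in policy.propose_steps(question, steps, n=width):
                new_steps = steps + [step]
                expanded.append((new_steps, reward_model.score(question, new_steps)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:width]
    return beams[0][0]  # highest-scoring reasoning path
```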
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
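The DPO objective applied to these step-level preference pairs is the standard one; here y_w and y_l denote the preferred and dispreferred steps (or partial trajectories) extracted from MCTS, pi_ref is the frozen reference policy, and beta is the usual temperature. The step-level application is as the abstract describes; the notation below is the standard DPO formulation, not copied from this paper.

```latex
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\right)\right]
```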
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling [0.0]
We introduce the idea of Mixture-of-Experts (MoE) into the field of reward model (RM) training.
We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one.
Our model attains superior consistency with human preference and outstrips advanced generative approaches.
arXiv Detail & Related papers (2024-03-02T12:31:22Z) - ERM++: An Improved Baseline for Domain Generalization [69.80606575323691]
Empirical Risk Minimization (ERM) can outperform most of the more complex Domain Generalization (DG) methods when properly tuned. ERM++ improves DG performance by over 5% compared to prior ERM baselines.
arXiv Detail & Related papers (2023-04-04T17:31:15Z) - FIXED: Frustratingly Easy Domain Generalization with Mixup [53.782029033068675]
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains.
A popular strategy is to augment training data to benefit generalization through methods such as Mixup (Zhang et al., 2018).
We propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX).
Our approach significantly outperforms nine state-of-the-art related methods, beating the best performing baseline by 6.5% on average in terms of test accuracy.
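For reference, vanilla Mixup interpolates input/label pairs as below; per the abstract, FIX applies the same interpolation to domain-invariant features rather than raw inputs.

```latex
\tilde{x} = \lambda x_i + (1-\lambda)\,x_j, \qquad
\tilde{y} = \lambda y_i + (1-\lambda)\,y_j, \qquad
\lambda \sim \mathrm{Beta}(\alpha, \alpha)
```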
arXiv Detail & Related papers (2022-11-07T09:38:34Z) - Multi-Label Contrastive Learning for Abstract Visual Reasoning [0.0]
State-of-the-art systems for solving Raven's Progressive Matrices (RPMs) rely on massive pattern-based training and on exploiting biases in the dataset.
Humans, by contrast, concentrate on identifying the rules and concepts underlying the RPM (or, more generally, the visual reasoning task) to be solved.
We propose a new sparse rule encoding scheme for RPMs which, besides the new training algorithm, is the key factor contributing to the state-of-the-art performance.
arXiv Detail & Related papers (2020-12-03T14:18:15Z)