StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
- URL: http://arxiv.org/abs/2509.22558v2
- Date: Wed, 01 Oct 2025 19:49:36 GMT
- Title: StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models
- Authors: Chenyu Zhou, Tianyi Xu, Jianghao Lin, Dongdong Ge,
- Abstract summary: We introduce StepORLM, a novel self-evolving framework with generative process supervision.<n>At its core, StepORLM features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other.
- Score: 18.500046072165254
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems. While reinforcement learning serves as a powerful paradigm for LLM training on OR problems, existing works generally face two key limitations. First, outcome reward suffers from the credit assignment problem, where correct final answers can reinforce flawed reasoning. Second, conventional discriminative process supervision is myopic, failing to evaluate the interdependent steps of OR modeling holistically. To this end, we introduce StepORLM, a novel self-evolving framework with generative process supervision. At its core, StepORLM features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other. This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM. The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM. Our resulting 8B-parameter StepORLM establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines. Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs.
Related papers
- ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation [54.071574153853994]
ProRAG is a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop.<n>Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism.
arXiv Detail & Related papers (2026-01-29T16:04:59Z) - Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs [23.590034731179824]
We present Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples cognitive modes.<n>Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis.<n>Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches.
arXiv Detail & Related papers (2025-11-08T21:01:27Z) - The Landscape of Agentic Reinforcement Learning for LLMs: A Survey [103.32591749156416]
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL)<n>This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL.
arXiv Detail & Related papers (2025-09-02T17:46:26Z) - Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning [12.724393910603299]
We introduce the Generative Multimodal Process Reward Model (GM-PRM)<n>Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step.<n>We show that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks.
arXiv Detail & Related papers (2025-08-06T05:10:29Z) - ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs)<n>ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.<n> Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z) - BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.<n>We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - Entropy-Regularized Process Reward Model [43.09203393852343]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning.<n>We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP)<n>Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
arXiv Detail & Related papers (2024-12-15T01:09:23Z) - MALT: Improving Reasoning with Multi-Agent LLM Training [66.9481561915524]
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps.<n>On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.