RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation
- URL: http://arxiv.org/abs/2511.19895v1
- Date: Tue, 25 Nov 2025 04:06:02 GMT
- Title: RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation
- Authors: Yuanyuan Lin, Xiangyu Ouyang, Teng Zhang, Kaixin Sui
- Abstract summary: RPM-MCTS is an effective method that uses knowledge retrieval as a process reward model within Monte Carlo Tree Search. We show that RPM-MCTS outperforms current state-of-the-art methods while reducing token consumption by approximately 15%.
- Score: 5.882211463956185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tree search-based methods have made significant progress in enhancing the code generation capabilities of large language models. However, because intermediate algorithmic steps are hard to evaluate effectively and erroneous steps cannot be located and corrected in time, these methods often generate incorrect code and incur increased computational costs. To tackle these problems, we propose RPM-MCTS, an effective method that uses knowledge retrieval as a process reward model within Monte Carlo Tree Search to evaluate intermediate algorithmic steps. By relying on knowledge-base retrieval, RPM-MCTS avoids the complex training of a dedicated process reward model. During the expansion phase, similarity filtering removes redundant nodes, ensuring diversity among reasoning paths. Furthermore, our method uses sandbox execution feedback to locate erroneous algorithmic steps during generation, enabling timely and targeted corrections. Extensive experiments on four public code generation benchmarks demonstrate that RPM-MCTS outperforms current state-of-the-art methods while reducing token consumption by approximately 15%. In addition, full fine-tuning of the base model on data constructed by RPM-MCTS significantly enhances its code generation capabilities.
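To make the recipe concrete, below is a minimal, self-contained Python sketch of the general idea the abstract describes: MCTS over algorithmic steps, a retrieval-similarity score standing in for the process reward model, and similarity filtering at expansion. Everything here (`KNOWLEDGE_BASE`, `propose_steps`, the thresholds) is a hypothetical stand-in rather than the authors' implementation, and the sandbox-execution feedback loop is omitted for brevity.

```python
import math
import random
import difflib

# Toy "knowledge base": step sequences from previously solved problems.
# In RPM-MCTS proper this would be a retrieval index over curated
# solutions; a list of strings stands in for it here.
KNOWLEDGE_BASE = [
    "read input; sort the array; binary search for the target; print index",
    "read input; build prefix sums; answer range-sum queries; print results",
]

def retrieval_reward(plan: str) -> float:
    # Process reward without a trained PRM: similarity of the partial
    # plan to its nearest knowledge-base entry.
    return max(difflib.SequenceMatcher(None, plan, ref).ratio()
               for ref in KNOWLEDGE_BASE)

class Node:
    def __init__(self, plan, parent=None):
        self.plan, self.parent = plan, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def propose_steps(plan, k=4):
    # Stand-in for sampling candidate next steps from an LLM.
    pool = ["sort the array", "binary search for the target",
            "build prefix sums", "answer range-sum queries",
            "print index", "print results"]
    return random.sample(pool, k)

def expand(node, sim_threshold=0.9):
    # Expansion with similarity filtering: near-duplicate children are
    # dropped so the search keeps diverse reasoning paths.
    for step in propose_steps(node.plan):
        plan = node.plan + "; " + step
        if any(difflib.SequenceMatcher(None, plan, c.plan).ratio()
               > sim_threshold for c in node.children):
            continue
        node.children.append(Node(plan, parent=node))

def search(root_plan="read input", iters=200):
    root = Node(root_plan)
    for _ in range(iters):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=Node.ucb)
        expand(node)                               # expansion + filtering
        leaf = random.choice(node.children) if node.children else node
        reward = retrieval_reward(leaf.plan)       # retrieval as the PRM
        while leaf is not None:                    # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda c: c.visits).plan

print(search())
```

In the full method, per the abstract, sandbox execution feedback would additionally flag the first failing step so the search can correct it in place rather than restart.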
Related papers
- From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision [49.59309446816251]
Existing methods estimate the quality of reasoning steps with a fixed-budget sampling strategy. We propose Adaptive Monte Carlo Search (AMCS), a framework that turns data generation from fixed and static into adaptive. AMCS refines its estimates by allocating more samples to uncertain reasoning steps and fewer to steps that are easy to estimate.
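The core idea, uncertainty-aware sample allocation, fits in a few lines. In this toy Python sketch, the hypothetical `rollout_success` stub plays the role of completing a solution from a step and checking it; the extra budget goes to whichever step's Monte Carlo estimate is currently least certain.

```python
import random

def rollout_success(step_index: int) -> bool:
    # Stand-in for "complete the solution from this step and verify it";
    # a biased coin flip plays that role here.
    return random.random() < 0.3 + 0.1 * step_index

def adaptive_estimates(num_steps=4, init_samples=4, extra_budget=40):
    # Phase 1: a small fixed budget per step, as in static sampling.
    wins = [sum(rollout_success(i) for _ in range(init_samples))
            for i in range(num_steps)]
    n = [init_samples] * num_steps
    # Phase 2: spend the remaining budget one sample at a time on the
    # step whose success estimate has the largest binomial variance
    # p * (1 - p) / n, i.e., the most uncertain one.
    for _ in range(extra_budget):
        p = [w / m for w, m in zip(wins, n)]
        uncert = [pi * (1 - pi) / m for pi, m in zip(p, n)]
        i = max(range(num_steps), key=lambda j: uncert[j])
        wins[i] += rollout_success(i)
        n[i] += 1
    return [w / m for w, m in zip(wins, n)], n

est, samples = adaptive_estimates()
print("step-quality estimates:", est, "samples per step:", samples)
```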
arXiv Detail & Related papers (2025-09-29T06:52:35Z) - ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding [15.051729280454454]
This paper introduces ReST-RL, a unified LLM RL paradigm. It combines an improved GRPO algorithm with a carefully designed test-time decoding method assisted by a value model (VM). We conduct extensive experiments on coding problems to verify the validity of the proposed RL paradigm.
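The blurb does not detail the improved GRPO variant, but the group-relative advantage at the heart of any GRPO-style method is easy to illustrate; the reward values below are made up (e.g., fraction of unit tests passed by each sampled completion of one prompt).

```python
import statistics

def grpo_advantages(rewards):
    # Group-relative advantages as used by GRPO-style methods: each
    # completion is scored against the mean and spread of its own group,
    # removing the need for a learned value baseline.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

print(grpo_advantages([0.0, 0.5, 1.0, 0.5]))
```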
arXiv Detail & Related papers (2025-08-27T05:16:03Z) - Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS [19.394761422323853]
We introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets show substantial relative improvements.
arXiv Detail & Related papers (2025-07-08T00:41:12Z) - Accelerating Model-Based Reinforcement Learning using Non-Linear Trajectory Optimization [2.1386708011362257]
This paper addresses the slow policy optimization convergence of Monte Carlo Probabilistic Inference for Learning Control (MC-PILCO) by integrating it with the iterative Linear Quadratic Regulator (iLQR), a fast trajectory optimization method suited to nonlinear systems. Experiments on the cart-pole task demonstrate that EB-MC-PILCO converges faster than standard MC-PILCO.
arXiv Detail & Related papers (2025-06-03T11:30:59Z) - Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By using data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z) - MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search [61.11836311160951]
We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks. Unlike standard RAG methods, which typically retrieve information independently of reasoning, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach improves decision-making, reduces hallucinations, and yields better factual accuracy and response consistency.
arXiv Detail & Related papers (2025-03-26T17:46:08Z) - Reward-Centered ReST-MCTS: A Robust Decision-Making Framework for Robotic Manipulation in High Uncertainty Environments [0.0]
This paper introduces Reward-Centered ReST-MCTS, a novel framework that enhances Monte Carlo Tree Search. The core of our approach is the Rewarding Center, which refines search trajectories by dynamically assigning partial rewards. Compared to baseline methods, our framework achieves a 2-4% accuracy improvement while maintaining computational feasibility.
arXiv Detail & Related papers (2025-03-07T08:25:04Z) - BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model. We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - Process Supervision-Guided Policy Optimization for Code Generation [15.943210767010045]
Reinforcement learning (RL) with unit-test feedback has enhanced large language models' (LLMs) code generation, but it relies on sparse rewards provided only after complete code evaluation. We propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement.
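As an illustration of dense line-level feedback, here is a hedged Python sketch: a stand-in PRM (`toy_prm`, which merely checks that the code prefix still parses) scores each prefix, and the per-line reward is the increment in that score, so a syntax-breaking line immediately receives a negative reward instead of waiting for full-program unit tests.

```python
import ast

def line_level_rewards(code: str, prm_score) -> list[float]:
    # Score the code prefix after each new line and reward the increment
    # (potential-based shaping), so feedback arrives during generation.
    lines = code.splitlines()
    rewards, prev = [], 0.0
    for i in range(1, len(lines) + 1):
        s = prm_score("\n".join(lines[:i]))  # PRM judges the prefix
        rewards.append(s - prev)
        prev = s
    return rewards

def toy_prm(prefix: str) -> float:
    # Hypothetical PRM stub: longer prefixes that still parse score higher.
    try:
        ast.parse(prefix)
        return len(prefix.splitlines()) / 10
    except SyntaxError:
        return 0.0

# The broken second line gets a negative reward: [0.1, -0.1, 0.0]
print(line_level_rewards("x = 1\ny = x +\nz = 2", toy_prm))
```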
arXiv Detail & Related papers (2024-10-23T07:22:33Z) - Improve Mathematical Reasoning in Language Models by Automated Process Supervision [23.807288360423193]
We propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. We are able to collect over 1.5 million process supervision annotations to train Process Reward Models (PRMs). This fully automated process supervision, together with a weighted self-consistency algorithm, enhances LLMs' math reasoning performance.
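The blurb leaves the divide-and-conquer step implicit; one common realization, and the way OmegaPRM is usually described, is a binary search over a failing reasoning chain for the first incorrect step, using Monte Carlo completions to label each prefix. The sketch below uses a hypothetical `mc_correct` stub with a planted error.

```python
import random

def mc_correct(steps, k=16, threshold=0.5):
    # Monte Carlo label for a prefix: complete the solution k times from
    # here and check answers. This stub "fails" once the prefix contains
    # the planted bad step.
    bad = any(s == "BAD" for s in steps)
    hits = sum(random.random() < (0.05 if bad else 0.9) for _ in range(k))
    return hits / k >= threshold

def first_error(steps):
    # Binary search for the earliest incorrect step: O(log n) prefix
    # evaluations instead of checking every step.
    lo, hi = 1, len(steps)  # assumes the full chain is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if mc_correct(steps[:mid]):
            lo = mid + 1    # prefix fine; the error lies later
        else:
            hi = mid        # error at or before mid
    return lo               # 1-indexed position of the first bad step

chain = ["ok", "ok", "BAD", "ok", "ok"]
print(first_error(chain))   # almost surely prints 3
```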
arXiv Detail & Related papers (2024-06-05T19:25:40Z) - Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
A Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from the PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward datasets for coding tasks and observe similar performance improvements on code generation.
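A minimal sketch of this kind of PRM-guided greedy step search, with hypothetical `propose` and `prm_score` stubs standing in for the LLM and the trained reward model:

```python
def greedy_with_prm(propose, prm_score, steps=3, beam=4):
    # Greedy decoding over reasoning steps: at every step, sample a few
    # candidate next steps and keep the one the PRM scores highest.
    path = []
    for _ in range(steps):
        candidates = propose(path, beam)
        path.append(max(candidates, key=lambda c: prm_score(path + [c])))
    return path

def propose(path, k):
    # Stand-in for sampling k candidate next steps from an LLM.
    return [f"step{len(path)}_v{i}" for i in range(k)]

def prm_score(path):
    # Stand-in PRM that happens to prefer the "_v0" variants.
    return sum(int(s.endswith("_v0")) for s in path)

print(greedy_with_prm(propose, prm_score))
```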
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic [61.968469104271676]
We propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm.
We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.
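For intuition, here is a toy multi-level Monte Carlo estimator of a long-run average reward: a cheap base estimate plus one randomly chosen correction level whose contribution is importance-weighted, keeping the estimator unbiased without fixing a horizon tied to the unknown mixing time. The reward process is a stand-in, and unlike a careful implementation the fine and coarse horizons are sampled independently here rather than coupled on a shared trajectory.

```python
import random

def avg_reward(T, rng):
    # Empirical average reward over a length-T trajectory of a toy
    # process whose long-run mean reward is 0.5 (stand-in for an MDP).
    return sum(rng.random() for _ in range(T)) / T

def mlmc_estimate(max_level=8, rng=random):
    # Level j contributes the telescoping difference between horizons
    # 2^j and 2^(j-1); drawing j with probability ~2^-j and dividing by
    # that probability keeps E[estimate] = E[avg_reward(2^max_level)].
    probs = [2.0 ** -j for j in range(1, max_level + 1)]
    total = sum(probs)
    probs = [p / total for p in probs]
    j = rng.choices(range(1, max_level + 1), weights=probs)[0]
    base = avg_reward(1, rng)
    fine = avg_reward(2 ** j, rng)
    coarse = avg_reward(2 ** (j - 1), rng)
    return base + (fine - coarse) / probs[j - 1]

est = sum(mlmc_estimate() for _ in range(2000)) / 2000
print(est)  # should land near the true mean of 0.5
```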
arXiv Detail & Related papers (2023-01-28T04:12:56Z)