Let's reward step by step: Step-Level reward model as the Navigators for Reasoning
- URL: http://arxiv.org/abs/2310.10080v1
- Date: Mon, 16 Oct 2023 05:21:50 GMT
- Title: Let's reward step by step: Step-Level reward model as the Navigators for Reasoning
- Authors: Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You and Hongxia Yang
- Abstract summary: Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similarly improved performance on code generation tasks.
- Score: 64.27898739929734
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent years have seen considerable advancements in multi-step reasoning with
Large Language Models (LLMs). Previous studies have elucidated the merits of
integrating feedback or search mechanisms during model inference to improve
reasoning accuracy. The Process-Supervised Reward Model (PRM) typically
furnishes LLMs with step-by-step feedback during the training phase, for
example in Proximal Policy Optimization (PPO) or rejection sampling. Our
objective is to examine the efficacy of PRM in the inference phase, helping to
discern the optimal solution paths for multi-step tasks such as mathematical
reasoning and code generation. To this end, we propose a heuristic greedy
search algorithm that employs the step-level feedback from PRM to optimize the
reasoning pathways explored by LLMs. This tailored PRM demonstrates improved
results compared to Chain-of-Thought (CoT) prompting on mathematical benchmarks
such as GSM8K and MATH. Additionally, to explore the versatility of our
approach, we develop a novel method to automatically generate a step-level
reward dataset for coding tasks and observe similarly improved performance on
code generation tasks, highlighting the robustness of our reward-model-based
approach to inference for reasoning tasks.
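To make the inference-time procedure concrete: the greedy step-level search samples a few candidate next steps from the generator, scores each extended partial solution with the PRM, and keeps only the best continuation. The Python sketch below illustrates that loop under stated assumptions; `generate_candidate_steps` and `prm_score` are hypothetical stand-ins for the LLM sampler and the trained PRM, and the stopping heuristic is illustrative rather than the paper's exact algorithm.

```python
from typing import Callable, List

# Hypothetical interfaces (not from the paper):
#   generate_candidate_steps(question, steps_so_far, k) -> k candidate next steps
#   prm_score(question, steps_so_far) -> scalar step-level score from the PRM
def greedy_prm_search(
    question: str,
    generate_candidate_steps: Callable[[str, List[str], int], List[str]],
    prm_score: Callable[[str, List[str]], float],
    num_candidates: int = 5,
    max_steps: int = 10,
) -> List[str]:
    """Greedy step-level search guided by a process-supervised reward model."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = generate_candidate_steps(question, steps, num_candidates)
        if not candidates:
            break
        # Keep the candidate whose extended partial solution the PRM scores highest.
        best = max(candidates, key=lambda step: prm_score(question, steps + [step]))
        steps.append(best)
        if "final answer" in best.lower():  # illustrative stopping heuristic
            break
    return steps
```

The greedy choice keeps decoding cheap relative to a full tree search, at the cost of never revisiting a discarded branch.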
Related papers
- Process Reward Model with Q-Value Rankings [18.907163177605607]
Process Reward Modeling (PRM) is critical for complex reasoning and decision-making tasks.
We introduce the Process Q-value Model (PQM), a novel framework that redefines PRM in the context of a Markov Decision Process.
PQM optimizes Q-value rankings with a novel comparative loss function, enhancing the model's ability to capture the intricate dynamics among sequential decisions.
arXiv Detail & Related papers (2024-10-15T05:10:34Z)
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path.
The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
- Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC).
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
- Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO).
Our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
arXiv Detail & Related papers (2024-06-16T09:06:17Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data (see the DPO sketch after this list).
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Let's Reinforce Step by Step [10.65244642965387]
We use Reinforcement Learning from Human Feedback to shape model reasoning processes.
Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning.
We also show the critical role that reward aggregation functions play in model performance (a minimal aggregation sketch follows this list).
arXiv Detail & Related papers (2023-11-10T01:35:51Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
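For the MCTS-based preference learning entry above, the collected step-level preference pairs are consumed by a standard DPO objective. Below is a minimal sketch of that loss, assuming summed token log-probabilities for each step are already available from the policy and a frozen reference model; the function and argument names are illustrative, not taken from that paper.

```python
import torch
import torch.nn.functional as F

def step_level_dpo_loss(
    logp_chosen_policy: torch.Tensor,    # log pi_theta(preferred step | prefix)
    logp_rejected_policy: torch.Tensor,  # log pi_theta(dispreferred step | prefix)
    logp_chosen_ref: torch.Tensor,       # same quantities under the frozen reference model
    logp_rejected_ref: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective applied to step-level preference pairs.

    Each tensor holds summed token log-probabilities for one reasoning step.
    Collecting the pairs via MCTS is that paper's contribution and is not
    reproduced here.
    """
    chosen_logratio = logp_chosen_policy - logp_chosen_ref
    rejected_logratio = logp_rejected_policy - logp_rejected_ref
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```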
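The "Let's Reinforce Step by Step" entry highlights reward aggregation: a PRM emits one score per step, so ranking complete solutions requires collapsing those scores into a single value. The sketch below shows common choices (minimum, mean, product), assuming per-step scores in [0, 1]; the helper is illustrative and not drawn from any of the listed papers.

```python
import math
from typing import List

def aggregate_step_rewards(step_scores: List[float], method: str = "min") -> float:
    """Collapse per-step PRM scores into a single solution-level score."""
    if not step_scores:
        return 0.0
    if method == "min":   # a solution is only as good as its weakest step
        return min(step_scores)
    if method == "mean":  # average step quality
        return sum(step_scores) / len(step_scores)
    if method == "prod":  # joint probability that all steps are correct
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation method: {method}")

# Example: rank two candidate solutions by their aggregated step scores.
solution_a = [0.9, 0.8, 0.95]
solution_b = [0.99, 0.4, 0.99]
best = max([solution_a, solution_b], key=lambda s: aggregate_step_rewards(s, "min"))
```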