Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning
- URL: http://arxiv.org/abs/2310.10080v1
- Date: Mon, 16 Oct 2023 05:21:50 GMT
- Title: Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning
- Authors: Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang
You and Hongxia Yang
- Abstract summary: Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
- Score: 64.27898739929734
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent years have seen considerable advancements in multi-step reasoning with
Large Language Models (LLMs). The previous studies have elucidated the merits
of integrating feedback or search mechanisms during model inference to improve
the reasoning accuracy. The Process-Supervised Reward Model (PRM), typically
furnishes LLMs with step-by-step feedback during the training phase, akin to
Proximal Policy Optimization (PPO) or reject sampling. Our objective is to
examine the efficacy of PRM in the inference phase to help discern the optimal
solution paths for multi-step tasks such as mathematical reasoning and code
generation. To this end, we propose a heuristic greedy search algorithm that
employs the step-level feedback from PRM to optimize the reasoning pathways
explored by LLMs. This tailored PRM demonstrated enhanced results compared to
the Chain of Thought (CoT) on mathematical benchmarks like GSM8K and MATH.
Additionally, to explore the versatility of our approach, we develop a novel
method to automatically generate step-level reward dataset for coding tasks and
observed similar improved performance in the code generation tasks. Thus
highlighting the robust nature of our reward-model-based approach to inference
for reasoning tasks.
Related papers
- Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO)
Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning.
From the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model.
arXiv Detail & Related papers (2024-06-16T09:06:17Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Metric-aware LLM inference for regression and scoring [52.764328080398805]
Large language models (LLMs) have demonstrated strong results on a range of NLP tasks.
We show that this inference strategy can be suboptimal for a range of regression and scoring tasks, and associated evaluation metrics.
We propose aware metric LLM inference: a decision theoretic approach optimizing for custom regression and scoring metrics at inference time.
arXiv Detail & Related papers (2024-03-07T03:24:34Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Let's Reinforce Step by Step [10.65244642965387]
We use Reinforcement Learning from Human Feedback to shape model reasoning processes.
Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning.
We also show the critical role reward aggregation functions play in model performance.
arXiv Detail & Related papers (2023-11-10T01:35:51Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Model-based Adversarial Meta-Reinforcement Learning [38.28304764312512]
We propose Model-based Adversarial Meta-Reinforcement Learning (AdMRL)
AdMRL aims to minimize the worst-case sub-optimality gap across all tasks in a family of tasks.
We evaluate our approach on several continuous control benchmarks and demonstrate its efficacy in the worst-case performance over all tasks.
arXiv Detail & Related papers (2020-06-16T02:21:49Z) - Hindsight Expectation Maximization for Goal-conditioned Reinforcement
Learning [26.631740480100724]
We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective.
The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as HER, to handle extremely sparse goal-conditioned rewards.
The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images.
arXiv Detail & Related papers (2020-06-13T03:25:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.