OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning
- URL: http://arxiv.org/abs/2311.09724v2
- Date: Mon, 1 Apr 2024 13:50:51 GMT
- Title: OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning
- Authors: Fei Yu, Anningzhe Gao, Benyou Wang,
- Abstract summary: We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness.
Inspired by the findings that $textitoutcome supervision for guided decoding essentially acts as a value model, we propose Outcome-supervised Value Model (OVM)
Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model.
- Score: 15.59540726867483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often struggle with maintaining accuracy throughout multiple multiple reasoning steps, especially in mathematical reasoning where an error in earlier steps can propagate to subsequent ones and it ultimately leading to an incorrect answer. To reduce error propagation, guided decoding is employed to direct the LM decoding on a step-by-step basis. We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness, as the former approach leads towards a correct final answer. This transforms the task into a $\textit{value estimation}$ problem in planning. Inspired by the findings that $\textit{outcome supervision for guided decoding essentially acts as a value model}$, we propose Outcome-supervised Value Model (OVM) that employs outcome supervision for training a value model, which prioritizes steps that lead to accurate conclusions. Furthermore, the OVM eliminates the need for labor-intensive annotations of step-level correctness, thereby significantly enhancing its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model. Notably, in GSM8K, our $\textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters}$; especially it does not utilize GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training value models for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for guided decoding.
Related papers
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [22.72856086318912]
We propose a novel Monte Carlo Tree Search (MCTS) algorithm named textitOmegaPRM for the efficient collection of high-quality process supervision data.
We are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM)
We have enhanced the instruction tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark.
arXiv Detail & Related papers (2024-06-05T19:25:40Z) - AlphaMath Almost Zero: process Supervision without process [6.318873143509028]
Large language models (LLMs) struggle with complex problems that require multiple reasoning steps.
We introduce an innovative approach that bypasses the need for process annotations (from human or GPTs) by utilizing the Monte Carlo Tree Search (MCTS) framework.
Our method iteratively trains the policy and value models, leveraging the capabilities of a well-pretrained LLM.
arXiv Detail & Related papers (2024-05-06T15:20:30Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning [54.585428241509234]
We propose R$3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL)
RL employs only outcome supervision to achieve the benefits of process supervision for large language models.
arXiv Detail & Related papers (2024-02-08T16:46:26Z) - Multi-step Problem Solving Through a Verifier: An Empirical Analysis on
Model-induced Process Supervision [43.03988648915096]
We introduce Model-induced Process Supervision (MiPS), a novel method for automating data curation.
MiPS annotates an intermediate step by sampling completions of this solution through the reasoning model, and obtaining an accuracy defined as the proportion of correct completions.
Our approach significantly improves the performance of PaLM 2 on math and coding tasks.
arXiv Detail & Related papers (2024-02-05T00:57:51Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z) - Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample efficient technique to obtain control policies.
VaGraM is a novel method for value-aware model learning.
arXiv Detail & Related papers (2022-04-04T13:28:31Z) - An Information Bottleneck Approach for Controlling Conciseness in
Rationale Extraction [84.49035467829819]
We show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective.
Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale.
arXiv Detail & Related papers (2020-05-01T23:26:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.