Solving math word problems with process- and outcome-based feedback
- URL: http://arxiv.org/abs/2211.14275v1
- Date: Fri, 25 Nov 2022 18:19:44 GMT
- Title: Solving math word problems with process- and outcome-based feedback
- Authors: Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah
Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins
- Abstract summary: We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task.
We find that pure outcome-based supervision produces final-answer error rates similar to process-based supervision, while requiring less label supervision.
However, to obtain correct reasoning steps we find it necessary to use process-based supervision, or supervision from learned reward models that emulate process-based feedback.
- Score: 15.331173715345125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that asking language models to generate reasoning steps
improves performance on many reasoning tasks. When moving beyond prompting,
this raises the question of how we should supervise such models: outcome-based
approaches which supervise the final result, or process-based approaches which
supervise the reasoning process itself? Differences between these approaches
might naturally be expected not just in final-answer errors but also in
reasoning errors, which can be difficult to detect and are problematic in many
real-world domains such as education. We run the first comprehensive comparison
between process- and outcome-based approaches trained on a natural language
task, GSM8K. We find that pure outcome-based supervision produces similar
final-answer error rates with less label supervision. However, for correct
reasoning steps we find it necessary to use process-based supervision or
supervision from learned reward models that emulate process-based feedback. In
total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer
error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct
solutions.
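To make the contrast concrete, here is a minimal Python sketch of how the two forms of supervision label the same GSM8K-style solution; the function names and the toy example are illustrative, not taken from the paper's code. Outcome-based feedback checks only the final answer, while process-based feedback labels every reasoning step, which is what lets it flag a final-answer-correct solution whose reasoning is wrong.

```python
# Hypothetical illustration: outcome-based vs. process-based labels for one solution.

def outcome_label(predicted_answer: str, reference_answer: str) -> int:
    """Outcome-based supervision: a single label per solution,
    1 if the final answer matches the reference answer, else 0."""
    return int(predicted_answer.strip() == reference_answer.strip())

def process_labels(step_annotations: list[bool]) -> list[int]:
    """Process-based supervision: one label per reasoning step,
    e.g. from human raters judging each step individually."""
    return [int(step_ok) for step_ok in step_annotations]

# A solution with a reasoning error that still lands on the right answer:
# 3 boxes of 4 apples is 3 * 4 = 12, but the model writes "3 + 4 = 12".
steps = [
    "There are 3 boxes with 4 apples each, so 3 + 4 = 12 apples.",
    "The final answer is 12.",
]
rater_judgements = [False, True]  # a rater marks the first step as invalid

print(outcome_label("12", "12"))         # -> 1: outcome feedback misses the bad step
print(process_labels(rater_judgements))  # -> [0, 1]: process feedback exposes it
```

In the paper's terms, this second, step-level signal (or a learned reward model that emulates it) is what drives down reasoning error among final-answer-correct solutions.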
Related papers
- Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective [59.61868506896214]
We show that, under standard data coverage assumptions, reinforcement learning is no more statistically difficult than learning through process supervision.
We prove that any policy's advantage function can serve as an optimal process reward model.
arXiv Detail & Related papers (2025-02-14T22:21:56Z)
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference.
This paper presents the first comprehensive study on the prevalent issue of overthinking in these models.
We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z)
- ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z)
- Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance.
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning [54.585428241509234]
We propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL).
R$^3$ employs only outcome supervision to achieve the benefits of process supervision for large language models.
arXiv Detail & Related papers (2024-02-08T16:46:26Z)
- OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning [15.59540726867483]
We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness.
Inspired by the findings that outcome supervision for guided decoding essentially acts as a value model, we propose the Outcome-supervised Value Model (OVM).
Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model (a minimal sketch of this kind of value-guided decoding appears after this list).
arXiv Detail & Related papers (2023-11-16T09:56:28Z)
- Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems.
Our model solves 78% of problems from a representative subset of the MATH test set.
We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
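Several of the entries above (OVM in particular, and the process-reward-model work) turn a learned scorer into a search signal rather than only a training signal. The sketch below is a minimal, hypothetical rendering of that idea, not an implementation from any of the papers: a step sampler and a trained value model are assumed to be available as black-box callables (`generate_step_candidates` and `value_model_score`, both invented names), and partial solutions are ranked by the value model's estimate of reaching a correct final answer rather than by per-step correctness.

```python
# Hypothetical sketch of value-guided, step-wise decoding in the spirit of OVM.
# `generate_step_candidates` and `value_model_score` are stand-ins for a language
# model sampler and a learned value/reward model; neither comes from the papers.
from typing import Callable, List

def guided_decode(
    question: str,
    generate_step_candidates: Callable[[str], List[str]],  # proposes next reasoning steps
    value_model_score: Callable[[str], float],             # scores a (partial) solution
    max_steps: int = 8,
    beam_width: int = 4,
) -> str:
    """Keep the partial solutions the value model rates highest at each step."""
    beams = [question]
    for _ in range(max_steps):
        candidates = [
            prefix + "\n" + step
            for prefix in beams
            for step in generate_step_candidates(prefix)
        ]
        if not candidates:
            break
        # Rank incomplete reasoning paths by their estimated chance of a correct answer.
        candidates.sort(key=value_model_score, reverse=True)
        beams = candidates[:beam_width]
        if all("answer is" in b.splitlines()[-1].lower() for b in beams):
            break  # every surviving path has produced a final answer
    return max(beams, key=value_model_score)

# Toy usage with stub callables; a real system would plug in a language model
# sampler and a trained value model here.
best = guided_decode(
    "Q: Tom has 3 boxes with 4 apples each. How many apples does he have?",
    generate_step_candidates=lambda prefix: ["3 * 4 = 12.", "The answer is 12."],
    value_model_score=lambda text: float(len(text)),  # placeholder scorer
)
print(best)
```

This only illustrates a search-time use of the scorer; the papers above differ mainly in how that scorer is trained (outcome labels for OVM, step-level labels for process reward models).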
This list is automatically generated from the titles and abstracts of the papers on this site.