Solving math word problems with process- and outcome-based feedback
- URL: http://arxiv.org/abs/2211.14275v1
- Date: Fri, 25 Nov 2022 18:19:44 GMT
- Title: Solving math word problems with process- and outcome-based feedback
- Authors: Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah
Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins
- Abstract summary: We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task.
We find that pure outcome-based supervision produces final-answer error rates similar to process-based supervision, while requiring less label supervision.
However, for correct reasoning steps we find it necessary to use process-based supervision, or supervision from learned reward models that emulate process-based feedback.
- Score: 15.331173715345125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that asking language models to generate reasoning steps
improves performance on many reasoning tasks. When moving beyond prompting,
this raises the question of how we should supervise such models: outcome-based
approaches which supervise the final result, or process-based approaches which
supervise the reasoning process itself? Differences between these approaches
might naturally be expected not just in final-answer errors but also in
reasoning errors, which can be difficult to detect and are problematic in many
real-world domains such as education. We run the first comprehensive comparison
between process- and outcome-based approaches trained on a natural language
task, GSM8K. We find that pure outcome-based supervision produces similar
final-answer error rates with less label supervision. However, for correct
reasoning steps we find it necessary to use process-based supervision or
supervision from learned reward models that emulate process-based feedback. In
total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer
error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct
solutions.
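To make the distinction concrete, here is a minimal Python sketch (my illustration, not the authors' code; the data format and helper names are hypothetical) of how the two forms of supervision label a sampled solution: outcome-based supervision derives one label per solution automatically from the final answer, while process-based supervision needs a label for every reasoning step.

```python
# Sketch: outcome- vs process-based labels for one sampled solution.
# The data format and helpers here are hypothetical illustrations.

def outcome_label(final_answer, gold_answer):
    """Outcome-based: one label per solution, computed automatically
    by comparing the sampled final answer to the reference answer."""
    return final_answer == gold_answer

def process_labels(solution_steps, step_annotations):
    """Process-based: one label per reasoning step, which in general
    requires human (or reward-model-emulated) judgments of each step."""
    assert len(step_annotations) == len(solution_steps)
    return list(step_annotations)

# A two-step GSM8K-style solution.
steps = ["Tom has 3 apples and buys 2 more: 3 + 2 = 5.",
         "He gives 1 away, leaving 5 - 1 = 4."]
print(outcome_label(final_answer="4", gold_answer="4"))      # True
print(process_labels(steps, step_annotations=[True, True]))  # [True, True]
```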
Related papers
- Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance.
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
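As a rough illustration of the step-level idea described above (my simplification, not the RPO paper's actual objective), a pairwise logistic loss can prefer a favorable branch over an unfavorable one sampled from the same reasoning step:

```python
import math

def step_preference_loss(logp_favorable, logp_unfavorable, beta=1.0):
    """Pairwise logistic loss over two branches from the same step:
    small when the favorable branch is much more likely than the
    unfavorable one under the model."""
    margin = beta * (logp_favorable - logp_unfavorable)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The favorable branch is 3 nats more likely: low loss.
print(step_preference_loss(logp_favorable=-2.0, logp_unfavorable=-5.0))
```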
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [22.72856086318912]
We propose a novel Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data.
We are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM).
We have enhanced the instruction tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark.
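The underlying signal for this kind of automated process supervision can be sketched as follows (a simplification of the general recipe, not OmegaPRM's exact MCTS procedure; `rollout` is a hypothetical completion sampler): the quality of each step prefix is estimated from the fraction of Monte Carlo completions that reach the correct final answer.

```python
import random

def mc_step_scores(steps, gold_answer, rollout, n_rollouts=16):
    """For each prefix steps[:i+1], estimate P(correct final answer)
    by completing the solution n_rollouts times from that prefix."""
    scores = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        hits = sum(rollout(prefix) == gold_answer for _ in range(n_rollouts))
        scores.append(hits / n_rollouts)
    return scores  # noisy per-step labels for training a PRM

# Toy rollout: pretend longer prefixes are more reliable.
def toy_rollout(prefix):
    return "42" if random.random() < 0.2 * len(prefix) else "wrong"

print(mc_step_scores(["s1", "s2", "s3"], "42", toy_rollout))
```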
arXiv Detail & Related papers (2024-06-05T19:25:40Z)
- Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks.
CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors.
We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
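As described, DUP is a staged prompting scheme. A schematic version (the prompt wording is mine, and `llm` is a hypothetical text-completion callable):

```python
def dup_solve(problem, llm):
    """Deeply Understanding the Problems: extract the core question,
    then the problem-solving information, then solve using both."""
    core = llm(f"Please extract the core question of: {problem}")
    info = llm(f"Extract the information needed to solve it: {problem}")
    return llm(f"Problem: {problem}\nCore question: {core}\n"
               f"Relevant information: {info}\n"
               f"Solve step by step and give the final answer.")

# Demo with a stub "LLM" that just echoes its prompt's first line.
print(dup_solve("A pen costs $2. How much do 3 pens cost?",
                llm=lambda prompt: prompt.splitlines()[0]))
```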
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning [54.585428241509234]
We propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL).
R$^3$ employs only outcome supervision to achieve the benefits of process supervision for large language models.
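The reverse-curriculum idea can be sketched as follows (my paraphrase of the summary, not the paper's code): episodes start from progressively earlier prefixes of a gold demonstration, so at first the model only has to complete the final step, and a simple final-answer check suffices as reward.

```python
def reverse_curriculum_starts(gold_steps, stage):
    """Stage 0: start just before the last step (easy). Higher stages
    move the start state earlier in the demonstration, until the model
    must produce the full solution from scratch."""
    cut = max(0, len(gold_steps) - 1 - stage)
    return gold_steps[:cut]  # prefix given to the policy as context

gold = ["step 1", "step 2", "step 3", "answer: 7"]
for stage in range(len(gold)):
    print(stage, reverse_curriculum_starts(gold, stage))
```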
arXiv Detail & Related papers (2024-02-08T16:46:26Z)
- OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning [15.59540726867483]
We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness.
Inspired by the finding that outcome supervision for guided decoding essentially acts as a value model, we propose the Outcome-supervised Value Model (OVM).
Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model.
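Guided decoding with such a value model amounts to a step-level beam search; here is a minimal sketch (with hypothetical `expand` and `value` functions) that ranks partial paths by their estimated chance of eventually reaching a correct answer rather than by per-step correctness:

```python
def value_guided_search(question, expand, value, beam=4, max_steps=8):
    """Keep the `beam` partial reasoning paths with the highest
    value-model score, i.e. the highest estimated probability of
    ending in a correct final answer."""
    paths = [[]]
    for _ in range(max_steps):
        candidates = [p + [s] for p in paths for s in expand(question, p)]
        if not candidates:
            break
        candidates.sort(key=lambda p: value(question, p), reverse=True)
        paths = candidates[:beam]
        if all(p[-1].startswith("answer:") for p in paths):
            break
    return paths[0]
```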
arXiv Detail & Related papers (2023-11-16T09:56:28Z)
- Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems.
Our model solves 78% of problems from a representative subset of the MATH test set.
We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
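At test time, a process reward model of this kind is typically used to rerank sampled solutions. A minimal best-of-N sketch (aggregating step scores by their product in log space is one common choice, not necessarily the paper's exact one):

```python
import math

def rerank_by_prm(solutions, prm_step_scores):
    """Pick the sampled solution whose steps the PRM likes best.
    Each solution is a list of steps; prm_step_scores returns a
    probability-of-correctness per step. Aggregate in log space."""
    def score(steps):
        return sum(math.log(max(p, 1e-9)) for p in prm_step_scores(steps))
    return max(solutions, key=score)
```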
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the code-davinci-002 language model and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
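One ingredient, verifier-weighted voting over sampled solutions, can be sketched as follows (a simplification; the full method also uses diverse prompts and step-aware verification):

```python
from collections import defaultdict

def weighted_vote(samples, verifier):
    """Each sample is (final_answer, solution_text). Sum verifier
    scores per distinct answer instead of counting raw votes."""
    totals = defaultdict(float)
    for answer, text in samples:
        totals[answer] += verifier(text)
    return max(totals, key=totals.get)

samples = [("4", "sol a"), ("4", "sol b"), ("5", "sol c")]
print(weighted_vote(samples, verifier=lambda t: 0.9 if "a" in t else 0.4))
```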
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
- Global Optimization of Objective Functions Represented by ReLU Networks [77.55969359556032]
Neural networks can learn complex, non-convex functions, and it is challenging to guarantee their correct behavior in safety-critical contexts.
Many approaches exist to find failures in networks (e.g., adversarial examples), but these cannot guarantee the absence of failures.
We propose an approach that integrates the optimization process into the verification procedure, achieving better performance than the naive approach.
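For intuition about this style of verification, here is a tiny interval-bound propagation pass through one ReLU layer (a standard ingredient of such verifiers, shown only as an illustration, not the paper's algorithm): sound output bounds make it possible to certify that no input in a region triggers a failure.

```python
def relu_layer_bounds(lo, hi, W, b):
    """Sound elementwise bounds for relu(W @ x + b) when each
    input x[j] lies in [lo[j], hi[j]] (plain-Python intervals)."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        lo_acc = hi_acc = bias
        for w, l, h in zip(row, lo, hi):
            lo_acc += w * (l if w >= 0 else h)
            hi_acc += w * (h if w >= 0 else l)
        out_lo.append(max(0.0, lo_acc))
        out_hi.append(max(0.0, hi_acc))
    return out_lo, out_hi

print(relu_layer_bounds([-1, -1], [1, 1], W=[[1.0, -2.0]], b=[0.5]))
```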
arXiv Detail & Related papers (2020-10-07T08:19:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.