Multi-step Problem Solving Through a Verifier: An Empirical Analysis on
Model-induced Process Supervision
- URL: http://arxiv.org/abs/2402.02658v1
- Date: Mon, 5 Feb 2024 00:57:51 GMT
- Title: Multi-step Problem Solving Through a Verifier: An Empirical Analysis on
Model-induced Process Supervision
- Authors: Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu,
Jingbo Shang
- Abstract summary: We introduce Model-induced Process Supervision (MiPS), a novel method for automating data curation.
MiPS annotates an intermediate step by sampling completions of the partial solution from the reasoning model and computing an accuracy defined as the proportion of correct completions.
Our approach significantly improves the performance of PaLM 2 on math and coding tasks.
- Score: 43.03988648915096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process supervision, using a trained verifier to evaluate the intermediate
steps generated by a reasoner, has demonstrated significant improvements in
multi-step problem solving. In this paper, to avoid expensive human annotation
effort for the verifier's training data, we introduce Model-induced Process
Supervision (MiPS), a novel method for automating data curation. MiPS annotates
an intermediate step by sampling completions of the partial solution from the
reasoning model and computing an accuracy defined as the proportion of correct
completions. Because errors in the reasoner cause MiPS to underestimate the
accuracy of intermediate steps, we suggest, and empirically show, that
verification should focus on the verifier's high predicted scores rather than
its low predicted scores, contrary to prior work. Our approach significantly
improves the performance of PaLM 2 on math and coding tasks (accuracy +0.67%
on GSM8K, +4.16% on MATH, +0.92% on MBPP compared with a verifier trained with
output supervision). Additionally, our study demonstrates that the verifier
exhibits strong generalization ability across different reasoning models.
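To make the annotation procedure and the preference for high verifier scores concrete, here is a minimal Python sketch. It is an illustration rather than the paper's implementation: `sample_completions`, `is_correct`, and `verifier_score` are hypothetical stand-ins for the reasoning model's sampler, the task's answer checker, and the trained verifier.

```python
from typing import Callable, List


def mips_step_labels(
    prefix_steps: List[str],
    sample_completions: Callable[[str, int], List[str]],  # hypothetical model sampler
    is_correct: Callable[[str], bool],                     # hypothetical answer checker
    num_samples: int = 8,
) -> List[float]:
    """Label each intermediate step with the proportion of correct completions
    sampled from that step onward (the MiPS accuracy of the step)."""
    labels = []
    for i in range(1, len(prefix_steps) + 1):
        prefix = "\n".join(prefix_steps[:i])
        completions = sample_completions(prefix, num_samples)
        labels.append(sum(is_correct(c) for c in completions) / num_samples)
    return labels


def select_by_verifier(
    candidates: List[str],
    verifier_score: Callable[[str], float],  # hypothetical trained verifier
) -> str:
    """Return the candidate with the highest verifier score, reflecting the
    abstract's suggestion to focus on high predicted scores."""
    return max(candidates, key=verifier_score)
```

In this reading, the (step, accuracy) pairs produced by the first function would form the verifier's training data; tokenization, batching, and the aggregation of step-level scores into a solution-level score are elided.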
Related papers
- Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC).
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [22.72856086318912]
We propose a novel Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data.
We are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM).
We have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark.
arXiv Detail & Related papers (2024-06-05T19:25:40Z)
- Semi-supervised 2D Human Pose Estimation via Adaptive Keypoint Masking [2.297586471170049]
This paper proposes an adaptive keypoint masking method, which can fully mine the information in the samples and obtain better estimation performance.
The effectiveness of the proposed method is verified on COCO and MPII, outperforming the state-of-the-art semi-supervised pose estimation by 5.2% and 0.3%, respectively.
arXiv Detail & Related papers (2024-04-23T08:41:50Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation [15.728090002818963]
Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels.
In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels.
arXiv Detail & Related papers (2023-08-01T05:01:05Z)
- Let's Verify Step by Step [73.58107073356732]
We show that process supervision significantly outperforms outcome supervision for training models to solve problems.
Our model solves 78% of problems from a representative subset of the MATH test set.
We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
arXiv Detail & Related papers (2023-05-31T17:24:00Z)
- Boosting Out-of-Distribution Detection with Multiple Pre-trained Models [41.66566916581451]
Post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems.
We propose a detection enhancement method by ensembling multiple detection decisions derived from a zoo of pre-trained models.
Our method substantially improves the relative performance by 65.40% and 26.96% on the CIFAR10 and ImageNet benchmarks.
arXiv Detail & Related papers (2022-12-24T12:11:38Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
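As a rough illustration of the ATC recipe summarized above (not the authors' code), the sketch below learns the confidence threshold on labeled source data and applies it to unlabeled target data; the array names `source_conf`, `source_correct`, and `target_conf` are hypothetical.

```python
import numpy as np


def learn_atc_threshold(source_conf: np.ndarray, source_correct: np.ndarray) -> float:
    """Pick a threshold so that the fraction of source examples whose confidence
    exceeds it matches the observed source accuracy."""
    source_accuracy = float(source_correct.mean())
    # The (1 - accuracy)-quantile of the confidences leaves an `accuracy` fraction above it.
    return float(np.quantile(source_conf, 1.0 - source_accuracy))


def predict_target_accuracy(target_conf: np.ndarray, threshold: float) -> float:
    """Estimate target accuracy as the fraction of unlabeled target examples
    whose confidence exceeds the learned threshold."""
    return float((target_conf > threshold).mean())
```

With max-softmax confidences computed on a held-out labeled source split and on the unlabeled target split, the second call returns the ATC-style accuracy estimate.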
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.