Self-Evaluation Guided Beam Search for Reasoning
- URL: http://arxiv.org/abs/2305.00633v3
- Date: Thu, 26 Oct 2023 01:43:17 GMT
- Title: Self-Evaluation Guided Beam Search for Reasoning
- Authors: Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian
He, Qizhe Xie
- Abstract summary: We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via stochastic beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively.
- Score: 61.523627290397556
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Breaking down a problem into intermediate steps has demonstrated impressive
performance in Large Language Model (LLM) reasoning. However, the growth of the
reasoning chain introduces uncertainty and error accumulation, making it
challenging to elicit accurate final results. To tackle this challenge of
uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation
mechanism to guide and calibrate the reasoning process of LLMs. We propose a
decoding algorithm integrating the self-evaluation guidance via stochastic beam
search. The self-evaluation guidance serves as a better-calibrated automatic
criterion, facilitating an efficient search in the reasoning space and
resulting in superior prediction quality. Stochastic beam search balances
exploitation and exploration of the search space with temperature-controlled
randomness. Our approach surpasses the corresponding Codex-backboned baselines
in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA,
and StrategyQA benchmarks, respectively. Experimental results with Llama-2 on
arithmetic reasoning demonstrate the efficiency of our method in outperforming
the baseline methods with comparable computational budgets. Further analysis of
multi-step reasoning finds that our self-evaluation guidance pinpoints logic
failures and leads to higher consistency and robustness. Our code is publicly
available at https://guideddecoding.github.io/.
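The decoding procedure lends itself to a compact illustration. Below is a minimal Python sketch of self-evaluation guided stochastic beam search, assuming hypothetical `generate_step_candidates` and `self_evaluate` stand-ins for real LLM calls; the blended score (weighting generation log-probability against self-evaluation confidence via a coefficient `alpha`) and the temperature-controlled sampling are illustrative choices in the spirit of the abstract, not the paper's exact formulation.

```python
import math
import random

# Hypothetical stand-ins for real LLM calls.
def generate_step_candidates(question, prefix_steps, k):
    """Sample k candidate next reasoning steps with their log-probabilities."""
    return [(f"step {len(prefix_steps)}.{i}", -random.random()) for i in range(k)]

def self_evaluate(question, prefix_steps, step):
    """Return the LLM's confidence (0..1) that `step` correctly continues the chain."""
    return random.random()

def guided_stochastic_beam_search(question, beam_width=4, expand_k=4,
                                  max_steps=6, alpha=0.5, temperature=0.5):
    beams = [([], 0.0)]  # (partial reasoning chain, cumulative score)
    for _ in range(max_steps):
        candidates = []
        for steps, score in beams:
            for step, logp in generate_step_candidates(question, steps, expand_k):
                conf = max(self_evaluate(question, steps, step), 1e-6)
                # Blend generation confidence with stepwise self-evaluation
                # (illustrative weighting, not the paper's exact formula).
                step_score = alpha * logp + (1.0 - alpha) * math.log(conf)
                candidates.append((steps + [step], score + step_score))
        # Temperature-controlled sampling trades off exploitation (low T,
        # close to deterministic top-k) against exploration (high T).
        weights = [math.exp(s / temperature) for _, s in candidates]
        beams = random.choices(candidates, weights=weights, k=beam_width)
    return max(beams, key=lambda b: b[1])[0]

print(guided_stochastic_beam_search("toy question"))
```

A faithful stochastic beam search would sample candidates without replacement (e.g., via Gumbel top-k); `random.choices` keeps the sketch short.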
Related papers
- FLARE: Faithful Logic-Aided Reasoning and Exploration [50.9814063216852]
We introduce a novel approach for traversing the problem space using task decompositions.
We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic programming code.
Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
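As a toy illustration of the soft-formalisation idea, the sketch below encodes facts and a single rule in plain Python and checks whether a reasoning step's claim is entailed; FLARE's actual pipeline uses an LLM and logic-programming code, so the `parent`/`grandparent` relations and the entailment check here are hypothetical.

```python
# Toy stand-in for soft-formalisation (hypothetical, not FLARE's code):
# the query becomes facts plus a rule, and a reasoning step counts as
# faithful only if its claim is entailed by what the rule derives.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

# Rule: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
def derive_grandparents(facts):
    new = set()
    for rel1, x, y1 in facts:
        for rel2, y2, z in facts:
            if rel1 == rel2 == "parent" and y1 == y2:
                new.add(("grandparent", x, z))
    return new

derived = facts | derive_grandparents(facts)

step_claim = ("grandparent", "alice", "carol")  # claim made by one reasoning step
print("faithful" if step_claim in derived else "unfaithful")
```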
arXiv Detail & Related papers (2024-10-14T19:39:11Z) - Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models [21.96773736059112]
Large Language Models (LLMs) face safety concerns due to potential misuse by malicious users.
Recent red-teaming efforts have identified adversarial suffixes capable of jailbreaking LLMs using the gradient-based search algorithm Greedy Coordinate Gradient (GCG).
We propose a two-stage transfer learning framework, DeGCG, which decouples the search process into behavior-agnostic pre-searching and behavior-relevant post-searching.
arXiv Detail & Related papers (2024-08-27T08:38:48Z)
- Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimates, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
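A simplified sketch of the model-free idea, using a discretized scalar LQ problem and finite-difference policy updates in place of the paper's continuous-time actor-critic; all constants are illustrative and the true dynamics are hidden from the learner.

```python
import random

# Illustrative scalar LQ problem: dx = (a*x + b*u) dt + sigma dW, with
# running cost q*x^2 + r*u^2; the learner never sees these parameters.
A, B, SIGMA, Q, R = -0.3, 1.0, 0.2, 1.0, 0.5  # hidden from the learner
DT, HORIZON = 0.01, 200

def rollout_cost(k):
    """Simulate the linear feedback policy u = -k*x and return a sampled cost."""
    x, cost = 1.0, 0.0
    for _ in range(HORIZON):
        u = -k * x
        cost += (Q * x * x + R * u * u) * DT
        x += (A * x + B * u) * DT + SIGMA * (DT ** 0.5) * random.gauss(0, 1)
    return cost

# Model-free update of the policy parameter k via finite-difference
# gradient estimates (a crude stand-in for the paper's actor-critic).
k, lr, eps = 0.0, 0.05, 0.1
for it in range(300):
    grad = (rollout_cost(k + eps) - rollout_cost(k - eps)) / (2 * eps)
    k -= lr * grad
print(f"learned feedback gain k ~= {k:.3f}")
```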
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [22.72856086318912]
We propose a novel Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data.
We are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM).
We have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark.
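A hedged sketch of one way such annotations can be collected cheaply: Monte Carlo rollouts estimate whether each chain prefix can still reach the gold answer, and a binary search locates the first erroneous step. This mirrors the efficiency idea in the abstract, but the `sample_completion` mock, the threshold, and the monotonicity assumption are hypothetical, not the paper's exact algorithm.

```python
import random

def sample_completion(question, prefix_steps, error_at=3):
    """Mock LLM call (hypothetical): chains whose prefix already contains the
    bad step `error_at` can no longer reach the gold answer."""
    if len(prefix_steps) >= error_at:
        return "wrong"
    return "42" if random.random() < 0.8 else "wrong"

def mc_success_rate(question, steps, i, gold, n_rollouts=8):
    """Monte Carlo estimate that completing from the prefix steps[:i] still
    reaches the gold answer."""
    wins = sum(sample_completion(question, steps[:i]) == gold
               for _ in range(n_rollouts))
    return wins / n_rollouts

def first_error_step(question, steps, gold, threshold=0.0):
    """Binary-search for the earliest step after which no rollout succeeds.
    Assumes monotonicity (once a chain is wrong, it stays wrong), so the
    number of probes is logarithmic in the chain length."""
    lo, hi, ans = 1, len(steps), None
    while lo <= hi:
        mid = (lo + hi) // 2
        if mc_success_rate(question, steps, mid, gold) <= threshold:
            ans, hi = mid, mid - 1   # steps[mid-1] might be the first bad step
        else:
            lo = mid + 1
    return ans  # None if no erroneous step was detected

steps = [f"step {i}" for i in range(1, 7)]
print(first_error_step("toy question", steps, gold="42"))  # -> 3 (the mocked error)
```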
arXiv Detail & Related papers (2024-06-05T19:25:40Z)
- A Strong Baseline for Batch Imitation Learning [25.392006064406967]
We provide an easy-to-implement, novel algorithm for imitation learning under a strict data paradigm.
This paradigm allows our algorithm to be used for environments in which safety or cost are of critical concern.
arXiv Detail & Related papers (2023-02-06T14:03:33Z)
- Reliable Causal Discovery with Improved Exact Search and Weaker Assumptions [17.097192646470372]
We introduce several strategies to improve the scalability of exact score-based methods in the linear Gaussian setting.
We develop a super-structure estimation method based on the support of the inverse covariance matrix, which requires assumptions that are strictly weaker than faithfulness.
We also propose a local search strategy that performs exact search on the local clusters formed by each variable and its neighbors within two hops in the super-structure.
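A small numpy sketch of the super-structure idea described above: in the linear Gaussian setting, the support of the precision (inverse covariance) matrix recovers the moral graph, so thresholding an estimated precision matrix yields a candidate super-structure for the subsequent exact search. The data and the threshold below are illustrative; real estimators use regularization or statistical tests.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a linear Gaussian model over a chain x0 -> x1 -> x2 (x3 independent).
n = 5000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])

# The (i, j) entry of the precision matrix is (approximately) nonzero iff
# i and j are adjacent in the moral graph, which serves as a super-structure
# constraining the exact search.
precision = np.linalg.inv(np.cov(X, rowvar=False))
threshold = 0.1  # illustrative choice
superstructure = (np.abs(precision) > threshold) & ~np.eye(4, dtype=bool)
print(superstructure.astype(int))
# Expected pattern: edges 0-1 and 1-2 present, node 3 isolated.
```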
arXiv Detail & Related papers (2022-01-14T20:52:30Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
- Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9-36 times more efficient than full-batch methods.
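A minimal numpy sketch of one mini-batch estimator in this spirit, applied to CVaR-style DRO for logistic regression: each step averages gradients over only the hardest alpha-fraction of the batch, so per-step cost is independent of the training set size. The data, hyperparameters, and the specific estimator are illustrative, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data.
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def loss_and_grad(w, xb, yb):
    """Per-example logistic losses and per-example gradients."""
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))
    losses = -(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))
    grads = (p - yb)[:, None] * xb  # d(loss_i)/dw, one row per example
    return losses, grads

# Mini-batch CVaR-style DRO: average gradients over the hardest
# alpha-fraction of each batch; per-step cost never depends on n.
w, lr, batch, alpha = np.zeros(d), 0.1, 64, 0.25
k = max(1, int(alpha * batch))
for step in range(500):
    idx = rng.choice(n, size=batch, replace=False)
    losses, grads = loss_and_grad(w, X[idx], y[idx])
    worst = np.argsort(losses)[-k:]  # hardest alpha-fraction of the batch
    w -= lr * grads[worst].mean(axis=0)
print("trained robust weights:", np.round(w, 2))
```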
arXiv Detail & Related papers (2020-10-12T17:41:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.