Making Large Language Models Better Reasoners with Step-Aware Verifier
- URL: http://arxiv.org/abs/2206.02336v3
- Date: Wed, 24 May 2023 04:08:08 GMT
- Title: Making Large Language Models Better Reasoners with Step-Aware Verifier
- Authors: Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, Weizhu Chen
- Abstract summary: DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the latest language model code-davinci-002 and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
- Score: 49.16750018427259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot learning is a challenging task that requires language models to
generalize from limited examples. Large language models like GPT-3 and PaLM
have made impressive progress in this area, but they still face difficulties in
reasoning tasks such as GSM8K, a benchmark for arithmetic problems. To improve
their reasoning skills, previous work has proposed to guide the language model
with prompts that elicit a series of reasoning steps before giving the final
answer, achieving a significant improvement on GSM8K from 17.9% to 58.1% in
problem-solving rate. In this paper, we present DIVERSE (Diverse Verifier on
Reasoning Step), a novel approach that further enhances the reasoning
capability of language models. DIVERSE has three main components: first, it
generates diverse prompts to explore different reasoning paths for the same
question; second, it uses a verifier to filter out incorrect answers based on a
weighted voting scheme; and third, it verifies each reasoning step individually
instead of the whole chain. We evaluate DIVERSE on the latest language model
code-davinci-002 and show that it achieves new state-of-the-art results on six
of eight reasoning benchmarks (e.g., GSM8K 74.4% to 83.2%).
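The verifier-weighted voting mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the sample answers, and the verifier scores below are all hypothetical, and in the paper the scores would come from a trained verifier model rather than hand-written numbers. The sketch only contrasts plain majority voting with voting in which each reasoning path contributes its verifier score.

```python
from collections import defaultdict

def majority_vote(paths):
    """Plain voting: every reasoning path counts once toward its final answer."""
    totals = defaultdict(float)
    for answer, _score in paths:
        totals[answer] += 1.0
    return max(totals, key=totals.get)

def verifier_weighted_vote(paths):
    """Weighted voting: every path adds its verifier score to its final answer."""
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical sampled paths for one question: (final answer, verifier score).
# The scores are made-up placeholders, not outputs of a real verifier.
paths = [
    ("18", 0.91),
    ("18", 0.85),
    ("20", 0.30),
    ("20", 0.25),
    ("20", 0.20),
]

print(majority_vote(paths))           # the answer reached by the most paths
print(verifier_weighted_vote(paths))  # the answer with the highest total score
```

In this toy input three low-scored paths agree on one answer while two high-scored paths agree on another, so the two voting rules pick different answers. A step-aware variant would, under the same assumptions, score each reasoning step and combine the step scores into the per-path weight.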
Related papers
- TypedThinker: Typed Thinking Improves Large Language Model Reasoning [44.8904486513791]
We propose TypedThinker, a framework that enhances Large Language Models' problem-solving abilities.
TypedThinker addresses two key challenges: selecting appropriate reasoning types for given problems and effectively implementing specific reasoning types.
Experimental results demonstrate significant improvements over baseline models, with accuracy increases of 3.4% for Mistral 7B and 16.7% for LLaMA3 8B.
arXiv Detail & Related papers (2024-10-02T18:54:45Z)
- LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
The LingOly benchmark is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z)
- Large Language Models are Contrastive Reasoners [8.427805316635318]
We show how contrastive prompting significantly improves the ability of large language models to perform complex reasoning.
Experiments on various large language models show that zero-shot contrastive prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
Our method not only surpasses zero-shot CoT and few-shot CoT in most arithmetic and commonsense reasoning tasks but also can seamlessly integrate with existing prompting methods.
arXiv Detail & Related papers (2024-03-13T03:15:05Z)
- Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting [0.2552922646705803]
The Chain of Thought (CoT) prompting method has been proposed as a means to enhance LLMs' proficiency in complex reasoning tasks.
The primary aim of this research is to assess how well four language models can grade reflective essays of third-year medical students.
arXiv Detail & Related papers (2023-09-30T06:25:27Z)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z)
- Complexity-Based Prompting for Multi-Step Reasoning [72.0057198610614]
We study the task of prompting large-scale language models to perform multi-step reasoning.
A central question is which reasoning examples make the most effective prompts.
We propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning.
arXiv Detail & Related papers (2022-10-03T05:33:27Z)
- Reasoning Like Program Executors [48.819113224699976]
POET empowers language models to harvest the reasoning knowledge possessed in program executors via a data-driven approach.
POET can significantly boost model performance on natural language reasoning.
POET opens a new direction for reasoning-enhancement pre-training.
arXiv Detail & Related papers (2022-01-27T12:28:24Z)
- CS-NLP team at SemEval-2020 Task 4: Evaluation of State-of-the-art NLP Deep Learning Architectures on Commonsense Reasoning Task [3.058685580689605]
We describe our entry in the SemEval-2020 Task 4 competition: the Commonsense Validation and Explanation (ComVE) challenge.
Our system uses prepared labeled textual datasets that were manually curated for three different natural language inference subtasks.
For the second subtask, which is to select the reason why a statement does not make sense, we rank among the top six teams (93.7%) out of 27 participants.
arXiv Detail & Related papers (2020-05-17T13:20:10Z)
- A Simple Language Model for Task-Oriented Dialogue [61.84084939472287]
SimpleTOD is a simple approach to task-oriented dialogue that uses a single, causal language model trained on all sub-tasks recast as a single sequence prediction problem.
This allows SimpleTOD to fully leverage transfer learning from pre-trained, open domain, causal language models such as GPT-2.
arXiv Detail & Related papers (2020-05-02T11:09:27Z)
- Learning to Learn Morphological Inflection for Resource-Poor Languages [105.11499402984482]
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters.
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.