Large Language Models are Zero-Shot Reasoners
- URL: http://arxiv.org/abs/2205.11916v1
- Date: Tue, 24 May 2022 09:22:26 GMT
- Title: Large Language Models are Zero-Shot Reasoners
- Authors: Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke
Iwasawa
- Abstract summary: Chain of thought (CoT) prompting is a technique for eliciting complex multi-step reasoning through step-by-step answer examples.
We show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer.
Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances.
- Score: 28.6899375595088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained large language models (LLMs) are widely used in many sub-fields of
natural language processing (NLP) and generally known as excellent few-shot
learners with task-specific exemplars. Notably, chain of thought (CoT)
prompting, a recent technique for eliciting complex multi-step reasoning
through step-by-step answer examples, achieved the state-of-the-art
performances in arithmetics and symbolic reasoning, difficult system-2 tasks
that do not follow the standard scaling laws for LLMs. While these successes
are often attributed to LLMs' ability for few-shot learning, we show that LLMs
are decent zero-shot reasoners by simply adding ``Let's think step by step''
before each answer. Experimental results demonstrate that our Zero-shot-CoT,
using the same single prompt template, significantly outperforms zero-shot LLM
performances on diverse benchmark reasoning tasks including arithmetics
(MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin
Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled
Objects), without any hand-crafted few-shot examples, e.g. increasing the
accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with
an off-the-shelf 175B parameter model. The versatility of this single prompt
across very diverse reasoning tasks hints at untapped and understudied
fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task
broad cognitive capabilities may be extracted through simple prompting. We hope
our work not only serves as the minimal strongest zero-shot baseline for the
challenging reasoning benchmarks, but also highlights the importance of
carefully exploring and analyzing the enormous zero-shot knowledge hidden
inside LLMs before crafting finetuning datasets or few-shot exemplars.
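For concreteness, below is a minimal sketch of the two-stage Zero-shot-CoT procedure described above: a first call elicits a reasoning chain with the "Let's think step by step" trigger, and a second call feeds the chain back to extract the final answer. The `generate` callable is a stand-in for whatever LLM completion API is available, and the answer-extraction phrase is one of several task-dependent variants; treat this as an illustration, not the paper's reference implementation.

```python
from typing import Callable

def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Two-stage Zero-shot-CoT: (1) elicit a reasoning chain with the trigger
    phrase, (2) feed the chain back and extract the final answer."""
    # Stage 1: reasoning extraction.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: answer extraction (the exact phrase is task-dependent, e.g.
    # "the answer (arabic numerals) is" for arithmetic benchmarks).
    answer_prompt = (
        f"{reasoning_prompt} {reasoning}\n"
        "Therefore, the answer (arabic numerals) is"
    )
    return generate(answer_prompt).strip()
```

By contrast, plain zero-shot prompting would ask for the answer directly, and Few-shot-CoT would prepend hand-crafted worked examples; the trigger phrase is the only task-independent addition here.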
Related papers
- LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of "r" characters in the word "strawberry".
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
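A rough sketch of the self-synthesis idea summarized in the SELF-GUIDE entry above, assuming a generic `generate` completion function: the student LLM proposes task inputs and then answers them, and the resulting pairs would feed a later finetuning step. The prompt wording and the crude filter are illustrative assumptions, not the paper's actual multi-stage pipeline.

```python
from typing import Callable

def self_synthesize(task_instruction: str,
                    generate: Callable[[str], str],
                    n_pairs: int = 32) -> list[tuple[str, str]]:
    """Have the student LLM propose task inputs and then answer them, yielding
    (input, output) pairs for finetuning the same model. Prompts and the
    empty-string filter are illustrative, not the paper's curation steps."""
    pairs = []
    for _ in range(n_pairs):
        x = generate(f"Instruction: {task_instruction}\n"
                     "Write one new example input for this task:").strip()
        y = generate(f"Instruction: {task_instruction}\nInput: {x}\nOutput:").strip()
        if x and y:                      # crude quality filter
            pairs.append((x, y))
    return pairs
```

The synthesized pairs would then be passed to whatever supervised finetuning stack is in use; that step is omitted here.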
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capabilities in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
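The Q* summary above does not spell out the mechanism; one common way to realize deliberative planning over decoding is best-first search over partial reasoning traces scored by a value heuristic, sketched below under that assumption. Both `propose_steps` (candidate next steps from an LLM) and `score` (a stand-in for a learned value model) are placeholders, and the `ANSWER:` termination convention is invented for illustration.

```python
import heapq
from typing import Callable

def best_first_reasoning(question: str,
                         propose_steps: Callable[[str, list[str]], list[str]],
                         score: Callable[[str, list[str]], float],
                         max_expansions: int = 50) -> list[str]:
    """Best-first search over partial reasoning traces. A trace ending in a
    step prefixed with 'ANSWER:' (an invented convention) is treated as
    terminal. Scores are negated because heapq pops the smallest item."""
    frontier = [(-score(question, []), [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, trace = heapq.heappop(frontier)
        if trace and trace[-1].startswith("ANSWER:"):
            return trace                                  # reached a final answer
        for step in propose_steps(question, trace):       # LLM-proposed next steps
            new_trace = trace + [step]
            heapq.heappush(frontier, (-score(question, new_trace), new_trace))
    return []                                             # search budget exhausted
```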
- Zero-Shot Question Answering over Financial Documents using Large Language Models [0.18749305679160366]
We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports.
We use novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain-specific language.
arXiv Detail & Related papers (2023-11-19T16:23:34Z)
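A minimal sketch of the program-generation idea in the financial question answering entry above: a zero-shot prompt asks the model to encode the multi-hop numerical reasoning as a short Python program, which is then executed to obtain the answer. The prompt wording and the `answer`-variable convention are assumptions, and executing model output this way is only safe inside a sandbox.

```python
from typing import Callable

def answer_with_program(question: str, report_excerpt: str,
                        generate: Callable[[str], str]):
    """Ask the model for a small Python program that computes the answer from
    the given figures, then execute it and read the `answer` variable."""
    prompt = (
        "Read the report excerpt and write a short Python program that computes "
        "the answer to the question. Store the result in a variable named "
        "`answer` and output only code.\n\n"
        f"Report: {report_excerpt}\nQuestion: {question}\nProgram:\n"
    )
    code = generate(prompt)
    namespace: dict = {}
    exec(code, namespace)    # executing model-written code is unsafe outside a sandbox
    return namespace.get("answer")
```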
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models [122.19845578690466]
Step-Back Prompting enables LLMs to perform abstraction, deriving high-level concepts and first principles from instances containing specific details.
Using the concepts and principles to guide reasoning, LLMs significantly improve their abilities in following a correct reasoning path towards the solution.
arXiv Detail & Related papers (2023-10-09T19:48:55Z)
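A two-pass sketch of the abstraction idea behind Step-Back Prompting as summarized above: first ask for the general principle behind the question, then answer the original question grounded on that principle. The prompt templates here are illustrative, not the paper's exact wording.

```python
from typing import Callable

def step_back_answer(question: str, generate: Callable[[str], str]) -> str:
    """Two-pass Step-Back-style prompting: abstract first, then reason."""
    # Pass 1: abstraction -- ask for the general concept behind the question.
    principle = generate(
        "What general concept or principle is needed to answer the following "
        f"question?\nQuestion: {question}\nPrinciple:"
    )
    # Pass 2: reasoning grounded on the retrieved principle.
    return generate(
        f"Principle: {principle}\nUsing this principle, answer the question.\n"
        f"Question: {question}\nAnswer:"
    )
```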
- Better Zero-Shot Reasoning with Self-Adaptive Prompting [39.54061907239995]
Modern large language models (LLMs) have demonstrated impressive capabilities at sophisticated tasks, often through step-by-step reasoning similar to humans.
We propose Consistency-based Self-adaptive Prompting (COSP), a novel prompt design method for LLMs.
We show that COSP improves performance by up to 15% compared to zero-shot baselines and matches or exceeds few-shot baselines on a range of reasoning tasks.
arXiv Detail & Related papers (2023-05-23T14:27:16Z)
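A rough sketch of the consistency-based selection behind COSP as summarized above: sample several zero-shot chain-of-thought outputs for unlabeled questions, keep the most self-consistent ones as pseudo-demonstrations, and prompt again with them in context. The majority-vote score, the "answer is the last line" convention, and the prompt wording are simplifying assumptions standing in for the paper's selection criteria.

```python
from collections import Counter
from typing import Callable

def select_demos(pool_questions: list[str],
                 generate: Callable[[str], str],
                 n_samples: int = 5, k: int = 2) -> list[str]:
    """For each unlabeled question, sample several zero-shot CoT outputs and
    keep the most self-consistent (question, rationale) pairs as pseudo-demos."""
    scored = []
    for q in pool_questions:
        prompt = f"Q: {q}\nA: Let's think step by step."
        outs = [o for o in (generate(prompt) for _ in range(n_samples)) if o.strip()]
        if not outs:
            continue
        finals = [o.strip().splitlines()[-1] for o in outs]   # assumed answer position
        answer, votes = Counter(finals).most_common(1)[0]
        rationale = next(o for o, f in zip(outs, finals) if f == answer)
        scored.append((votes / len(outs), f"Q: {q}\nA:{rationale}"))
    scored.sort(key=lambda t: t[0], reverse=True)             # most consistent first
    return [demo for _, demo in scored[:k]]

def answer_with_demos(question: str, demos: list[str],
                      generate: Callable[[str], str]) -> str:
    """Second pass: answer the target question with selected demos in context."""
    context = "\n\n".join(demos)
    return generate(f"{context}\n\nQ: {question}\nA: Let's think step by step.")
```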
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
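To illustrate the declarative paradigm described in the SatLM entry above, the snippet below shows the kind of specification the approach targets: a set of constraints handed to an off-the-shelf solver (Z3 here, via the z3-solver package) rather than an imperative program that computes the answer. The toy problem and its constraints are hand-written for illustration, not model output.

```python
# Declarative target representation: the model would emit constraints like
# these, and an off-the-shelf solver derives the final answer.
from z3 import Int, Solver, sat

# Toy problem: "Alice and Bob have 10 apples together; Alice has 2 more than
# Bob. How many apples does Bob have?"
alice, bob = Int("alice"), Int("bob")
s = Solver()
s.add(alice + bob == 10)       # total apples
s.add(alice == bob + 2)        # Alice has two more than Bob
s.add(alice >= 0, bob >= 0)    # counts are non-negative

if s.check() == sat:
    print(s.model()[bob])      # -> 4
```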
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which use an LLM to read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
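A minimal sketch of the program-aided pattern in the PAL entry above: the prompt contains an exemplar whose reasoning is a Python program with explanatory comments, the model continues with a program for the new question, and the solution step is offloaded to the interpreter. The exemplar wording and the `solution()` convention follow the general recipe but are written here for illustration, and `generate` is again a stand-in for an LLM completion API.

```python
from typing import Callable

# Illustrative PAL-style exemplar: the "reasoning" is a program whose comments
# restate the problem and whose code carries the arithmetic.
EXEMPLAR = '''Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
# solution in Python:
def solution():
    balls_initial = 5          # Roger starts with 5 balls
    balls_bought = 2 * 3       # 2 cans of 3 balls each
    return balls_initial + balls_bought
'''

def pal_answer(question: str, generate: Callable[[str], str]):
    """Prompt with a program-as-reasoning exemplar, then offload the solution
    step to the Python interpreter by executing the generated program."""
    prompt = f"{EXEMPLAR}\nQ: {question}\n# solution in Python:\n"
    code = generate(prompt)                   # expected to define solution()
    namespace: dict = {}
    exec(code, namespace)                     # unsafe outside a sandbox
    return namespace["solution"]()
```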
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.