When do you need Chain-of-Thought Prompting for ChatGPT?
- URL: http://arxiv.org/abs/2304.03262v2
- Date: Tue, 18 Apr 2023 14:45:18 GMT
- Title: When do you need Chain-of-Thought Prompting for ChatGPT?
- Authors: Jiuhai Chen, Lichang Chen, Heng Huang, Tianyi Zhou
- Abstract summary: Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs). It is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT.
- Score: 87.45382888430643
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs). For example, simply adding the CoT instruction "Let's think step-by-step" to each input query of the MultiArith dataset improves GPT-3's accuracy from 17.7% to 78.7%. However, it is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks, such as arithmetic reasoning, while remaining effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and can generate CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT, has memorized the instruction, and thus implicitly follows it when applied to the same queries, even without the CoT prompt. Our analysis reflects a potential risk of overfitting/bias toward instructions introduced during IFT, which is becoming more common in the training of LLMs. In addition, it indicates possible leakage of the pretraining recipe; e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results of ChatGPT on a variety of reasoning tasks and shed novel insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
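To make the setup concrete, here is a minimal Python sketch of the zero-shot CoT comparison described in the abstract: the same query is issued once as a plain prompt and once with the "Let's think step-by-step" trigger appended. The `ask_llm` function and the example question are illustrative assumptions, not code or data from the paper.

```python
# Minimal sketch of the zero-shot CoT comparison from the abstract.
# `ask_llm` is a hypothetical placeholder for any chat-completion client;
# only the prompt construction mirrors the setup described above.

COT_TRIGGER = "Let's think step-by-step."

def build_prompts(question: str) -> tuple[str, str]:
    """Return a plain prompt and a CoT-instructed prompt for one query."""
    plain = f"Q: {question}\nA:"
    cot = f"Q: {question}\nA: {COT_TRIGGER}"
    return plain, cot

def ask_llm(prompt: str) -> str:
    """Hypothetical model call; wire this to a real API client to reproduce."""
    raise NotImplementedError

if __name__ == "__main__":
    # An illustrative MultiArith-style word problem (not taken from the dataset).
    question = (
        "There are 64 students trying out for the school's trivia teams. "
        "If 36 of them didn't get picked and the rest were put in 4 groups, "
        "how many students would be in each group?"
    )
    plain, cot = build_prompts(question)
    print(plain, "\n---\n", cot, sep="")
```

Comparing the two completions per query (and checking whether the plain prompt already yields step-by-step reasoning) is the kind of probe the abstract's memorization argument relies on.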
Related papers
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785] (2024-09-18)
  Chain-of-thought (CoT) prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
- Can Separators Improve Chain-of-Thought Prompting? [10.398343318429367] (2024-02-16)
  Chain-of-thought (CoT) prompting is a simple and effective method for improving the reasoning capabilities of Large Language Models (LLMs). Inspired by human cognition, we introduce COT-SEP, a method that strategically employs separators at the end of each exemplar in CoT prompting (see the first sketch after this list).
- Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning [31.110005898556892] (2023-12-14)
  Large Language Models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. We propose CoT-Influx, a novel approach that pushes the boundary of few-shot Chain-of-Thought (CoT) learning. CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples.
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589] (2023-11-11)
  We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
- Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following [44.701091969256055] (2023-02-28)
  We present our finding that prepending a Task-Agnostic Prefix Prompt (TAPP) to the input improves the instruction-following ability of various Large Language Models (LLMs) during inference (see the second sketch after this list). We observe that both base LLMs (i.e., not fine-tuned to follow instructions) and instruction-tuned models benefit from TAPP, resulting in 34.58% and 12.26% improvement on average, respectively.
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092] (2023-02-08)
  We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual, and multimodal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. ChatGPT is 63.41% accurate on average across 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825] (2023-02-08)
  Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community. It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters [82.84696222087396] (2022-12-20)
  Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs). We show that CoT reasoning is possible even with invalid demonstrations.
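As referenced in the COT-SEP entry above, here is a minimal sketch of separator-augmented few-shot CoT prompting. The separator string and the exemplar are illustrative assumptions; the actual separators and exemplars used in that paper may differ.

```python
# Minimal sketch of separator-augmented few-shot CoT prompting in the spirit
# of COT-SEP. The separator and exemplar below are illustrative assumptions.

SEPARATOR = "\n###\n"  # assumed separator appended after each exemplar

EXEMPLARS = [
    ("There are 3 cars in the lot and 2 more arrive. How many cars are in the lot?",
     "There are 3 cars originally. 2 more arrive, so 3 + 2 = 5. The answer is 5."),
]

def build_cot_sep_prompt(question: str) -> str:
    """Concatenate CoT exemplars, marking the end of each with a separator."""
    parts = [f"Q: {q}\nA: {a}{SEPARATOR}" for q, a in EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "".join(parts)

print(build_cot_sep_prompt("If I have 4 apples and eat 1, how many apples remain?"))
```

The idea is that an explicit boundary after each exemplar helps the model segment the demonstrations before answering the final query.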
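Similarly, for the TAPP entry above, the second sketch shows the basic mechanics of prepending one fixed, task-agnostic prefix to arbitrary instructions. The prefix wording here is an illustrative assumption, not the prefix evaluated in that paper.

```python
# Minimal sketch of a Task-Agnostic Prefix Prompt (TAPP): one fixed prefix is
# prepended to every input regardless of the task. The prefix text below is
# an illustrative assumption, not the prefix used in the paper.

TAPP_PREFIX = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)

def with_tapp(instruction: str) -> str:
    """Prepend the fixed task-agnostic prefix to an arbitrary instruction."""
    return f"{TAPP_PREFIX}Instruction: {instruction}\nResponse:"

# The same prefix is reused across unrelated tasks:
print(with_tapp("Translate 'bonjour' to English."))
print(with_tapp("Add 17 and 25."))
```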