The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
- URL: http://arxiv.org/abs/2504.05081v1
- Date: Mon, 07 Apr 2025 13:51:06 GMT
- Title: The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
- Authors: Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Y. Wong, Simon See
- Abstract summary: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models. We show that CoT consistently underperforms direct answering across varying model scales and benchmark complexities. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL.
- Score: 39.613595533503144
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning (disrupted by the increased contextual distance of CoT rationales) often compensates, delivering correct answers despite flawed rationales. This duality explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.
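To make the comparison concrete, below is a minimal sketch of the two prompt formats the abstract contrasts. The toy pattern (word-to-letter-count), the hand-written rationale template, and the helper names are illustrative assumptions, not the authors' experimental code.

```python
# A minimal sketch (assumed setup, not the authors' code) contrasting the two
# prompting strategies compared in the paper on a pattern-based ICL task:
# direct answering vs. Chain-of-Thought (CoT).

# Toy pattern-based ICL task: the hidden rule maps a word to its letter count.
DEMONSTRATIONS = [("cat", "3"), ("house", "5"), ("sun", "3")]
QUERY = "planet"


def build_direct_prompt(demos, query):
    """Direct answering: demonstrations expose only input-output pairs,
    leaving the pattern to be inferred implicitly."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def build_cot_prompt(demos, query):
    """CoT prompting: each demonstration carries an explicit rationale,
    which also widens the contextual distance between pattern and answer."""
    blocks = [
        f"Input: {x}\n"
        f"Reasoning: the rule appears to count the letters in '{x}'.\n"
        f"Output: {y}"
        for x, y in demos
    ]
    blocks.append(f"Input: {query}\nReasoning:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Print both prompt variants so they can be sent to any LLM API
    # and their answers compared.
    print(build_direct_prompt(DEMONSTRATIONS, QUERY))
    print("\n--- CoT variant ---\n")
    print(build_cot_prompt(DEMONSTRATIONS, QUERY))
```

Under direct answering the model must complete `Output:` immediately, whereas the CoT variant asks it to verbalize a rationale first; the paper's finding is that on pattern-based tasks of this kind the explicit rationale tends to hurt rather than help.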
Related papers
- Chain-of-Thought Prompting for Out-of-Distribution Samples: A Latent-Variable Study [5.236910203359897]
Chain-of-Thought (CoT) prompting has emerged as a powerful technique to improve in-context learning in large language models.
We extend a latent-variable framework for CoT prompting and study its behavior on two prototypical out-of-distribution (OOD) scenarios.
Our experiments demonstrate that CoT inference generalizes effectively to OOD samples whose latent variables closely resemble those seen during training, but its performance degrades as this similarity decreases.
arXiv Detail & Related papers (2025-04-17T14:59:29Z)
- Investigating the Robustness of Deductive Reasoning with Large Language Models [7.494617747914778]
Large Language Models (LLMs) have been shown to achieve impressive results for many reasoning-based Natural Language Processing (NLP) tasks.
It remains unclear to what extent LLMs are robust on logical deduction tasks, in both informal and autoformalisation settings.
arXiv Detail & Related papers (2025-02-04T17:16:51Z)
- Rethinking Thinking Tokens: Understanding Why They Underperform in Practice [6.102559098873098]
Thinking Tokens (TT) have been proposed as an unsupervised method to facilitate reasoning in language models.
We show that TTs marginally improve performance and consistently underperform compared to Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2024-11-18T08:34:38Z)
- Concept-Reversed Winograd Schema Challenge: Evaluating and Improving Robust Reasoning in Large Language Models via Abstraction [48.20754793102953]
We evaluate the extent to which Large Language Models (LLMs) perform robust reasoning instead of relying on superficial logical chains.
We propose a new evaluation dataset, the Concept-Reversed Winograd Schema Challenge (CR-WSC), based on the well-known Winograd Schema Challenge (WSC) dataset.
arXiv Detail & Related papers (2024-10-15T20:19:27Z)
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought [61.588465852846646]
Chain-of-Thought (CoT) reasoning has emerged as a promising approach for enhancing the performance of large language models (LLMs).
In this work, we introduce a novel reasoning boundary framework (RBF) to address these challenges.
arXiv Detail & Related papers (2024-10-08T05:26:28Z)
- Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data [53.433309883370974]
This work explores the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance Large Language Models' reasoning capabilities.
Our experiments, conducted on two established natural language reasoning tasks, demonstrate that supervised fine-tuning with synthetic graph-based reasoning data effectively enhances LLMs' reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
arXiv Detail & Related papers (2024-09-19T03:39:09Z)
- On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models [25.029579061612456]
Large Language Models (LLMs) are increasingly being employed in real-world applications in critical domains such as healthcare.
It is important to ensure that the Chain-of-Thought (CoT) reasoning generated by these models faithfully captures their underlying behavior.
arXiv Detail & Related papers (2024-06-15T13:16:44Z)
- Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations.
We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
arXiv Detail & Related papers (2023-04-23T13:54:39Z)
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters [82.84696222087396]
Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs).
We show that CoT reasoning is possible even with invalid demonstrations.
arXiv Detail & Related papers (2022-12-20T05:20:54Z)