Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- URL: http://arxiv.org/abs/2410.21333v3
- Date: Fri, 08 Nov 2024 13:11:58 GMT
- Title: Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- Authors: Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths
- Abstract summary: Chain-of-thought (CoT) has become a widely used strategy for working with large language and multimodal models.
We identify characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology.
We find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance when using inference-time reasoning.
- Score: 9.542503507653494
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Chain-of-thought (CoT) prompting has become a widely used strategy for working with large language and multimodal models. While CoT has been shown to improve performance across many tasks, determining the settings in which it is effective remains an ongoing effort. In particular, it is still an open question in what settings CoT systematically reduces model performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, looking at cases where (i) verbal thinking or deliberation hurts performance in humans, and (ii) the constraints governing human performance generalize to language models. Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions. In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for OpenAI o1-preview compared to GPT-4o) when using inference-time reasoning compared to zero-shot counterparts. We also identify three tasks that satisfy condition (i) but not (ii), and find that while verbal thinking reduces human performance in these tasks, CoT retains or increases model performance. Overall, our results show that while there is not an exact parallel between the cognitive processes of models and those of humans, considering cases where thinking has negative consequences for human performance can help us identify settings where it negatively impacts models. By connecting the literature on human deliberation with evaluations of CoT, we offer a new tool that can be used in understanding the impact of prompt choices and inference-time reasoning.
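The abstract compares zero-shot prompting against inference-time reasoning (CoT) on the same tasks. A minimal sketch of that comparison harness is below; `query_model` is a hypothetical stub standing in for a real model API (substitute your own client), and the prompt templates and dataset are illustrative assumptions, not the paper's actual materials.

```python
# Sketch of a zero-shot vs. chain-of-thought accuracy comparison,
# in the spirit of the paper's evaluation. All names here are
# illustrative: `query_model` is a placeholder for a real LLM call.

ZERO_SHOT = "Answer with the label only.\n\nInput: {x}\nLabel:"
COT = ("Think step by step, then give your final answer on the "
       "last line as 'Label: <label>'.\n\nInput: {x}")

def query_model(prompt: str) -> str:
    # Hypothetical stub; a real implementation would call a model API here.
    return "Label: A"

def extract_label(response: str) -> str:
    # Take the text after the last 'Label:' marker, if present.
    marker = "Label:"
    idx = response.rfind(marker)
    return response[idx + len(marker):].strip() if idx != -1 else response.strip()

def accuracy(prompt_template: str, dataset: list[tuple[str, str]]) -> float:
    # Fraction of examples where the extracted label matches the gold label.
    correct = sum(
        extract_label(query_model(prompt_template.format(x=x))) == y
        for x, y in dataset
    )
    return correct / len(dataset)

# Toy dataset of (input, gold label) pairs.
dataset = [("example input", "A")]
print(accuracy(ZERO_SHOT, dataset), accuracy(COT, dataset))
```

On the implicit-pattern tasks the paper studies, the interesting outcome is the case where the CoT accuracy comes out *below* the zero-shot accuracy.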
Related papers
- Subject Invariant Contrastive Learning for Human Activity Recognition [14.10876324116018]
We introduce Subject-Invariant Contrastive Learning (SICL) to improve generalization in human activity recognition. SICL re-weights negative pairs drawn from the same subject to suppress subject-specific cues and emphasize activity-specific information. We show that SICL improves performance by up to 11% over traditional contrastive learning methods.
arXiv Detail & Related papers (2025-07-04T01:55:33Z) - Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing [125.75923987618977]
We propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model. It uses a dynamic programming algorithm to optimize cognitive representations based on the difficulty of the questions and the performance intervals between them. It provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states.
arXiv Detail & Related papers (2025-06-03T14:44:48Z) - Improving Question Embeddings with Cognitive Representation Optimization for Knowledge Tracing [77.14348157016518]
Knowledge Tracing (KT) aims to track changes in students' knowledge status and predict their future answers based on their historical answer records.
Current research on KT modeling focuses on predicting students' future performance based on existing, unupdated records of student learning interactions.
We propose a Cognitive Representation Optimization for Knowledge Tracing model, which utilizes a dynamic programming algorithm to optimize the structure of cognitive representations.
arXiv Detail & Related papers (2025-04-05T09:32:03Z) - Large (Vision) Language Models are Unsupervised In-Context Learners [14.930827851769276]
We introduce a joint inference framework for fully unsupervised adaptation.
Unlike zero-shot inference, joint inference makes predictions simultaneously for all inputs in a given task.
Our experiments demonstrate substantial improvements over the standard zero-shot approach.
arXiv Detail & Related papers (2025-04-03T07:33:02Z) - The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks [96.27754404942364]
Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs. We observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement.
arXiv Detail & Related papers (2025-02-12T09:23:26Z) - Boosting Self-Efficacy and Performance of Large Language Models via Verbal Efficacy Stimulations [10.209999691197948]
This paper introduces Verbal Efficacy Stimulations (VES). VES comprises three types of verbal prompts: encouraging, provocative, and critical, addressing six aspects such as helpfulness and competence. The experimental results show that the three types of VES improve the performance of LLMs on most tasks, and the most effective VES varies for different models.
arXiv Detail & Related papers (2025-02-10T16:54:03Z) - Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm [9.577716124021029]
It has been argued that GPT-3.5's declining performance on 2-back and 3-back tasks reflects a working memory capacity limit similar to humans.
By analyzing a range of open-source language models of varying performance levels on these tasks, we show that the poor performance instead reflects a limitation in task comprehension and task set maintenance.
arXiv Detail & Related papers (2024-12-24T03:06:52Z) - Can foundation models actively gather information in interactive environments to test hypotheses? [56.651636971591536]
We introduce a framework in which a model must determine the factors influencing a hidden reward function.
We investigate whether approaches such as self-correction and increased inference time improve information gathering efficiency.
arXiv Detail & Related papers (2024-12-09T12:27:21Z) - The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities.
TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks [38.63497972682599]
This study investigates the performance of alignment methods across three scenarios: keeping the Supervised Fine-Tuning (SFT) part, skipping the SFT part, and utilizing an instruction-tuned model.
Our evaluation spans a range of tasks including dialogue systems, reasoning, mathematical problem-solving, question answering, truthfulness, and multi-task understanding.
Key observations reveal that alignment methods achieve optimal performance with smaller training data subsets, exhibit limited effectiveness in reasoning tasks yet significantly impact mathematical problem-solving, and that employing an instruction-tuned model notably influences truthfulness.
arXiv Detail & Related papers (2024-04-23T03:55:01Z) - Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z) - Prompt Perturbation Consistency Learning for Robust Language Models [47.021022978847036]
Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks.
We show that fine-tuning sufficiently large LLMs can produce intent classification and slot filling (IC-SF) performance comparable to discriminative models.
We propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples.
arXiv Detail & Related papers (2024-02-24T15:00:58Z) - Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance large language models alignment.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z) - A Wide Evaluation of ChatGPT on Affective Computing Tasks [32.557383931586266]
We study the capabilities of the ChatGPT models, namely GPT-4 and GPT-3.5, on 13 affective computing problems.
We compare ChatGPT against more traditional NLP methods, such as end-to-end recurrent neural networks and transformers.
The results demonstrate the emergent abilities of the ChatGPT models on a wide range of affective computing problems.
arXiv Detail & Related papers (2023-08-26T16:10:30Z) - Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z) - OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities [19.83434949066066]
This paper introduces a novel intelligent framework, referred to as OlaGPT.
OlaGPT builds on a cognitive architecture framework and proposes to simulate certain aspects of human cognition.
The framework involves approximating different cognitive modules, including attention, memory, reasoning, learning, and corresponding scheduling and decision-making mechanisms.
arXiv Detail & Related papers (2023-05-23T09:36:51Z) - On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z) - Planning for Sample Efficient Imitation Learning [52.44953015011569]
Current imitation algorithms struggle to achieve high performance and high in-environment sample efficiency simultaneously.
We propose EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously.
Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency.
arXiv Detail & Related papers (2022-10-18T05:19:26Z) - An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z) - Towards Unbiased Visual Emotion Recognition via Causal Intervention [63.74095927462]
We propose a novel Interventional Emotion Recognition Network (IERN) to alleviate the negative effects brought by the dataset bias.
A series of designed tests validate the effectiveness of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms other state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-26T10:40:59Z) - A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z) - Modeling Score Distributions and Continuous Covariates: A Bayesian Approach [8.772459063453285]
We develop a generative model of the match and non-match score distributions over continuous covariates.
We use mixture models to capture arbitrary distributions and local basis functions.
Three experiments demonstrate the accuracy and effectiveness of our approach.
arXiv Detail & Related papers (2020-09-21T02:41:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.