Chain-of-Thought Prompting for Out-of-Distribution Samples: A Latent-Variable Study
- URL: http://arxiv.org/abs/2504.12991v1
- Date: Thu, 17 Apr 2025 14:59:29 GMT
- Title: Chain-of-Thought Prompting for Out-of-Distribution Samples: A Latent-Variable Study
- Authors: Yu Wang, Fu-Chieh Chang, Pei-Yuan Wu,
- Abstract summary: Chain-of-Thought (CoT) prompting has emerged as a powerful technique to improve in-context learning in large language models.<n>We extend a latent-variable framework for CoT prompting and study its behavior on two prototypical out-of-distribution (OOD) scenarios.<n>Our experiments demonstrate that CoT inference generalizes effectively to OOD samples whose latent variables closely resemble those seen during training, but its performance degrades as this similarity decreases.
- Score: 5.236910203359897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful technique to improve in-context learning (ICL) in large language models (LLMs) by breaking complex reasoning into intermediate steps. However, the ability of CoT to generalize under distribution shift remains poorly understood. In this work, we extend a latent-variable framework for CoT prompting and study its behavior on two prototypical out-of-distribution (OOD) scenarios: (i) the latent variables for CoT steps are permuted into novel combinations, and (ii) the latent variables uniformly scaled by a factor. Our experiments demonstrate that CoT inference generalizes effectively to OOD samples whose latent variables closely resemble those seen during training, but its performance degrades as this similarity decreases. These findings provide foundational insights into the strengths and limitations of CoT prompting under OOD conditions and suggest directions for developing more resilient reasoning strategies in future LLMs.
Related papers
- Graded Transformers: A Symbolic-Geometric Approach to Structured Learning [0.0]
We introduce a novel class of sequence models that embed inductive biases through grading transformations on vector spaces.<n>The Graded Transformer holds transformative potential for hierarchical learning and neurosymbolic reasoning.<n>This work advances structured deep learning by fusing geometric and algebraic principles with attention mechanisms.
arXiv Detail & Related papers (2025-07-27T02:34:08Z) - Transformers Are Universally Consistent [14.904264782690639]
We show that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Least Squares regression.<n>We derive upper bounds on the empirical error which, in the regime, decay at a provable rate of $mathcalO(t-1/2d)$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality.
arXiv Detail & Related papers (2025-05-30T12:39:26Z) - One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks [11.06955946904705]
We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction.<n>Our work addresses gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention.
arXiv Detail & Related papers (2025-05-21T01:26:44Z) - The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning [39.613595533503144]
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models.<n>We show that CoT consistently underperforms direct answering across varying model scales and benchmark complexities.<n>Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL.
arXiv Detail & Related papers (2025-04-07T13:51:06Z) - Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting.<n>In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z) - Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization [35.16980045900664]
Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs)
This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization.
arXiv Detail & Related papers (2025-02-25T15:04:17Z) - Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization [9.191236388401226]
Training large language models with high-quality Chain-of-Thought (CoT) annotations has become a widely adopted strategy.
We show that training with CoT markedly improves reasoning generalization, extending it from in-distribution (ID) to both ID and out-of-distribution (OOD) scenarios, while also speeding up convergence.
Our findings elucidate the mechanisms underlying explicit CoT training and offer critical insights into tuning strategies for LLMs to achieve robust generalization.
arXiv Detail & Related papers (2025-02-07T05:21:13Z) - From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency [17.612497960364916]
Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLM)<n>We demonstrate that CoT can substantially improve sample efficiency even when representation power is sufficient.<n>We show that CoT simplifies the learning process by introducing sparse dependencies among input tokens, and leads to a sparse and interpretable attention.
arXiv Detail & Related papers (2024-10-07T19:45:09Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods [59.779795063072655]
Chain-of-Thought (CoT) prompting and its variants have gained popularity as effective methods for solving multi-step reasoning problems.
We analyze CoT prompting from a statistical estimation perspective, providing a comprehensive characterization of its sample complexity.
arXiv Detail & Related papers (2024-08-25T04:07:18Z) - Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption.
We analyze how magnitude-based models affect generalization while improving adaption.
We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs)
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z) - Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations.
We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
arXiv Detail & Related papers (2023-04-23T13:54:39Z) - Towards Understanding Chain-of-Thought Prompting: An Empirical Study of
What Matters [82.84696222087396]
Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs)
We show that CoT reasoning is possible even with invalid demonstrations.
arXiv Detail & Related papers (2022-12-20T05:20:54Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - Existence, Stability and Scalability of Orthogonal Convolutional Neural
Networks [1.0742675209112622]
Imposing orthogonality on the layers of neural networks is known to facilitate the learning by limiting the exploding/vanishing of the gradient; decorrelate the features; improve the robustness.
This paper studies the theoretical properties of orthogonal convolutional layers.
arXiv Detail & Related papers (2021-08-12T09:30:53Z) - Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable semi-implicit extrapolational (SIVI)
Our method maps SIVI's evidence to a rigorous inference of lower gradient values.
arXiv Detail & Related papers (2021-01-15T11:39:09Z) - Is Supervised Syntactic Parsing Beneficial for Language Understanding?
An Empirical Investigation [71.70562795158625]
Traditional NLP has long held (supervised) syntactic parsing necessary for successful higher-level semantic language understanding (LU)
Recent advent of end-to-end neural models, self-supervised via language modeling (LM), and their success on a wide range of LU tasks, questions this belief.
We empirically investigate the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks.
arXiv Detail & Related papers (2020-08-15T21:03:36Z) - Generalization Properties of Optimal Transport GANs with Latent
Distribution Learning [52.25145141639159]
We study how the interplay between the latent distribution and the complexity of the pushforward map affects performance.
Motivated by our analysis, we advocate learning the latent distribution as well as the pushforward map within the GAN paradigm.
arXiv Detail & Related papers (2020-07-29T07:31:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.