The Kinetics of Reasoning: How Chain-of-Thought Shapes Learning in Transformers?
- URL: http://arxiv.org/abs/2510.25791v1
- Date: Tue, 28 Oct 2025 20:14:26 GMT
- Title: The Kinetics of Reasoning: How Chain-of-Thought Shapes Learning in Transformers?
- Authors: Zihan Pengmei, Costas Mavromatis, Zhengyuan Shen, Yunyi Zhang, Vassilis N. Ioannidis, Huzefa Rangwala
- Abstract summary: Chain-of-thought (CoT) supervision can substantially improve transformer performance. We investigate these learning dynamics through the lens of grokking by pretraining transformers on symbolic reasoning tasks.
- Score: 25.29458951592086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-thought (CoT) supervision can substantially improve transformer performance, yet the mechanisms by which models learn to follow and benefit from CoT remain poorly understood. We investigate these learning dynamics through the lens of grokking by pretraining transformers on symbolic reasoning tasks with tunable algorithmic complexity and controllable data composition to study their generalization. Models were trained under two settings: (i) producing only final answers, and (ii) emitting explicit CoT traces before answering. Our results show that while CoT generally improves task performance, its benefits depend on task complexity. To quantify these effects, we model accuracy as a function of the logarithm of training steps with a three-parameter logistic curve, revealing how the learning speed and shape vary with task complexity, data distribution, and the presence of CoT supervision. We also uncover a transient trace unfaithfulness phase: early in training, models often produce correct answers while skipping or contradicting CoT steps, before later aligning their reasoning traces with answers. Empirically, we (1) demonstrate that CoT accelerates generalization but does not overcome tasks with higher algorithmic complexity, such as finding list intersections; (2) introduce a kinetic modeling framework for understanding transformer learning; (3) characterize trace faithfulness as a dynamic property that emerges over training; and (4) show that CoT mechanistically alters internal transformer computation.
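As a concrete illustration of the kinetic fit described in the abstract, the sketch below fits accuracy against the logarithm of training steps with a three-parameter logistic curve. It is a minimal, assumption-laden example rather than the authors' released code: the parameter names (plateau `a`, midpoint `t_half`, slope `k`) and the synthetic accuracy values are illustrative.

```python
# Minimal sketch (not the paper's code) of fitting accuracy vs. log(training steps)
# with a three-parameter logistic curve, as described in the abstract.
import numpy as np
from scipy.optimize import curve_fit

def logistic_accuracy(log_steps, a, t_half, k):
    # Three-parameter logistic: plateau a, midpoint t_half, slope k (in log-step units).
    return a / (1.0 + np.exp(-k * (log_steps - t_half)))

# Hypothetical accuracy checkpoints logged during pretraining (illustrative values).
steps = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
accuracy = np.array([0.05, 0.07, 0.15, 0.45, 0.80, 0.93, 0.95])

log_steps = np.log10(steps)
(a, t_half, k), _ = curve_fit(
    logistic_accuracy, log_steps, accuracy,
    p0=[1.0, float(np.median(log_steps)), 1.0],
)
print(f"plateau={a:.2f}, midpoint=10^{t_half:.2f} steps, slope={k:.2f}")
```

Under this kind of fit, the midpoint locates the (log-scale) step count at which generalization takes off, and the slope summarizes how abrupt the grokking-like transition is.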
Related papers
- Incremental Learning of Sparse Attention Patterns in Transformers [29.54151079577767]
This paper introduces a high-order Markov chain task to investigate how transformers learn to integrate information from multiple past positions. We identify a shift in learning dynamics from competitive, where heads converge on the most statistically dominant pattern, to cooperative, where heads specialize in distinct patterns.
arXiv Detail & Related papers (2026-02-22T12:16:06Z) - Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems [59.94955550958074]
We study a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. We show that expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that training drives the expected prediction loss to near zero in $O(\log(\epsilon^{-1}))$ steps, significantly improving over the $O(\epsilon^{-1})$ rate for a single transformer.
arXiv Detail & Related papers (2025-10-30T21:07:36Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check'. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z) - Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking [41.3496135369579]
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks. There is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT.
arXiv Detail & Related papers (2025-02-27T14:24:51Z) - Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization [35.16980045900664]
Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization.
arXiv Detail & Related papers (2025-02-25T15:04:17Z) - Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Transformers Provably Solve Parity Efficiently with Chain of Thought [40.78854925996]
This work provides the first theoretical analysis of training transformers to solve complex problems. We consider training a one-layer transformer to solve the fundamental $k$-parity problem.
arXiv Detail & Related papers (2024-10-11T08:55:17Z) - Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527835]
Chain-of-thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps. We show that in-context learning, viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does.
arXiv Detail & Related papers (2024-10-03T03:12:51Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.