Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
- URL: http://arxiv.org/abs/2509.23365v2
- Date: Mon, 06 Oct 2025 00:40:29 GMT
- Title: Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
- Authors: Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
- Abstract summary: We theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem. Our analysis reveals that during training using continuous thought, the index-matching logit will first increase and then remain bounded under mild assumptions.
- Score: 64.43689151961054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and a subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned from gradient-based training methods. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges during training in two training stages -- (i) a thought-generation stage that autoregressively expands the continuous thought, and (ii) a prediction stage that converts the thought into the final answer. Our analysis reveals that during training using continuous thought, the index-matching logit, an important quantity which reflects the strength of the model's local search ability, will first increase and then remain bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during the reasoning process: the model will exploit local problem structures to identify plausible search traces, and assign comparable weights to multiple such traces to explore when it is uncertain about which solution is correct, which results in superposition. Our experimental results tracking the growth of logits further validate our theory.
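The balance described in the abstract can be illustrated with a toy softmax calculation. This is a minimal sketch, not the paper's model: the function names (`attention_over_edges`, `expand_thought`), the example graph, and the choice of a single scalar `logit` for edges whose source node matches the current frontier are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_edges(edges, frontier, logit):
    """Hypothetical scoring rule: an edge gets `logit` if its source
    node is in the current frontier, and 0 otherwise."""
    scores = [logit if src in frontier else 0.0 for src, _ in edges]
    return softmax(scores)

def expand_thought(edges, frontier, logit):
    """One toy 'continuous thought' step: each target node receives the
    attention weight of the edges that point to it."""
    weights = attention_over_edges(edges, frontier, logit)
    thought = {}
    for (_, dst), w in zip(edges, weights):
        thought[dst] = thought.get(dst, 0.0) + w
    return thought

# Assumed example graph: two edges leave the frontier node "a",
# one edge ("d" -> "e") is unrelated to the frontier.
edges = [("a", "b"), ("a", "c"), ("d", "e")]
thought = expand_thought(edges, frontier={"a"}, logit=3.0)
for node, weight in sorted(thought.items()):
    print(node, round(weight, 4))
```

In this sketch the two nodes reachable from the frontier receive equal weight, so the resulting vector is a mixture of both candidate traces rather than a commitment to one, while the unrelated node keeps a small but nonzero weight. Raising `logit` shrinks that leftover weight; lowering it spreads weight more evenly across all edges.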
Related papers
- Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure [58.89643769707751]
We study latent chain-of-thought as a manipulable causal process in representation space. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing. These results motivate mode-conditional and stability-aware analyses as more reliable tools for interpreting and improving latent reasoning systems.
arXiv Detail & Related papers (2026-02-09T15:25:12Z)
- Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision [30.75583081407994]
Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. Under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length increases.
arXiv Detail & Related papers (2026-01-31T22:54:45Z)
- Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models [50.39102836928242]
We introduce a representational perspective to investigate the dynamics of the model's internal states. We discover that post-training yields only limited improvement in static initial representation quality.
arXiv Detail & Related papers (2026-01-31T15:23:33Z)
- Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck. Under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with sample complexity. We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z)
- How LLMs Learn to Reason: A Complex Network Perspective [14.638878448692493]
Training large language models with Reinforcement Learning from Verifiable Rewards exhibits a set of puzzling behaviors. We propose that these seemingly disparate phenomena can be explained using a single unifying theory. Our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
arXiv Detail & Related papers (2025-09-28T04:10:37Z)
- Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors [11.658759174514222]
Chain-of-Thought reasoning has emerged as a pivotal methodology for enhancing model inference capabilities. This paper explores the working mechanisms of Chain-of-Thought reasoning from the perspective of the dual relationship between in-context learning and pretrained priors.
arXiv Detail & Related papers (2025-09-01T08:24:28Z)
- Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent [66.78052387054593]
This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task. We show that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees.
arXiv Detail & Related papers (2025-08-11T17:40:47Z)
- CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps. We introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions. We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z)
- A Survey on Latent Reasoning [100.54120559169735]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities. CoT reasoning that verbalizes intermediate steps limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state.
arXiv Detail & Related papers (2025-07-08T17:29:07Z)
- How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
- Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z)
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- Transformers Provably Solve Parity Efficiently with Chain of Thought [40.78854925996]
This work provides the first theoretical analysis of training transformers to solve complex problems. We consider training a one-layer transformer to solve the fundamental $k$-parity problem.
arXiv Detail & Related papers (2024-10-11T08:55:17Z)