Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
- URL: http://arxiv.org/abs/2511.07378v1
- Date: Mon, 10 Nov 2025 18:40:24 GMT
- Title: Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
- Authors: Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen
- Abstract summary: A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. We provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT.
- Score: 53.89723291716722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning on synthetic state-tracking tasks with gradient descent. We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT. Specifically, our theory characterizes the length generalization of transformers through the mechanism of attention concentration, linking the retrieval robustness of the attention layer to the state-tracking task structure of long-context reasoning. Moreover, for transformers with limited reasoning length, we prove that a recursive self-training scheme can progressively extend the range of solvable problem lengths. To our knowledge, we provide the first optimization guarantee that constant-depth transformers provably learn $\mathsf{NC}^1$-complete problems with CoT, significantly going beyond prior art confined in $\mathsf{TC}^0$, unless the widely held conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.
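A canonical state-tracking task of the kind analyzed here (our illustration; the paper's exact data format may differ) is the word problem over the symmetric group $S_5$, which is $\mathsf{NC}^1$-complete by Barrington's theorem: the input is a sequence of permutations, the CoT is the chain of prefix products (the running state), and the answer is the final state. A minimal data-generation sketch, with all names and the example format being our own assumptions:

```python
# Illustrative sketch only: a state-tracking task as the word problem over S_5.
# The CoT records the running prefix product, so longer inputs yield longer CoTs,
# which is the axis along which length generalization is measured.
import itertools
import random

def compose(p, q):
    """Compose permutations given as tuples: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

def make_example(length, rng, n=5):
    """Sample a word of `length` elements of S_n and its prefix-product CoT."""
    elements = list(itertools.permutations(range(n)))
    word = [rng.choice(elements) for _ in range(length)]
    state = tuple(range(n))        # identity permutation: the initial state
    cot = []
    for g in word:
        state = compose(g, state)  # one step of state tracking
        cot.append(state)
    return {"input": word, "cot": cot, "answer": state}

if __name__ == "__main__":
    rng = random.Random(0)
    # Train on short words, evaluate on longer ones to probe length generalization.
    train = [make_example(8, rng) for _ in range(1000)]
    test = [make_example(32, rng) for _ in range(100)]
    print(train[0]["answer"], test[0]["answer"])
```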
Related papers
- Rational Transductors [30.916600771560933]
We introduce Rational Transductors, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence. By injecting rational state information into the attention mechanism via a Deep Rational Injection scheme, our framework strictly generalizes the expressive power of Transformers.
arXiv Detail & Related papers (2026-02-07T16:01:10Z) - Quantitative Bounds for Length Generalization in Transformers [58.175107357008876]
We study the problem of length generalization (LG) in transformers. LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences.
arXiv Detail & Related papers (2025-10-30T21:31:36Z) - The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner [71.41162392872393]
This paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of large language models. TAIL synthesizes chain-of-thought (CoT) data that imitates the execution process of a Turing Machine by computer programs (a toy sketch of this idea appears after this list). Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks.
arXiv Detail & Related papers (2025-07-17T17:50:07Z) - Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought [64.43689151961054]
We prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem. In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously.
arXiv Detail & Related papers (2025-05-18T18:36:53Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper argues that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527835]
Chain-of-Thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps. We show that in-context learning without intermediate steps may fail to provide an accurate answer in cases where CoT succeeds.
arXiv Detail & Related papers (2024-10-03T03:12:51Z) - Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers [18.331374727331077]
The time complexity of tensor attention is a significant obstacle to its utilization in transformers.
We prove that the backward gradient of attention training can be computed in almost linear time.
arXiv Detail & Related papers (2024-05-26T02:59:13Z) - Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective [39.47116013338394]
Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs).
We show that CoT can handle a general class of decision-making problems known as Dynamic Programming.
arXiv Detail & Related papers (2023-05-24T17:59:21Z)
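The TAIL entry above synthesizes CoT data by imitating a Turing machine's execution with programs. As a rough illustration only (the machine, function names, and trace format below are our own, not TAIL's actual pipeline), each executed transition of a simple binary-increment machine can be logged as one intermediate reasoning step:

```python
# Illustrative sketch only: turn a Turing-machine-style execution into a CoT trace.
# The machine increments a binary number; every transition becomes one "thought".
def increment_trace(bits):
    """Run a binary-increment machine and log the tape/head after every step."""
    tape = list(bits)
    head = len(tape) - 1               # start at the least-significant bit
    trace = []
    while True:
        if head < 0:                   # ran off the left edge: prepend a blank 0
            tape.insert(0, "0")
            head = 0
        if tape[head] == "1":          # carry: write 0 and keep moving left
            tape[head] = "0"
            head -= 1
            trace.append(f"carry: tape={''.join(tape)} head={head}")
        else:                          # absorb the carry and halt
            tape[head] = "1"
            trace.append(f"halt: tape={''.join(tape)}")
            return "".join(tape), trace

result, cot = increment_trace("1011")
print(result)              # 1100
print("\n".join(cot))      # the step-by-step trace used as chain-of-thought
```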