Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
- URL: http://arxiv.org/abs/2508.08222v1
- Date: Mon, 11 Aug 2025 17:40:47 GMT
- Title: Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent
- Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
- Abstract summary: This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task. We show that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees.
- Score: 66.78052387054593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, our understanding of the underlying mechanisms by which they acquire these abilities through training remains limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive pass. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.
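The two symbolic tasks from the abstract can be sketched directly in code. This is a minimal illustration of the target functions the transformer is trained to implement, assuming a hypothetical parent-pointer encoding of the tree; the paper itself presents trees to the model as token sequences.

```python
# Backward reasoning: walk parent pointers from the goal node up to the root.
# Forward reasoning: first recover the goal-to-root path, then reverse it,
# mirroring the two-stage procedure described in the abstract.

def backward_path(parent, goal):
    """Return the path from `goal` up to the root (parent of root is None)."""
    path = [goal]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def forward_path(parent, goal):
    """Return the root-to-goal path by reversing the backward path."""
    return list(reversed(backward_path(parent, goal)))

# Example tree: node 0 is the root; parent[child] = parent node.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
print(backward_path(parent, 4))  # [4, 1, 0]
print(forward_path(parent, 4))   # [0, 1, 4]
```

The point of the paper is that a one-layer multi-head transformer, trained by gradient descent on chain-of-thought traces of exactly these procedures, provably learns to execute them autoregressively on unseen trees.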
Related papers
- Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data [4.344634631420729]
We analyze the policy gradient dynamics of single-layer Transformers trained via Reinforcement Learning. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm.
arXiv Detail & Related papers (2026-01-21T16:36:19Z) - Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought [64.43689151961054]
We theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem. Our analysis reveals that during training using continuous thought, the index-matching logit will first increase and then remain bounded under mild assumptions.
arXiv Detail & Related papers (2025-09-27T15:23:46Z) - Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks. We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z) - How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z) - How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check'. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z) - A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning [16.65073455206535]
We study a minimal propositional logic problem that requires combining multiple facts to arrive at a solution. By studying this problem on Mistral and Gemma models, up to 27B parameters, we illuminate the core components the models use to solve such logic problems. We offer fine-grained insights into the functions of attention heads in different layers.
arXiv Detail & Related papers (2024-11-06T18:35:32Z) - Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge.
We focus on two representative reasoning types, composition and comparison.
We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.