Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
- URL: http://arxiv.org/abs/2601.15158v2
- Date: Thu, 29 Jan 2026 13:30:35 GMT
- Title: Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
- Authors: Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen,
- Abstract summary: We analyze the policy gradient dynamics of single-layer Transformers trained via Reinforcement Learning.<n>We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm.
- Score: 4.344634631420729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
Related papers
- From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers [67.02076505996284]
We study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other.<n>Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.
arXiv Detail & Related papers (2025-12-21T08:10:26Z) - Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck.<n>Under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with sample complexity.<n>We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z) - When Do Transformers Learn Heuristics for Graph Connectivity? [33.73385470817422]
We prove that an $L$-layer model has capacity to solve for graphs with diameters up to exactly $3L$.<n>We analyze the training-dynamics, and show that the learned strategy hinges on whether most training instances are within this model capacity.
arXiv Detail & Related papers (2025-10-22T16:43:32Z) - Cross-Entropy Is All You Need To Invert the Data Generating Process [29.94396019742267]
Empirical phenomena suggest that supervised models can learn interpretable factors of variation in a linear fashion.<n>Recent advances in self-supervised learning have shown that these methods can recover latent structures by inverting the data generating process.<n>We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation.
arXiv Detail & Related papers (2024-10-29T09:03:57Z) - Transformers Provably Solve Parity Efficiently with Chain of Thought [40.78854925996]
This work provides the first theoretical analysis of training transformers to solve complex problems.<n>We consider training a one-layer transformer to solve the fundamental $k$-parity problem.
arXiv Detail & Related papers (2024-10-11T08:55:17Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the gradient non-ity algorithms for our convergence landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527835]
Chain-of-shift (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps.<n>We show that despite the theoretical success of CoT, it fails to provide an accurate generalization when CoT does.
arXiv Detail & Related papers (2024-10-03T03:12:51Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption.
We analyze how magnitude-based models affect generalization while improving adaption.
We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Towards an Understanding of Stepwise Inference in Transformers: A
Synthetic Graph Navigation Model [19.826983068662106]
We propose to study autoregressive Transformer models on a synthetic task that embodies the multi-step nature of problems where stepwise inference is generally most useful.
Despite is simplicity, we find we can empirically reproduce and analyze several phenomena observed at scale.
arXiv Detail & Related papers (2024-02-12T16:25:47Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.