A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers
- URL: http://arxiv.org/abs/2503.03961v2
- Date: Thu, 22 May 2025 20:26:28 GMT
- Title: A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers
- Authors: William Merrill, Ashish Sabharwal
- Abstract summary: Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. We show that even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems. We quantitatively predict how depth must grow with input length to express these problems.
- Score: 29.839710738657203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer's depth affects its expressive power. We address these questions by analyzing transformers whose depth can grow minimally with context length $n$. We show even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking abilities and was known to be expressible only by an unconventional, non-uniform model of transformers, and graph connectivity, which underlies multi-step reasoning. Notably, both of these problems cannot be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, our detailed experiments designed to bridge the expressivity vs. learnability gap reveal that our theoretical depth requirements for regular language recognition closely match the practical depth requirements for successfully training transformers. Thus, our results clarify how depth affects a transformer's reasoning capabilities, and provide practical guidance for effective depth selection for sequential reasoning.
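For intuition only (an illustrative sketch under assumptions, not the paper's construction): recognizing a regular language reduces to composing the per-symbol transition functions of a finite automaton, and a balanced binary tree of pairwise compositions needs only about $\log_2 n$ levels, which mirrors why depth $\Theta(\log n)$ can suffice for state tracking. The DFA, helper names, and inputs below are hypothetical examples chosen for illustration.

```python
# Illustrative sketch (not the paper's construction): log-many levels of pairwise
# function composition suffice to track a DFA's state over an input of length n.

from math import ceil, log2

# Hypothetical example DFA over {a, b}: accept iff the number of 'b's is even.
STATES = (0, 1)
DELTA = {           # per-symbol transition functions, as tuples indexed by state
    "a": (0, 1),    # 'a' leaves the state unchanged
    "b": (1, 0),    # 'b' flips the state
}

def compose(f, g):
    """Transition function for 'read f's symbol, then g's symbol'."""
    return tuple(g[f[q]] for q in STATES)

def recognize(word, start=0, accepting=(0,)):
    """Reduce the per-symbol transition functions with a balanced binary tree."""
    if not word:
        return start in accepting
    funcs = [DELTA[c] for c in word]
    levels = 0
    while len(funcs) > 1:                       # one loop iteration = one "layer"
        nxt = [compose(funcs[i], funcs[i + 1])
               for i in range(0, len(funcs) - 1, 2)]
        if len(funcs) % 2:                      # odd leftover passes through
            nxt.append(funcs[-1])
        funcs, levels = nxt, levels + 1
    assert levels == (ceil(log2(len(word))) if len(word) > 1 else 0)
    return funcs[0][start] in accepting

print(recognize("abba"))   # True: two 'b's
print(recognize("ab"))     # False: one 'b'
```

The same logarithmic-rounds intuition applies to graph connectivity: each Boolean squaring of the reachability matrix doubles the path length it captures, so roughly $\log_2 n$ squarings cover every path in an $n$-node graph. Again, the sketch below is a plain-Python illustration of that doubling argument, not the paper's uniform transformer construction; the graph and the function `connected` are hypothetical.

```python
# Illustrative sketch (not the paper's construction): ceil(log2 n) Boolean
# squarings of the reachability matrix decide connectivity in an n-node graph,
# because each squaring doubles the path length the matrix accounts for.

import numpy as np

def connected(adjacency: np.ndarray, s: int, t: int) -> bool:
    n = adjacency.shape[0]
    reach = adjacency.astype(bool) | np.eye(n, dtype=bool)   # paths of length <= 1
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        # Boolean "square": reach[u, v] becomes True iff some w has reach[u, w]
        # and reach[w, v], doubling the maximum path length represented.
        reach = (reach.astype(np.int64) @ reach.astype(np.int64)) > 0
    return bool(reach[s, t])

# Hypothetical example: path 0-1-2-3 plus an isolated node 4.
A = np.zeros((5, 5), dtype=bool)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = True

print(connected(A, 0, 3))   # True: same path component
print(connected(A, 0, 4))   # False: node 4 is isolated
```

Both sketches only illustrate the divide-and-conquer pattern that makes logarithmic depth plausible; the paper's quantitative depth predictions come from its theory, not from these toy programs.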
Related papers
- Knee-Deep in C-RASP: A Transformer Depth Hierarchy [7.9266383017424795]
We consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP, and we prove that deeper C-RASP programs are more expressive than shallower ones.
arXiv Detail & Related papers (2025-06-19T06:27:54Z) - Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights [47.62295798627317]
This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold. We prove approximation and generalization error bounds that crucially depend on the intrinsic dimension of the manifold. Our results demonstrate that transformers can leverage low-complexity structures in learning tasks even when the input data are perturbed by high-dimensional noise.
arXiv Detail & Related papers (2025-05-06T05:41:46Z) - On the Robustness of Transformers against Context Hijacking for Linear Classification [26.1838836907147]
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. They can be disrupted by factually correct context, a phenomenon known as context hijacking. We show that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations.
arXiv Detail & Related papers (2025-02-21T17:31:00Z) - Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers [5.4649464326326]
Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. In this work, we initiate the study of systematic lower bounds for the number of CoT steps across different algorithmic problems.
arXiv Detail & Related papers (2025-02-04T15:14:01Z) - On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z) - On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z) - Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Faster Depth-Adaptive Transformers [71.20237659479703]
Depth-adaptive neural networks can dynamically adjust depths according to the hardness of input words.
Previous works generally build a halting unit to decide whether the computation should continue or stop at each layer.
In this paper, we get rid of the halting unit and estimate the required depths in advance, which yields a faster depth-adaptive model.
arXiv Detail & Related papers (2020-04-27T15:08:10Z)