Characterizing the Expressivity of Transformer Language Models
- URL: http://arxiv.org/abs/2505.23623v1
- Date: Thu, 29 May 2025 16:30:30 GMT
- Title: Characterizing the Expressivity of Transformer Language Models
- Authors: Jiaoda Li, Ryan Cotterell
- Abstract summary: We provide an exact characterization of fixed-precision transformers with strict future masking and soft attention. We show that these models are precisely as expressive as a specific fragment of linear temporal logic. We further relate this logic to established classes in formal language theory, automata theory, and algebra.
- Score: 56.598551673153366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions -- such as arbitrary numerical precision and hard attention -- that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.
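To make the idealization concrete, below is a minimal NumPy sketch of single-head soft attention under strict future masking, where position i may attend only to strictly earlier positions j < i (never to itself). The tensor shapes, the zero-vector fallback at the first position, and the use of ordinary floats as a stand-in for fixed precision are illustrative assumptions, not the paper's construction.

import numpy as np

def strict_future_mask(T: int) -> np.ndarray:
    """Boolean mask: entry (i, j) is True iff position i may attend to position j.

    Strict future masking allows only j < i; standard causal masking would also
    allow j == i.
    """
    return np.tril(np.ones((T, T), dtype=bool), k=-1)

def soft_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head softmax attention with strict future masking.

    The first position sees no predecessors; here it outputs a zero vector,
    which is one convention, not necessarily the paper's.
    """
    T, d = Q.shape
    mask = strict_future_mask(T)
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)        # exclude current and future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = np.where(mask, weights, 0.0)       # zero out masked positions exactly
    sums = weights.sum(axis=-1, keepdims=True)
    weights = weights / np.maximum(sums, 1e-12)  # rows with nothing to attend to stay zero
    return weights @ V

# Tiny usage example with random activations.
rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = rng.normal(size=(3, T, d))
out = soft_attention(Q, K, V)
print(out.shape)  # (5, 4); row 0 is all zeros under the convention above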
Related papers
- A Free Probabilistic Framework for Analyzing the Transformer-based Language Models [19.78896931593813]
We present a formal operator-theoretic framework for analyzing Transformer-based language models using free probability theory. This work offers a principled, though theoretical, perspective on structural dynamics in large language models.
arXiv Detail & Related papers (2025-06-19T19:13:02Z) - Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators [0.40792653193642503]
Large Language Models (LLMs) powered by Transformers have demonstrated human-like intelligence capabilities. This paper presents a novel framework for interpreting LLMs as generators of probabilistic left context-sensitive languages (CSLs).
arXiv Detail & Related papers (2025-04-15T04:06:27Z) - The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z) - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities.
This work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z) - A Formal Framework for Understanding Length Generalization in Transformers [14.15513446489798]
We introduce a rigorous theoretical framework to analyze length generalization in causal transformers. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
arXiv Detail & Related papers (2024-10-03T01:52:01Z) - Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z) - Clustering in pure-attention hardmax transformers and its role in sentiment analysis [0.0]
We rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.
We show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.
We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model.
arXiv Detail & Related papers (2024-06-26T16:13:35Z) - Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures. We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z) - Linearity of Relation Decoding in Transformer Language Models [82.47019600662874]
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations.
We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation (a toy sketch of this linearity claim appears after this list).
arXiv Detail & Related papers (2023-08-17T17:59:19Z) - On the Ability and Limitations of Transformers to Recognize Formal Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, and in contrast to LSTMs, Transformers do well only on a subset of regular languages, with performance degrading as the languages become more complex.
arXiv Detail & Related papers (2020-09-23T17:21:33Z)
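For the relation-decoding entry above (Linearity of Relation Decoding in Transformer Language Models), here is a toy NumPy sketch of the linearity claim: fit a single affine map from subject representations to object representations and check how well it predicts. The synthetic data, dimensions, and least-squares fit are illustrative assumptions, not the paper's actual extraction or evaluation procedure.

import numpy as np

# Synthetic stand-ins for hidden states: S holds subject representations and
# O the corresponding object representations (purely illustrative data).
rng = np.random.default_rng(1)
n, d = 200, 64
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
S = rng.normal(size=(n, d))
O = S @ W_true.T + 0.01 * rng.normal(size=(n, d))

# Fit a single affine map o ≈ W s + b by ordinary least squares.
S_aug = np.hstack([S, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
W_hat, b_hat = coef[:-1].T, coef[-1]

# A small relative error indicates the mapping is close to affine here.
pred = S @ W_hat.T + b_hat
print("relative error:", np.linalg.norm(pred - O) / np.linalg.norm(O))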
This list is automatically generated from the titles and abstracts of the papers on this site.