Probability Distributions Computed by Hard-Attention Transformers
- URL: http://arxiv.org/abs/2510.27118v1
- Date: Fri, 31 Oct 2025 02:41:05 GMT
- Title: Probability Distributions Computed by Hard-Attention Transformers
- Authors: Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang
- Abstract summary: We show that making transformer language recognizers autoregressive can sometimes increase their expressivity. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
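The distinction the abstract draws, between a recognizer that accepts or rejects strings and a language model that assigns probabilities and generates autoregressively, can be sketched concretely. The toy alphabet and probabilities below are purely illustrative and are not taken from the paper; a language model factors the probability of a string as a product of next-token conditionals.

```python
import random

# Toy autoregressive language model over the alphabet {a, b} plus an
# end-of-string symbol. It defines a distribution over strings via
# p(w_1 ... w_n) = prod_i p(w_i | w_1 ... w_{i-1}).
# The conditional probabilities here are made up for illustration.

def next_token_dist(prefix):
    """Conditional distribution over the next token given the prefix."""
    # Illustrative rule: after an 'a', favor 'b'; otherwise favor 'a' or EOS.
    if prefix and prefix[-1] == "a":
        return {"a": 0.1, "b": 0.7, "<eos>": 0.2}
    return {"a": 0.5, "b": 0.2, "<eos>": 0.3}

def string_probability(tokens):
    """Probability the model assigns to a complete string (including EOS)."""
    p, prefix = 1.0, []
    for t in tokens + ["<eos>"]:
        p *= next_token_dist(prefix)[t]
        prefix.append(t)
    return p

def sample(max_len=10):
    """Generate a string autoregressively, sampling one token at a time."""
    prefix = []
    while len(prefix) < max_len:
        dist = next_token_dist(prefix)
        t = random.choices(list(dist), weights=list(dist.values()))[0]
        if t == "<eos>":
            break
        prefix.append(t)
    return prefix
```

A recognizer only answers yes/no per string; the probabilistic, autoregressive view above is the setting whose expressivity the paper characterizes.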
Related papers
- Transformers are Inherently Succinct [46.836122954309566]
We prove that transformers are highly expressive in that they can represent formal languages substantially more succinctly than standard representations of formal languages. As a by-product of this expressivity, we show that verifying properties of transformers is provably intractable.
arXiv Detail & Related papers (2025-10-22T07:25:54Z)
- Characterizing the Expressivity of Transformer Language Models [56.598551673153366]
We provide an exact characterization of fixed-precision transformers with strict future masking and soft attention. We show that these models are precisely as expressive as a specific fragment of linear temporal logic. We further relate this logic to established classes in formal language theory, automata theory, and algebra.
arXiv Detail & Related papers (2025-05-29T16:30:30Z)
- Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators [0.40792653193642503]
Large Language Models (LLMs) powered by Transformers have demonstrated human-like intelligence capabilities. This paper presents a novel framework for interpreting LLMs as probabilistic generators of left context-sensitive languages (CSLs).
arXiv Detail & Related papers (2025-04-15T04:06:27Z)
- Transformers Simulate MLE for Sequence Generation in Bayesian Networks [18.869174453242383]
We investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE). We demonstrate that there exists a simple transformer model that can estimate the conditional probabilities of the Bayesian network according to the context. We further demonstrate in extensive experiments that such a transformer not only exists in theory but can also be effectively obtained through training.
arXiv Detail & Related papers (2025-01-05T13:56:51Z)
- Extracting Moore Machines from Transformers using Queries and Counterexamples [6.612713406498215]
We construct finite state automata as high-level abstractions of transformers trained on regular languages. We extract Moore machines, as many training tasks used in the literature can be mapped onto them. We demonstrate the usefulness of this approach by studying positive-only learning and the sequence accuracy measure in detail.
arXiv Detail & Related papers (2024-10-08T13:43:50Z)
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
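For context on the language class this entry targets: a deterministic context-free language can be recognized with a single stack, which is what a stack-based attention mechanism is meant to emulate. The classic example below, balanced brackets (Dyck-1), is a plain stack recognizer for illustration; it is not the paper's mechanism.

```python
# Recognizer for the Dyck-1 language of balanced round brackets,
# a canonical deterministic context-free language. A stack suffices:
# push on "(", pop on ")", and accept iff the stack ends empty.

def balanced(s):
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            if not stack:
                return False  # closing bracket with nothing to match
            stack.pop()
    return not stack  # accept only if every bracket was matched
```

Languages like this are exactly where fixed-depth attention alone struggles and an explicit stack helps.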
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
- Transformers Can Represent $n$-gram Language Models [56.06361029539347]
We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM.
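To make the target class concrete: an $n$-gram LM conditions each token only on the previous $n-1$ tokens, which is why a fixed mechanism such as hard attention can encode it. The sketch below estimates a bigram ($n=2$) model from counts over a toy corpus; the corpus and names are illustrative, not the paper's construction.

```python
from collections import defaultdict

# Bigram (n = 2) language model estimated by maximum likelihood from
# a toy corpus. Each token's probability depends only on the single
# previous token, so the whole model is a table of conditionals.

corpus = [["a", "b", "a"], ["a", "b", "b"]]

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    toks = ["<bos>"] + sent + ["<eos>"]
    for prev, cur in zip(toks, toks[1:]):
        counts[prev][cur] += 1

def bigram_prob(prev, cur):
    """MLE estimate of p(cur | prev) from the toy corpus."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0
```

Because the conditioning context has bounded length, a transformer head only ever needs to attend a fixed distance back, which is the intuition behind the exact-representation result.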
arXiv Detail & Related papers (2024-04-23T12:51:37Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively with transformers on small to medium-sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- On the Ability and Limitations of Transformers to Recognize Formal Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, in contrast to LSTMs, Transformers perform well on only a subset of regular languages, with performance degrading on the rest.
arXiv Detail & Related papers (2020-09-23T17:21:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.