Formal Language Recognition by Hard Attention Transformers: Perspectives
from Circuit Complexity
- URL: http://arxiv.org/abs/2204.06618v1
- Date: Wed, 13 Apr 2022 19:25:42 GMT
- Authors: Yiding Hao, Dana Angluin, and Robert Frank
- Abstract summary: We show that UHAT and GUHAT Transformers, viewed as string acceptors, can only recognize formal languages in the complexity class AC$^0$.
In contrast, the non-AC$^0$ languages MAJORITY and DYCK-1 are recognizable by AHAT networks, implying that AHAT can recognize languages that UHAT and GUHAT cannot.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper analyzes three formal models of Transformer encoders that differ
in the form of their self-attention mechanism: unique hard attention (UHAT);
generalized unique hard attention (GUHAT), which generalizes UHAT; and
averaging hard attention (AHAT). We show that UHAT and GUHAT Transformers,
viewed as string acceptors, can only recognize formal languages in the
complexity class AC$^0$, the class of languages recognizable by families of
Boolean circuits of constant depth and polynomial size. This upper bound
subsumes Hahn's (2020) results that GUHAT cannot recognize the DYCK languages
or the PARITY language, since those languages are outside AC$^0$ (Furst et al.,
1984). In contrast, the non-AC$^0$ languages MAJORITY and DYCK-1 are
recognizable by AHAT networks, implying that AHAT can recognize languages that
UHAT and GUHAT cannot.
Related papers
- Probability Distributions Computed by Hard-Attention Transformers [53.17368795629463]
We show that making transformer language recognizers autoregressive can sometimes increase their expressivity. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use case as language models.
arXiv Detail & Related papers (2025-10-31T02:41:05Z) - Comparison of different Unique hard attention transformer models by the formal languages they can recognize [0.0]
We distinguish between masked vs. non-masked, finite vs. infinite image, and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
arXiv Detail & Related papers (2025-06-03T20:28:51Z) - Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z) - A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z) - Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages [7.938342455750221]
We study exact characterizations of transformers with hard attention and attention masking.
With strict masking (each position cannot attend to itself) and without position embeddings, these transformers are expressively equivalent to linear temporal logic.
arXiv Detail & Related papers (2023-10-21T03:26:39Z) - Logical Languages Accepted by Transformer Encoders with Hard Attention [45.14832807541816]
We focus on two self-attention mechanisms: (1) UHAT (Unique Hard Attention Transformers) and (2) AHAT (Average Hard Attention Transformers)
UHAT encoders are known to recognize only languages inside the circuit complexity class AC$^0$, i.e., those accepted by a family of polynomial-size, constant-depth circuits with unbounded fan-in.
AHAT encoders can recognize languages outside AC$^0$, but their expressive power still lies within the larger circuit complexity class TC$^0$, i.e., AC$^0$ circuits extended with majority gates.
arXiv Detail & Related papers (2023-10-05T18:13:40Z) - Linearity of Relation Decoding in Transformer Language Models [82.47019600662874]
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations.
We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation.
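The linear-approximation claim above can be illustrated with a toy sketch. All data here is synthetic and hypothetical (random vectors standing in for subject and object representations); it only shows the mechanics of fitting a single linear map by least squares, in the spirit of the paper's finding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 32 "subject representations" of dimension 8,
# and "object representations" generated by a ground-truth linear map.
d = 8
subjects = rng.normal(size=(32, d))
W_true = rng.normal(size=(d, d))
objects = subjects @ W_true.T

# Fit a single linear transformation W_hat by least squares, so that
# subjects @ W_hat approximates the object representations.
W_hat, *_ = np.linalg.lstsq(subjects, objects, rcond=None)
max_error = np.abs(subjects @ W_hat - objects).max()
```

In the synthetic setting the fit is essentially exact; the paper's claim is that for some relations, real LM representations admit a comparably good linear fit.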
arXiv Detail & Related papers (2023-08-17T17:59:19Z) - Average-Hard Attention Transformers are Constant-Depth Uniform Threshold
Circuits [0.0]
We show that average-hard attention transformers recognize languages that fall within the class TC$^0$.
Our paper shows that the first result can be extended to yield uniform circuits as well.
arXiv Detail & Related papers (2023-08-06T21:23:22Z) - Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z) - On the Ability and Limitations of Transformers to Recognize Formal
Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, and in contrast to LSTMs, Transformers do well on only a subset of regular languages, with degrading performance.
arXiv Detail & Related papers (2020-09-23T17:21:33Z) - Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.