Comparison of different Unique hard attention transformer models by the formal languages they can recognize
- URL: http://arxiv.org/abs/2506.03370v1
- Date: Tue, 03 Jun 2025 20:28:51 GMT
- Title: Comparison of different Unique hard attention transformer models by the formal languages they can recognize
- Authors: Leonid Ryvkin
- Abstract summary: We distinguish between masked vs. non-masked, finite vs. infinite image and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This note is a survey of various results on the capabilities of unique hard attention transformer encoders (UHATs) to recognize formal languages. We distinguish between masked vs. non-masked, finite vs. infinite image and general vs. bilinear attention score functions. We recall some relations between these models, as well as a lower bound in terms of first-order logic and an upper bound in terms of circuit complexity.
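The model variants surveyed differ in how attention scores are masked and computed, but all share the unique hard attention primitive: each position attends to a single score-maximizing position rather than a softmax mixture. A minimal sketch of that primitive (the function name, the leftmost tie-breaking convention, and the masking setup here are illustrative, not the paper's notation):

```python
import numpy as np

def unique_hard_attention(scores, values, mask=None):
    """Unique hard attention: each query position attends to exactly one
    key position, the leftmost one achieving the maximal score."""
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked variant (e.g. causal)
    idx = np.argmax(scores, axis=-1)  # argmax returns the leftmost maximum on ties
    return values[idx], idx

# toy example: 2 query positions over 3 key/value positions
scores = np.array([[0.1, 0.9, 0.3],
                   [0.5, 0.2, 0.5]])   # second row has a tie at positions 0 and 2
values = np.array([[1.0], [2.0], [3.0]])
out, idx = unique_hard_attention(scores, values)
# idx → [1, 0]: the unique maximum, and the leftmost position on the tie
```

Replacing the argmax selection with softmax-weighted averaging recovers the usual (average hard / soft) attention, which is exactly the dividing line between the model classes compared in the note.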
Related papers
- Probability Distributions Computed by Hard-Attention Transformers [53.17368795629463]
We show that making transformer language recognizers autoregressive can sometimes increase their expressivity. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
arXiv Detail & Related papers (2025-10-31T02:41:05Z)
- The Impossibility of Inverse Permutation Learning in Transformer Models [15.463427361201914]
We study the problem of inverse permutation learning in decoder-only transformers. We show that an arbitrary-depth, decoder-only transformer cannot learn this task. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate "thinking" tokens can enable reasoning.
arXiv Detail & Related papers (2025-09-28T23:48:11Z)
- Interference Matrix: Quantifying Cross-Lingual Interference in Transformer Encoders [55.749883010057545]
We construct an interference matrix by training and evaluating small BERT-like models on all possible language pairs. Our analysis reveals that interference between languages is asymmetrical and that its patterns do not align with traditional linguistic characteristics.
arXiv Detail & Related papers (2025-08-04T10:02:19Z)
- Comateformer: Combined Attention Transformer for Semantic Sentence Matching [11.746010399185437]
We propose a novel semantic sentence matching model, the Combined Attention Network based on the Transformer model (Comateformer). In the Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties. Our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores.
arXiv Detail & Related papers (2024-12-10T06:18:07Z)
- Linearity of Relation Decoding in Transformer Language Models [82.47019600662874]
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations.
We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation.
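The core claim, that relation decoding is well-approximated by a single linear (affine) transformation of the subject representation, can be illustrated on synthetic data by fitting such a map with least squares. This is only a sketch of the general idea; the variable names and setup are illustrative, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic "subject" representations and a ground-truth affine relation
d = 8
W_true = rng.normal(size=(d, d))
b_true = rng.normal(size=d)
subjects = rng.normal(size=(100, d))
objects = subjects @ W_true.T + b_true

# fit W, b by least squares on (subject, object) representation pairs
X = np.hstack([subjects, np.ones((len(subjects), 1))])  # append bias column
coef, *_ = np.linalg.lstsq(X, objects, rcond=None)
W_hat, b_hat = coef[:-1].T, coef[-1]

# for a relation that truly is affine, the fit recovers the map exactly;
# for real LM representations, the residual measures how "linear" the relation is
err = np.abs(subjects @ W_hat.T + b_hat - objects).max()
```

In the paper's setting the pairs come from the LM's hidden states rather than synthetic data, and the quality of the fit is what distinguishes the linearly decodable subset of relations.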
arXiv Detail & Related papers (2023-08-17T17:59:19Z)
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
- Join-Chain Network: A Logical Reasoning View of the Multi-head Attention in Transformer [59.73454783958702]
We propose a symbolic reasoning architecture that chains many join operators together to model output logical expressions.
In particular, we demonstrate that such an ensemble of join-chains can express a broad subset of "tree-structured" first-order logical expressions, named FOET.
We find that the widely used multi-head self-attention module in transformer can be understood as a special neural operator that implements the union bound of the join operator in probabilistic predicate space.
arXiv Detail & Related papers (2022-10-06T07:39:58Z)
- Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity [1.0159205678719043]
We show that UHAT and GUHAT Transformers, viewed as string acceptors, can only recognize formal languages in the complexity class AC$^0$.
In contrast, the non-AC$^0$ languages MAJORITY and DYCK-1 are recognizable by AHAT networks, implying that AHAT can recognize languages that UHAT and GUHAT cannot.
arXiv Detail & Related papers (2022-04-13T19:25:42Z)
- Leveraging redundancy in attention with Reuse Transformers [58.614198953733194]
Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way.
A typical Transformer model computes such pairwise attention scores repeatedly for the same sequence.
We propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers.
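The reuse idea can be sketched in a few lines: compute the attention weights once, then apply the same weights to per-layer value projections, so subsequent layers skip the score computation. Shapes and names here are illustrative, and the sketch omits multi-head structure and residual connections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reuse_attention_stack(x, w_q, w_k, w_v_layers):
    """Compute scaled Q·K^T attention weights once, reuse them in every layer."""
    q, k = x @ w_q, x @ w_k
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # computed once
    for w_v in w_v_layers:       # later layers reuse the cached weights
        x = weights @ (x @ w_v)
    return x

rng = np.random.default_rng(0)
n, d = 5, 4
x = rng.normal(size=(n, d))
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
out = reuse_attention_stack(x, w_q, w_k,
                            [rng.normal(size=(d, d)) for _ in range(3)])
```

The saving is the per-layer cost of the score matrix: with sequence length n and width d, each reusing layer avoids the O(n²·d) Q·K^T product.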
arXiv Detail & Related papers (2021-10-13T16:08:02Z)
- Incorporating Residual and Normalization Layers into Analysis of Masked Language Models [29.828669678974983]
We extend the scope of the analysis of Transformers from solely the attention patterns to the whole attention block.
Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed.
arXiv Detail & Related papers (2021-09-15T08:32:20Z)
- On the Ability and Limitations of Transformers to Recognize Formal Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, in contrast to LSTMs, Transformers do well only on a subset of regular languages, with performance degrading as the languages become more complex.
arXiv Detail & Related papers (2020-09-23T17:21:33Z)
- Hard Non-Monotonic Attention for Character-Level Transduction [65.17388794270694]
We introduce an exact, exponential-time algorithm for marginalizing over a number of non-monotonic alignments between two strings.
We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the approximation and outperforms soft attention.
arXiv Detail & Related papers (2018-08-29T20:00:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.