TRA: Better Length Generalisation with Threshold Relative Attention
- URL: http://arxiv.org/abs/2503.23174v4
- Date: Mon, 06 Oct 2025 12:50:07 GMT
- Authors: Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao
- Abstract summary: Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. We show how refactoring the attention mechanism with targeted mitigations for these failures in place can substantially improve the generalisation capabilities of decoder-only transformers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even if the dot product between a key and a query is highly negative (i.e. an irrelevant key), learned positional biases may unintentionally up-weight such information, which becomes dangerous when distances fall out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity, which completely removes irrelevant keys from the attention softmax, and b) contextualised relative distance, where distance is considered only between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder-only transformers.
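The two mitigations the abstract names can be sketched concretely: drop every key whose score falls below a threshold, then count relative distance only over the keys that survive. The threshold value, the form of the positional bias, and the function names below are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def threshold_relative_attention(q, K, V, pos_bias, threshold=0.0):
    """Sketch of the two mitigations, under assumed details.

    q: (d,) query; K: (n, d) keys; V: (n, d) values;
    pos_bias: callable mapping a relative distance to a scalar bias.
    """
    scores = K @ q / np.sqrt(q.shape[0])   # scaled dot products
    keep = scores > threshold              # selective sparsity: drop
    if not keep.any():                     # irrelevant keys entirely
        return np.zeros(V.shape[1])
    idx = np.flatnonzero(keep)
    # contextualised relative distance: distance is counted only over
    # the surviving keys, not over absolute positions
    rel_dist = np.arange(len(idx))[::-1]   # 0 = nearest kept key
    biased = scores[idx] + np.array([pos_bias(d) for d in rel_dist])
    w = np.exp(biased - biased.max())
    w /= w.sum()
    return w @ V[idx]
```

Because excluded keys never enter the softmax, a highly negative key contributes exactly zero weight, and the positional bias is applied at distances that stay in distribution even when the raw sequence grows.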
Related papers
- Decomposing Query-Key Feature Interactions Using Contrastive Covariances [75.38737409771085]
We study the query-key space, the bilinear joint embedding space between queries and keys. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced.
arXiv Detail & Related papers (2026-02-04T16:50:02Z) - Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization [3.8776893257232032]
We study length generalization in transformers through the set complement task, in which a model must predict a uniform distribution over tokens absent from an input sequence. We show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences.
arXiv Detail & Related papers (2025-10-09T15:26:48Z) - Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Focus What Matters: Matchability-Based Reweighting for Local Feature Matching [6.361840891399624]
We propose a novel attention reweighting mechanism that incorporates a learnable bias term into the attention logits. Experiments conducted on three benchmark datasets validate the effectiveness of our method.
arXiv Detail & Related papers (2025-05-04T15:50:28Z) - Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing [22.61113682126067]
We identify two biases that detectors may also be prone to overfitting: position bias and content bias.
For the position bias, we observe that detectors are prone to lazily depending on the specific positions within an image.
As for content bias, we argue that detectors may potentially and mistakenly utilize forgery-unrelated information for detection.
arXiv Detail & Related papers (2025-01-08T09:30:45Z) - Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer [54.97718043685824]
We present the Hadamard Attention Recurrent Stereo Transformer (HART), which incorporates the following components. For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. We designed a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission.
arXiv Detail & Related papers (2025-01-02T02:51:16Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention. First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, and that linear attention falls short in this respect.
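The non-injectivity claim is easy to demonstrate numerically: with a non-negative feature map, any two queries on the same ray receive identical normalised weights. The ReLU feature map below is one common choice and an illustrative assumption, not the construction used in the cited paper.

```python
import numpy as np

def linear_attn_weights(q, K, phi=lambda x: np.maximum(x, 0.0)):
    """Normalised linear-attention weights phi(K) @ phi(q) / sum.

    Illustrative sketch of the non-injectivity argument: scaling q
    scales every score equally, so normalisation cancels it out.
    """
    s = phi(K) @ phi(q)
    return s / s.sum()

K = np.array([[1.0, 0.0], [0.0, 1.0]])
q1 = np.array([1.0, 2.0])
q2 = np.array([2.0, 4.0])   # a different query on the same ray
w1 = linear_attn_weights(q1, K)
w2 = linear_attn_weights(q2, K)
# w1 and w2 are identical even though q1 != q2: the weight map is not injective
```

Softmax attention does not collapse these two queries, since exp() reacts to the absolute magnitude of each score rather than only to their ratios.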
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
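One minimal way to realise "attention as a feature map" is to smooth each query's row of raw key scores with a small convolution before the softmax. The kernel values and the row-wise placement are assumptions in the spirit of the abstract, not DAPE V2's exact operator.

```python
import numpy as np

def conv_then_softmax(scores, kernel=np.array([0.25, 0.5, 0.25])):
    """Treat the (n_q, n_k) score matrix as a feature map: smooth each
    row with a 1-D convolution, then apply the usual softmax over keys.
    Kernel and padding mode are illustrative assumptions.
    """
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, scores)
    e = np.exp(smoothed - smoothed.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

The smoothing lets a strong score lend weight to its neighbouring positions, a locality prior that plain dot-product scores lack.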
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Are queries and keys always relevant? A case study on Transformer wave functions [0.0]
The dot-product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers. We explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions.
arXiv Detail & Related papers (2024-05-29T08:32:37Z) - From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers [7.011373967209572]
We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing.
We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks.
arXiv Detail & Related papers (2023-10-18T14:10:47Z) - Input-length-shortening and text generation via attention values [1.8222946691865871]
We show that the first layer's attention sums can be used to filter tokens in a given sequence.
We also show that retaining approximately 6% of the original sequence is sufficient to obtain 86.5% accuracy.
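A simple filtering rule consistent with this abstract is to rank tokens by the total first-layer attention they receive and keep the top fraction. The 6% figure mirrors the number quoted above; the selection rule itself, and the function name, are assumptions about how such filtering might be implemented.

```python
import numpy as np

def filter_tokens_by_attention(attn, tokens, keep_frac=0.06):
    """Keep the tokens receiving the largest total first-layer attention.

    attn: (n, n) row-stochastic attention matrix (rows are queries);
    tokens: length-n sequence. Illustrative sketch, not the paper's code.
    """
    received = attn.sum(axis=0)                # attention each token receives
    k = max(1, int(round(keep_frac * len(tokens))))
    keep = np.sort(np.argsort(received)[-k:])  # top-k, in original order
    return [tokens[i] for i in keep]
```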
arXiv Detail & Related papers (2023-03-14T02:11:24Z) - Linear Video Transformer with Feature Fixation [34.324346469406926]
Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
arXiv Detail & Related papers (2022-10-15T02:20:50Z) - Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z) - Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
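The idea of normalising along both axes can be sketched as follows: a softmax over keys gives the usual weights, and an extra normalisation over queries prevents a key from being "explained away" to near-zero weight by competing keys. This is a simplified one-round sketch; the cited paper's exact scheme and guarantees may differ.

```python
import numpy as np

def doubly_normalized_attention(scores):
    """One round of normalising attention along both axes.

    scores: (n_q, n_k) raw logits. Returns a row-stochastic matrix in
    which every key retains some attention mass. Illustrative sketch.
    """
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    a = e / e.sum(axis=1, keepdims=True)   # standard softmax over keys
    a = a / a.sum(axis=0, keepdims=True)   # normalise over queries
    a = a / a.sum(axis=1, keepdims=True)   # re-normalise rows
    return a
```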
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.