On the Long Range Abilities of Transformers
- URL: http://arxiv.org/abs/2311.16620v1
- Date: Tue, 28 Nov 2023 09:21:48 GMT
- Title: On the Long Range Abilities of Transformers
- Authors: Itamar Zimerman, Lior Wolf
- Abstract summary: We demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena benchmark.
We identify that two key principles for long-range tasks are (i) incorporating an inductive bias towards smoothness, and (ii) locality.
As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters.
- Score: 69.3021852589771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their dominance in modern DL and, especially, NLP domains,
transformer architectures exhibit sub-optimal performance on long-range tasks
compared to recent layers that are specifically designed for this purpose. In
this work, drawing inspiration from key attributes of long-range layers, such
as state-space layers, linear RNN layers, and global convolution layers, we
demonstrate that minimal modifications to the transformer architecture can
significantly enhance performance on the Long Range Arena (LRA) benchmark, thus
narrowing the gap with these specialized layers. We identify that two key
principles for long-range tasks are (i) incorporating an inductive bias towards
smoothness, and (ii) locality. As we show, integrating these ideas into the
attention mechanism improves results with a negligible amount of additional
computation and without any additional trainable parameters. Our theory and
experiments also shed light on the reasons for the inferior performance of
transformers on long-range tasks and identify critical properties that are
essential for successfully capturing long-range dependencies.
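As a rough illustration (not the authors' exact formulation), the two principles named in the abstract can be folded into a vanilla attention computation without any trainable parameters: a distance-dependent additive bias supplies locality, and a parameter-free box filter over each query's score row supplies smoothness. The window size and decay rate below are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def smooth_local_attention(q, k, v, window=3, decay=0.02):
    """q, k, v: (batch, heads, seq_len, head_dim). Parameter-free smoothness + locality sketch."""
    b, h, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # raw attention logits (b, h, n, n)

    # Locality: additive bias that linearly penalizes distant key positions
    # (ALiBi-style decay; the exact bias form is an assumption here).
    pos = torch.arange(n, device=q.device)
    dist = (pos[None, :] - pos[:, None]).abs().float()     # pairwise distances (n, n)
    scores = scores - decay * dist

    # Smoothness: box-filter each score row along the key axis,
    # i.e. a 1-D average pool with no trainable parameters.
    scores = F.avg_pool1d(
        scores.reshape(b * h * n, 1, n),
        kernel_size=window, stride=1, padding=window // 2,
        count_include_pad=False,
    ).reshape(b, h, n, n)

    return scores.softmax(dim=-1) @ v
```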
Related papers
- Towards a Deeper Understanding of Transformer for Residential Non-intrusive Load Monitoring [0.0]
This study delves into the effects of the number of hidden dimensions in the attention layer, the number of attention layers, the number of attention heads, and the dropout ratio on transformer performance.
It is expected that this work will serve as a foundation for future research and development of more robust and capable transformer models.
arXiv Detail & Related papers (2024-10-02T09:14:50Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
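A hedged sketch of the selection idea described above, assuming a per-key scoring network and a hard top-k gather in place of the paper's differentiable SPARSEK operator; the class name, gating trick, and the choice to share one selection across all queries are illustrative simplifications.

```python
import math
import torch
import torch.nn as nn

class SparseTopKAttention(nn.Module):
    def __init__(self, dim, k=64):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)   # scoring network over key states

    def forward(self, q, k, v):
        """q, k, v: (batch, seq_len, dim). Attends over a constant number of selected KV pairs."""
        b, n, d = k.shape
        key_scores = self.scorer(k).squeeze(-1)              # (b, n)
        topk = min(self.k, n)
        idx = key_scores.topk(topk, dim=-1).indices          # (b, topk)
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, d)
        k_sel = torch.gather(k, 1, idx_exp)                  # selected keys   (b, topk, d)
        v_sel = torch.gather(v, 1, idx_exp)                  # selected values (b, topk, d)
        # Multiply selected keys by their sigmoid scores so gradients still
        # reach the scorer despite the hard top-k selection.
        gate = torch.gather(key_scores.sigmoid(), 1, idx).unsqueeze(-1)
        attn = (q @ (k_sel * gate).transpose(-2, -1) / math.sqrt(d)).softmax(-1)
        return attn @ v_sel                                  # (b, seq_len, d)
```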
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
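A minimal sketch of the mechanism described above, assuming the current layer's queries simply attend over keys and values concatenated from the current layer and one preceding layer; the function name and signature are illustrative.

```python
import math
import torch

def skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev):
    """All tensors: (batch, heads, seq_len, head_dim)."""
    k = torch.cat([k_cur, k_prev], dim=-2)   # keys from current and preceding layer
    v = torch.cat([v_cur, v_prev], dim=-2)   # matching values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v
```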
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- Bidirectional Long-Range Parser for Sequential Data Understanding [3.76054468268713]
We introduce BLRP (Bidirectional Long-Range Parser), a novel and versatile attention mechanism designed to increase performance and efficiency on long-sequence tasks.
We show the benefits and versatility of our approach on vision and language domains by demonstrating competitive results against state-of-the-art methods.
arXiv Detail & Related papers (2024-04-08T05:45:03Z)
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- LSG Attention: Extrapolation of pretrained Transformers to long sequences [0.0]
We introduce the LSG architecture which relies on Local, Sparse and Global attention.
We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents.
We propose tools to train new models and adapt existing ones based on this mechanism.
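A hedged sketch of how a combined Local, Sparse and Global attention mask might be assembled; the window size, stride, and number of global tokens are assumptions, not values from the paper.

```python
import torch

def lsg_mask(seq_len, window=8, stride=16, num_global=2):
    """Boolean (seq_len, seq_len) mask; True means attention is allowed."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    local = dist <= window                                 # Local: sliding-window neighborhood
    sparse = (pos[None, :] % stride) == 0                  # Sparse: strided key columns
    glob = (pos[None, :] < num_global) | (pos[:, None] < num_global)  # Global: first tokens see and are seen by all
    return local | sparse | glob

# Usage (assumed): scores.masked_fill(~lsg_mask(n), float("-inf")) before softmax.
```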
arXiv Detail & Related papers (2022-10-13T13:10:41Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming the state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- The NLP Task Effectiveness of Long-Range Transformers [38.46467445144777]
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention of long-range transformers has advantages in content selection and query-guided decoding, but comes with previously unrecognized drawbacks.
arXiv Detail & Related papers (2022-02-16T04:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.