On the Long Range Abilities of Transformers
- URL: http://arxiv.org/abs/2311.16620v1
- Date: Tue, 28 Nov 2023 09:21:48 GMT
- Title: On the Long Range Abilities of Transformers
- Authors: Itamar Zimerman, Lior Wolf
- Abstract summary: We demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena benchmark.
We identify that two key principles for long-range tasks are (i) incorporating an inductive bias towards smoothness, and (ii) locality.
As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters.
- Score: 69.3021852589771
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their dominance in modern DL and, especially, NLP domains,
transformer architectures exhibit sub-optimal performance on long-range tasks
compared to recent layers that are specifically designed for this purpose. In
this work, drawing inspiration from key attributes of long-range layers, such
as state-space layers, linear RNN layers, and global convolution layers, we
demonstrate that minimal modifications to the transformer architecture can
significantly enhance performance on the Long Range Arena (LRA) benchmark, thus
narrowing the gap with these specialized layers. We identify that two key
principles for long-range tasks are (i) incorporating an inductive bias towards
smoothness, and (ii) locality. As we show, integrating these ideas into the
attention mechanism improves results with a negligible amount of additional
computation and without any additional trainable parameters. Our theory and
experiments also shed light on the reasons for the inferior performance of
transformers on long-range tasks and identify critical properties that are
essential for successfully capturing long-range dependencies.
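As a rough illustration (not the authors' exact formulation), the two principles named in the abstract can be folded into a vanilla attention computation without any trainable parameters: a distance-dependent additive bias supplies locality, and a parameter-free box filter over each query's score row supplies smoothness. The window size and decay rate below are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def smooth_local_attention(q, k, v, window=3, decay=0.02):
    """q, k, v: (batch, heads, seq_len, head_dim). Parameter-free smoothness + locality sketch."""
    b, h, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # raw attention logits (b, h, n, n)

    # Locality: additive bias that linearly penalizes distant key positions
    # (ALiBi-style decay; the exact bias form is an assumption here).
    pos = torch.arange(n, device=q.device)
    dist = (pos[None, :] - pos[:, None]).abs().float()     # pairwise distances (n, n)
    scores = scores - decay * dist

    # Smoothness: box-filter each score row along the key axis,
    # i.e. a 1-D average pool with no trainable parameters.
    scores = F.avg_pool1d(
        scores.reshape(b * h * n, 1, n),
        kernel_size=window, stride=1, padding=window // 2,
        count_include_pad=False,
    ).reshape(b, h, n, n)

    return scores.softmax(dim=-1) @ v
```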
Related papers
- Towards a Deeper Understanding of Transformer for Residential Non-intrusive Load Monitoring [0.0]
This study delves into the effects of the number of hidden dimensions in the attention layer, the number of attention layers, the number of attention heads, and the dropout ratio on transformer performance.
It is expected that this work will serve as a foundation for future research and development of more robust and capable transformer models.
arXiv Detail & Related papers (2024-10-02T09:14:50Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
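A hedged sketch of the selection idea described above, assuming a per-key scoring network and a hard top-k gather in place of the paper's differentiable SPARSEK operator; the class name, gating trick, and the choice to share one selection across all queries are illustrative simplifications.

```python
import math
import torch
import torch.nn as nn

class SparseTopKAttention(nn.Module):
    def __init__(self, dim, k=64):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)   # scoring network over key states

    def forward(self, q, k, v):
        """q, k, v: (batch, seq_len, dim). Attends over a constant number of selected KV pairs."""
        b, n, d = k.shape
        key_scores = self.scorer(k).squeeze(-1)              # (b, n)
        topk = min(self.k, n)
        idx = key_scores.topk(topk, dim=-1).indices          # (b, topk)
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, d)
        k_sel = torch.gather(k, 1, idx_exp)                  # selected keys   (b, topk, d)
        v_sel = torch.gather(v, 1, idx_exp)                  # selected values (b, topk, d)
        # Multiply selected keys by their sigmoid scores so gradients still
        # reach the scorer despite the hard top-k selection.
        gate = torch.gather(key_scores.sigmoid(), 1, idx).unsqueeze(-1)
        attn = (q @ (k_sel * gate).transpose(-2, -1) / math.sqrt(d)).softmax(-1)
        return attn @ v_sel                                  # (b, seq_len, d)
```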
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
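A minimal sketch of the mechanism described above, assuming the current layer's queries simply attend over keys and values concatenated from the current layer and one preceding layer; the function name and signature are illustrative.

```python
import math
import torch

def skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev):
    """All tensors: (batch, heads, seq_len, head_dim)."""
    k = torch.cat([k_cur, k_prev], dim=-2)   # keys from current and preceding layer
    v = torch.cat([v_cur, v_prev], dim=-2)   # matching values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v
```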
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- Bidirectional Long-Range Parser for Sequential Data Understanding [3.76054468268713]
We introduce BLRP (Bidirectional Long-Range Parser), a novel and versatile attention mechanism designed to increase performance and efficiency on long-sequence tasks.
We show the benefits and versatility of our approach on vision and language domains by demonstrating competitive results against state-of-the-art methods.
arXiv Detail & Related papers (2024-04-08T05:45:03Z)
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- LSG Attention: Extrapolation of pretrained Transformers to long sequences [0.0]
We introduce the LSG architecture which relies on Local, Sparse and Global attention.
We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents.
We propose tools to train new models and adapt existing ones based on this mechanism.
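A hedged sketch of how a combined Local, Sparse and Global attention mask might be assembled; the window size, stride, and number of global tokens are assumptions, not values from the paper.

```python
import torch

def lsg_mask(seq_len, window=8, stride=16, num_global=2):
    """Boolean (seq_len, seq_len) mask; True means attention is allowed."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    local = dist <= window                                 # Local: sliding-window neighborhood
    sparse = (pos[None, :] % stride) == 0                  # Sparse: strided key columns
    glob = (pos[None, :] < num_global) | (pos[:, None] < num_global)  # Global: first tokens see and are seen by all
    return local | sparse | glob

# Usage (assumed): scores.masked_fill(~lsg_mask(n), float("-inf")) before softmax.
```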
arXiv Detail & Related papers (2022-10-13T13:10:41Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming the state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- The NLP Task Effectiveness of Long-Range Transformers [38.46467445144777]
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention of long-range transformers has advantages in content selection and query-guided decoding, but comes with previously unrecognized drawbacks.
arXiv Detail & Related papers (2022-02-16T04:39:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.