The NLP Task Effectiveness of Long-Range Transformers
- URL: http://arxiv.org/abs/2202.07856v1
- Date: Wed, 16 Feb 2022 04:39:35 GMT
- Title: The NLP Task Effectiveness of Long-Range Transformers
- Authors: Guanghui Qin, Yukun Feng, Benjamin Van Durme
- Abstract summary: Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention of long-range transformers has advantages in content selection and query-guided decoding, but it comes with previously unrecognized drawbacks.
- Score: 38.46467445144777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models cannot easily scale to long sequences due to their O(N^2)
time and space complexity. This has led to Transformer variants seeking to
lessen computational complexity, such as Longformer and Performer. While such
models have theoretically greater efficiency, their effectiveness on real NLP
tasks has not been well studied. We benchmark 7 variants of Transformer models
on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the
effect of pretraining and hyperparameter settings, to focus on their capacity
for long-range attention. Moreover, we present various methods to investigate
attention behaviors, to illuminate model details beyond metric scores. We find
that the attention of long-range transformers has advantages in content selection
and query-guided decoding, but it comes with previously unrecognized drawbacks
such as insufficient attention to distant tokens.
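The quadratic cost above comes from every token attending to every other token, so the score matrix alone has N^2 entries. As a rough illustration (not code from the paper), the sketch below contrasts dense softmax attention with a sliding-window variant in the spirit of Longformer's local attention, which restricts each query to a fixed window of w neighbours and therefore scales as O(N·w); all shapes and names here are assumptions made for the example.

```python
# Illustrative sketch: dense vs. sliding-window self-attention (single head, no batching).
# Not code from the paper; shapes and the window size are assumptions for the example.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """O(N^2) time and memory: materializes an N x N score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (N, N)
    return softmax(scores) @ V

def sliding_window_attention(Q, K, V, w=4):
    """O(N * w): each query attends only to keys within +/- w positions."""
    N, d = Q.shape
    out = np.empty_like(V)
    for i in range(N):
        lo, hi = max(0, i - w), min(N, i + w + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # at most 2w + 1 scores
        out[i] = softmax(scores) @ V[lo:hi]
    return out

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(full_attention(Q, K, V).shape)             # (16, 8)
print(sliding_window_attention(Q, K, V).shape)   # (16, 8)
```

The windowed variant is what makes long inputs affordable, but it also hints at the drawback the abstract reports: tokens outside the window receive no attention at all unless the model adds global tokens or other mechanisms.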
Related papers
- On the Long Range Abilities of Transformers [69.3021852589771]
We demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena benchmark.
We identify two key principles for long-range tasks: (i) incorporating an inductive bias towards smoothness, and (ii) locality.
As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters.
arXiv Detail & Related papers (2023-11-28T09:21:48Z)
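The entry above credits long-range gains to smoothness and locality biases that add no trainable parameters. As a generic illustration of a parameter-free locality bias, not necessarily the exact modification proposed in that paper, one can subtract a penalty that grows with token distance from the attention logits before the softmax, similar in spirit to ALiBi-style biases:

```python
# Hypothetical illustration of a parameter-free locality bias on attention logits.
# A generic technique in the spirit of the entry above, not its exact method.
import numpy as np

def local_bias_attention(Q, K, V, slope=0.1):
    """Softmax attention with a fixed linear distance penalty added to the logits."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                                  # (N, N) logits
    dist = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])   # |i - j|
    scores = scores - slope * dist            # distant tokens get lower logits
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The bias costs one element-wise subtraction and introduces no parameters, consistent with the "negligible additional computation" claim; a smoothness bias could analogously be imposed by averaging attention weights over neighbouring positions.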
- Manifold-Preserving Transformers are Effective for Short-Long Range Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z)
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters can achieve performance competitive with vanilla transformers using full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
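MASFormer, summarized above, mixes attention spans across depth: most layers use a short local span while only a few layers keep full attention. A minimal sketch of that layer layout is shown below; which layers get full attention and the window size are made-up values for illustration, not the paper's configuration.

```python
# Hypothetical sketch of mixed attention spans across layers (MASFormer-style layout).
# The chosen layers and window size are assumptions for illustration only.
import numpy as np

def build_layer_masks(seq_len, num_layers, full_layers=(0, 11), window=128):
    """Return one boolean attention mask per layer (True = attention allowed)."""
    idx = np.arange(seq_len)
    local = np.abs(idx[:, None] - idx[None, :]) <= window   # banded local mask
    full = np.ones((seq_len, seq_len), dtype=bool)          # unrestricted mask
    return [full if layer in full_layers else local for layer in range(num_layers)]

masks = build_layer_masks(seq_len=1024, num_layers=12)
print(sum(m.all() for m in masks), "of", len(masks), "layers use full attention")  # 2 of 12
```

Only the full-attention layers pay the quadratic cost, which is how such a model keeps overall compute close to a purely local-attention stack.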
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
The focus of our experiments is on oil & gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
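The Lazy Neuron entry above states that sparsity immediately implies FLOP savings. The underlying reasoning: in a ReLU MLP block computing relu(x @ W1) @ W2, a hidden unit whose activation is zero contributes nothing to the second product, so its row of W2 can be skipped. A small check of that equivalence (sizes are arbitrary; this is not the paper's implementation):

```python
# Sketch: skipping zero activations in an MLP block preserves the output exactly.
# Sizes are arbitrary; the cited paper finds trained Transformers far sparser than
# this random example, so the savings in practice would be larger.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))          # one token's hidden state
W1 = rng.standard_normal((512, 2048))      # expansion weights
W2 = rng.standard_normal((2048, 512))      # projection weights

h = np.maximum(x @ W1, 0.0)                # ReLU activations
dense_out = h @ W2                         # full 2048 x 512 product

active = np.nonzero(h[0])[0]               # indices of non-zero hidden units
sparse_out = h[:, active] @ W2[active, :]  # multiply only the active rows of W2

print(np.allclose(dense_out, sparse_out))              # True
print(f"active units: {len(active)} / {h.shape[1]}")
```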
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
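The kernelized-attention entry above rests on a standard linear-algebra fact: a Toeplitz matrix-vector product can be computed in O(N log N) by embedding the Toeplitz matrix in a circulant matrix, which the FFT diagonalizes. A minimal numpy demonstration of that fact follows; it illustrates only the Toeplitz-FFT trick, not the paper's full RPE attention method.

```python
# Sketch: multiplying by a Toeplitz matrix in O(N log N) via circulant embedding and FFT.
# Demonstrates the trick referenced above, not the full kernelized-attention method.
import numpy as np

rng = np.random.default_rng(0)
N = 8
c = rng.standard_normal(N)                                 # first column of T
r = np.concatenate(([c[0]], rng.standard_normal(N - 1)))   # first row of T
x = rng.standard_normal(N)

# Dense O(N^2) reference: T[i, j] = c[i - j] if i >= j else r[j - i].
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(N)] for i in range(N)])
direct = T @ x

# Embed T in a 2N x 2N circulant matrix whose first column is [c, 0, r reversed without r[0]];
# circulant multiplication is a circular convolution, which the FFT computes in O(N log N).
circ_col = np.concatenate([c, [0.0], r[:0:-1]])
fast = np.fft.ifft(np.fft.fft(circ_col) * np.fft.fft(x, n=2 * N)).real[:N]

print(np.allclose(direct, fast))  # True
```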
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long-Range Arena benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- DA-Transformer: Distance-aware Transformer [87.20061062572391]
In this paper, we propose DA-Transformer, a distance-aware Transformer that can exploit the real distance.
arXiv Detail & Related papers (2020-10-14T10:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.