The NLP Task Effectiveness of Long-Range Transformers
- URL: http://arxiv.org/abs/2202.07856v1
- Date: Wed, 16 Feb 2022 04:39:35 GMT
- Title: The NLP Task Effectiveness of Long-Range Transformers
- Authors: Guanghui Qin, Yukun Feng, Benjamin Van Durme
- Abstract summary: Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention of long-range transformers has advantages in content selection and query-guided decoding, but it comes with previously unrecognized drawbacks.
- Score: 38.46467445144777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models cannot easily scale to long sequences due to their O(N^2)
time and space complexity. This has led to Transformer variants seeking to
lessen computational complexity, such as Longformer and Performer. While such
models have theoretically greater efficiency, their effectiveness on real NLP
tasks has not been well studied. We benchmark 7 variants of Transformer models
on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the
effect of pretraining and hyperparameter settings, to focus on their capacity
for long-range attention. Moreover, we present various methods to investigate
attention behaviors, to illuminate model details beyond metric scores. We find
that the attention of long-range transformers has advantages in content selection
and query-guided decoding, but it comes with previously unrecognized drawbacks
such as insufficient attention to distant tokens.
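The quadratic cost above comes from every token attending to every other token, so the score matrix alone has N^2 entries. As a rough illustration (not code from the paper), the sketch below contrasts dense softmax attention with a sliding-window variant in the spirit of Longformer's local attention, which restricts each query to a fixed window of w neighbours and therefore scales as O(N·w); all shapes and names here are assumptions made for the example.

```python
# Illustrative sketch: dense vs. sliding-window self-attention (single head, no batching).
# Not code from the paper; shapes and the window size are assumptions for the example.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """O(N^2) time and memory: materializes an N x N score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # (N, N)
    return softmax(scores) @ V

def sliding_window_attention(Q, K, V, w=4):
    """O(N * w): each query attends only to keys within +/- w positions."""
    N, d = Q.shape
    out = np.empty_like(V)
    for i in range(N):
        lo, hi = max(0, i - w), min(N, i + w + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # at most 2w + 1 scores
        out[i] = softmax(scores) @ V[lo:hi]
    return out

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(full_attention(Q, K, V).shape)             # (16, 8)
print(sliding_window_attention(Q, K, V).shape)   # (16, 8)
```

The windowed variant is what makes long inputs affordable, but it also hints at the drawback the abstract reports: tokens outside the window receive no attention at all unless the model adds global tokens or other mechanisms.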
Related papers
- On the Long Range Abilities of Transformers [69.3021852589771]
We demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena benchmark.
We identify two key principles for long-range tasks: (i) incorporating an inductive bias towards smoothness, and (ii) locality.
As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters.
arXiv Detail & Related papers (2023-11-28T09:21:48Z)
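The entry above credits long-range gains to smoothness and locality biases that add no trainable parameters. As a generic illustration of a parameter-free locality bias, not necessarily the exact modification proposed in that paper, one can subtract a penalty that grows with token distance from the attention logits before the softmax, similar in spirit to ALiBi-style biases:

```python
# Hypothetical illustration of a parameter-free locality bias on attention logits.
# A generic technique in the spirit of the entry above, not its exact method.
import numpy as np

def local_bias_attention(Q, K, V, slope=0.1):
    """Softmax attention with a fixed linear distance penalty added to the logits."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                                  # (N, N) logits
    dist = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])   # |i - j|
    scores = scores - slope * dist            # distant tokens get lower logits
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

The bias costs one element-wise subtraction and introduces no parameters, consistent with the "negligible additional computation" claim; a smoothness bias could analogously be imposed by averaging attention weights over neighbouring positions.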
- Manifold-Preserving Transformers are Effective for Short-Long Range Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z)
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters can achieve performance competitive with vanilla transformers using full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
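MASFormer, summarized above, mixes attention spans across depth: most layers use a short local span while only a few layers keep full attention. A minimal sketch of that layer layout is shown below; which layers get full attention and the window size are made-up values for illustration, not the paper's configuration.

```python
# Hypothetical sketch of mixed attention spans across layers (MASFormer-style layout).
# The chosen layers and window size are assumptions for illustration only.
import numpy as np

def build_layer_masks(seq_len, num_layers, full_layers=(0, 11), window=128):
    """Return one boolean attention mask per layer (True = attention allowed)."""
    idx = np.arange(seq_len)
    local = np.abs(idx[:, None] - idx[None, :]) <= window   # banded local mask
    full = np.ones((seq_len, seq_len), dtype=bool)          # unrestricted mask
    return [full if layer in full_layers else local for layer in range(num_layers)]

masks = build_layer_masks(seq_len=1024, num_layers=12)
print(sum(m.all() for m in masks), "of", len(masks), "layers use full attention")  # 2 of 12
```

Only the full-attention layers pay the quadratic cost, which is how such a model keeps overall compute close to a purely local-attention stack.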
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
The focus of our experiments is on oil & gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
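The Lazy Neuron entry above states that sparsity immediately implies FLOP savings. The underlying reasoning: in a ReLU MLP block computing relu(x @ W1) @ W2, a hidden unit whose activation is zero contributes nothing to the second product, so its row of W2 can be skipped. A small check of that equivalence (sizes are arbitrary; this is not the paper's implementation):

```python
# Sketch: skipping zero activations in an MLP block preserves the output exactly.
# Sizes are arbitrary; the cited paper finds trained Transformers far sparser than
# this random example, so the savings in practice would be larger.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))          # one token's hidden state
W1 = rng.standard_normal((512, 2048))      # expansion weights
W2 = rng.standard_normal((2048, 512))      # projection weights

h = np.maximum(x @ W1, 0.0)                # ReLU activations
dense_out = h @ W2                         # full 2048 x 512 product

active = np.nonzero(h[0])[0]               # indices of non-zero hidden units
sparse_out = h[:, active] @ W2[active, :]  # multiply only the active rows of W2

print(np.allclose(dense_out, sparse_out))              # True
print(f"active units: {len(active)} / {h.shape[1]}")
```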
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
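The kernelized-attention entry above rests on a standard linear-algebra fact: a Toeplitz matrix-vector product can be computed in O(N log N) by embedding the Toeplitz matrix in a circulant matrix, which the FFT diagonalizes. A minimal numpy demonstration of that fact follows; it illustrates only the Toeplitz-FFT trick, not the paper's full RPE attention method.

```python
# Sketch: multiplying by a Toeplitz matrix in O(N log N) via circulant embedding and FFT.
# Demonstrates the trick referenced above, not the full kernelized-attention method.
import numpy as np

rng = np.random.default_rng(0)
N = 8
c = rng.standard_normal(N)                                 # first column of T
r = np.concatenate(([c[0]], rng.standard_normal(N - 1)))   # first row of T
x = rng.standard_normal(N)

# Dense O(N^2) reference: T[i, j] = c[i - j] if i >= j else r[j - i].
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(N)] for i in range(N)])
direct = T @ x

# Embed T in a 2N x 2N circulant matrix whose first column is [c, 0, r reversed without r[0]];
# circulant multiplication is a circular convolution, which the FFT computes in O(N log N).
circ_col = np.concatenate([c, [0.0], r[:0:-1]])
fast = np.fft.ifft(np.fft.fft(circ_col) * np.fft.fft(x, n=2 * N)).real[:N]

print(np.allclose(direct, fast))  # True
```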
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long-Range Arena benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- DA-Transformer: Distance-aware Transformer [87.20061062572391]
In this paper, we propose DA-Transformer, a distance-aware Transformer that can exploit the real distance.
arXiv Detail & Related papers (2020-10-14T10:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.