DA-Transformer: Distance-aware Transformer
- URL: http://arxiv.org/abs/2010.06925v2
- Date: Sun, 11 Apr 2021 09:01:02 GMT
- Title: DA-Transformer: Distance-aware Transformer
- Authors: Chuhan Wu, Fangzhao Wu, Yongfeng Huang
- Abstract summary: DA-Transformer is a distance-aware Transformer that exploits the real distances between tokens to re-scale self-attention weights.
- Score: 87.20061062572391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has achieved great success in NLP as the backbone of various
advanced models such as BERT and GPT. However, Transformer and its existing
variants may be suboptimal at capturing token distances, because the position
or distance embeddings these methods use usually cannot preserve the precise
values of the real distances, which can hinder modeling the order of and
relations among contexts. In this paper, we propose DA-Transformer, a
distance-aware Transformer that exploits the real distances between tokens. We
incorporate these real distances to re-scale the raw self-attention weights,
which are computed from the relevance between the attention query and key.
Concretely, in each self-attention head the relative distance between every
pair of tokens is weighted by a different learnable parameter, which controls
that head's preference for long- or short-term information. Since the raw
weighted real distances may not be optimal for adjusting the self-attention
weights, we propose a learnable sigmoid function to map them into re-scaled
coefficients with proper ranges. We first clip the raw self-attention weights
with the ReLU function to keep them non-negative and introduce sparsity, and
then multiply them by the re-scaled coefficients to encode real distance
information into self-attention. Extensive experiments on five benchmark
datasets show that DA-Transformer effectively improves performance on many
tasks and outperforms the vanilla Transformer and several of its variants.
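Below is a minimal PyTorch sketch of the mechanism as described in the abstract: per-head learnable weights on relative token distances, a learnable sigmoid that maps the weighted distances to re-scaling coefficients, ReLU-clipped raw attention weights, and an element-wise product of the two. The module name, tensor shapes, the exact parameterization of the learnable sigmoid, and the final renormalization are assumptions for illustration, not the paper's published equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceAwareSelfAttention(nn.Module):
    """Sketch of distance-aware multi-head self-attention (single layer).

    The parameterization is an assumption inferred from the abstract; it is
    not the authors' released implementation.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One distance weight per head: controls that head's preference
        # for long- vs. short-term information.
        self.dist_weight = nn.Parameter(torch.zeros(n_heads))
        # Scale of the learnable sigmoid that maps weighted distances
        # into re-scaling coefficients with a proper range (assumed form).
        self.sigmoid_scale = nn.Parameter(torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # (B, T, d_model) -> (B, heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Raw self-attention weights from query-key relevance, ReLU-clipped
        # to keep them non-negative and introduce sparsity.
        scores = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5  # (B, H, T, T)
        scores = F.relu(scores)

        # Relative distances |i - j|, weighted per head, then mapped through
        # a learnable sigmoid into re-scaling coefficients in (0, 2).
        pos = torch.arange(T, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs().float()        # (T, T)
        weighted = self.dist_weight.view(1, -1, 1, 1) * dist      # (1, H, T, T)
        coeff = 2.0 * torch.sigmoid(self.sigmoid_scale.view(1, -1, 1, 1) * weighted)

        # Encode distance information by re-scaling, then renormalize
        # (assumed; the abstract does not specify the final normalization).
        attn = scores * coeff
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
```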
Related papers
- Do Efficient Transformers Really Save Computation? [34.15764596496696]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
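A minimal sketch of the inverted layout described above, assuming a standard PyTorch encoder: each variate's full lookback window is embedded as one token, so attention runs across variates rather than time steps. The class name, layer sizes, and projection head are illustrative assumptions, not the released iTransformer code.

```python
import torch
import torch.nn as nn

class InvertedTransformerSketch(nn.Module):
    """Attention and feed-forward applied on the inverted (variate) dimension."""

    def __init__(self, lookback: int, horizon: int, d_model: int = 128,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)   # one token per variate
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback, n_variates) -> invert to (batch, n_variates, lookback)
        tokens = self.embed(x.transpose(1, 2))      # (batch, n_variates, d_model)
        tokens = self.encoder(tokens)               # attention across variates
        return self.head(tokens).transpose(1, 2)    # (batch, horizon, n_variates)
```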
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
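The summary above only names the idea, so here is a hedged sketch of a generic 2D locality bias added to attention logits. It illustrates the "vicinity" intuition but keeps quadratic cost; it is not the paper's linear-complexity Vicinity Attention. The function name and the temperature `tau` are assumptions.

```python
import torch

def locality_biased_attention(scores: torch.Tensor, coords: torch.Tensor,
                              tau: float = 4.0) -> torch.Tensor:
    """Down-weight attention between spatially distant patches.

    scores: (batch, heads, n_patches, n_patches) raw attention logits
    coords: (n_patches, 2) patch (row, col) positions on the 2D grid
    """
    # Pairwise Euclidean distance between patch centers.
    dist = torch.cdist(coords.float(), coords.float())   # (N, N)
    bias = -dist / tau                                    # nearer patches get a larger bias
    return torch.softmax(scores + bias, dim=-1)
```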
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions needed for universal approximation.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- The NLP Task Effectiveness of Long-Range Transformers [38.46467445144777]
Transformer models cannot easily scale to long sequences due to their $O(N^2)$ time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention of long-range transformers has advantages in content selection and query-guided decoding, but it comes with previously unrecognized drawbacks.
arXiv Detail & Related papers (2022-02-16T04:39:35Z)
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
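The Toeplitz observation above is the part that can be illustrated compactly: multiplying a Toeplitz relative-position matrix by a set of vectors can be done in $O(n \log n)$ via an FFT over a circulant embedding. This is only the FFT trick, not the paper's full kernelized attention; the function name and argument layout are assumptions.

```python
import torch

def toeplitz_matmul_fft(c_pos: torch.Tensor, c_neg: torch.Tensor,
                        x: torch.Tensor) -> torch.Tensor:
    """Multiply a Toeplitz matrix T (T[i, j] = c[i - j]) by x via FFT,
    without forming the n x n matrix explicitly.

    c_pos: (n,)   values c_0, c_1, ..., c_{n-1}    (offsets i - j >= 0)
    c_neg: (n-1,) values c_{-(n-1)}, ..., c_{-1}   (offsets i - j < 0)
    x:     (n, d) right-hand side
    """
    n, d = x.shape
    # First column of the circulant embedding of T (length 2n - 1).
    col = torch.cat([c_pos, c_neg])
    x_pad = torch.cat([x, x.new_zeros(n - 1, d)], dim=0)
    # Circular convolution via FFT; keep the first n rows.
    y = torch.fft.ifft(torch.fft.fft(col)[:, None] *
                       torch.fft.fft(x_pad, dim=0), dim=0).real
    return y[:n]

# Quick check against the explicit Toeplitz matrix.
n, d = 6, 3
c = torch.randn(2 * n - 1)   # c[k + n - 1] holds c_k for k in [-(n-1), n-1]
x = torch.randn(n, d)
T = torch.stack([torch.stack([c[i - j + n - 1] for j in range(n)]) for i in range(n)])
fast = toeplitz_matmul_fft(c[n - 1:], c[:n - 1], x)
assert torch.allclose(T @ x, fast, atol=1e-5)
```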
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys.
We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
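A minimal single-head sketch of the chunked top-$k$ idea described above: scores are computed chunk by chunk, only each query's top-$k$ key scores are kept, and the softmax is taken over those scores alone. The function name, chunk size, and the single-head simplification are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   top_k: int = 32, chunk: int = 256) -> torch.Tensor:
    """Approximate softmax attention by keeping only each query's top-k
    key scores, processing queries in chunks to bound peak memory.

    q, k, v: (seq_len, d) single-head tensors for brevity.
    """
    d = q.shape[-1]
    top_k = min(top_k, k.shape[0])
    outputs = []
    for start in range(0, q.shape[0], chunk):
        q_chunk = q[start:start + chunk]                   # (c, d)
        scores = q_chunk @ k.t() / d ** 0.5                # (c, n_keys)
        top_scores, top_idx = scores.topk(top_k, dim=-1)   # (c, top_k)
        probs = F.softmax(top_scores, dim=-1)              # softmax over top-k only
        top_v = v[top_idx]                                 # (c, top_k, d)
        outputs.append(torch.einsum('ck,ckd->cd', probs, top_v))
    return torch.cat(outputs, dim=0)
```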
arXiv Detail & Related papers (2021-06-13T02:30:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.