DA-Transformer: Distance-aware Transformer
- URL: http://arxiv.org/abs/2010.06925v2
- Date: Sun, 11 Apr 2021 09:01:02 GMT
- Title: DA-Transformer: Distance-aware Transformer
- Authors: Chuhan Wu, Fangzhao Wu, Yongfeng Huang
- Abstract summary: DA-Transformer is a distance-aware Transformer that exploits the real distances between tokens to re-scale self-attention weights.
- Score: 87.20061062572391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has achieved great success in NLP as the backbone of various
advanced models such as BERT and GPT. However, Transformer and its existing
variants may be suboptimal at capturing token distances, because the position
or distance embeddings these methods use usually cannot preserve the precise
values of the real distances, which can hinder modeling the order of and
relations among contexts. In this paper, we propose DA-Transformer, a
distance-aware Transformer that exploits the real distances between tokens. We
incorporate these real distances to re-scale the raw self-attention weights,
which are computed from the relevance between the attention query and key.
Concretely, in each self-attention head the relative distance between every
pair of tokens is weighted by a different learnable parameter, which controls
that head's preference for long- or short-term information. Since the raw
weighted real distances may not be optimal for adjusting the self-attention
weights, we propose a learnable sigmoid function to map them into re-scaled
coefficients with proper ranges. We first clip the raw self-attention weights
with the ReLU function to keep them non-negative and introduce sparsity, and
then multiply them by the re-scaled coefficients to encode real distance
information into self-attention. Extensive experiments on five benchmark
datasets show that DA-Transformer effectively improves performance on many
tasks and outperforms the vanilla Transformer and several of its variants.
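Below is a minimal PyTorch sketch of the mechanism as described in the abstract: per-head learnable weights on relative token distances, a learnable sigmoid that maps the weighted distances to re-scaling coefficients, ReLU-clipped raw attention weights, and an element-wise product of the two. The module name, tensor shapes, the exact parameterization of the learnable sigmoid, and the final renormalization are assumptions for illustration, not the paper's published equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceAwareSelfAttention(nn.Module):
    """Sketch of distance-aware multi-head self-attention (single layer).

    The parameterization is an assumption inferred from the abstract; it is
    not the authors' released implementation.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One distance weight per head: controls that head's preference
        # for long- vs. short-term information.
        self.dist_weight = nn.Parameter(torch.zeros(n_heads))
        # Scale of the learnable sigmoid that maps weighted distances
        # into re-scaling coefficients with a proper range (assumed form).
        self.sigmoid_scale = nn.Parameter(torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # (B, T, d_model) -> (B, heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Raw self-attention weights from query-key relevance, ReLU-clipped
        # to keep them non-negative and introduce sparsity.
        scores = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5  # (B, H, T, T)
        scores = F.relu(scores)

        # Relative distances |i - j|, weighted per head, then mapped through
        # a learnable sigmoid into re-scaling coefficients in (0, 2).
        pos = torch.arange(T, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs().float()        # (T, T)
        weighted = self.dist_weight.view(1, -1, 1, 1) * dist      # (1, H, T, T)
        coeff = 2.0 * torch.sigmoid(self.sigmoid_scale.view(1, -1, 1, 1) * weighted)

        # Encode distance information by re-scaling, then renormalize
        # (assumed; the abstract does not specify the final normalization).
        attn = scores * coeff
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
```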
Related papers
- Do Efficient Transformers Really Save Computation? [34.15764596496696]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z)
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
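A minimal sketch of the inverted layout described above, assuming a standard PyTorch encoder: each variate's full lookback window is embedded as one token, so attention runs across variates rather than time steps. The class name, layer sizes, and projection head are illustrative assumptions, not the released iTransformer code.

```python
import torch
import torch.nn as nn

class InvertedTransformerSketch(nn.Module):
    """Attention and feed-forward applied on the inverted (variate) dimension."""

    def __init__(self, lookback: int, horizon: int, d_model: int = 128,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)   # one token per variate
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback, n_variates) -> invert to (batch, n_variates, lookback)
        tokens = self.embed(x.transpose(1, 2))      # (batch, n_variates, d_model)
        tokens = self.encoder(tokens)               # attention across variates
        return self.head(tokens).transpose(1, 2)    # (batch, horizon, n_variates)
```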
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences.
We introduce a relative position embedding to explicitly maximize attention resolution.
We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
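The summary above only names the idea, so here is a hedged sketch of a generic 2D locality bias added to attention logits. It illustrates the "vicinity" intuition but keeps quadratic cost; it is not the paper's linear-complexity Vicinity Attention. The function name and the temperature `tau` are assumptions.

```python
import torch

def locality_biased_attention(scores: torch.Tensor, coords: torch.Tensor,
                              tau: float = 4.0) -> torch.Tensor:
    """Down-weight attention between spatially distant patches.

    scores: (batch, heads, n_patches, n_patches) raw attention logits
    coords: (n_patches, 2) patch (row, col) positions on the 2D grid
    """
    # Pairwise Euclidean distance between patch centers.
    dist = torch.cdist(coords.float(), coords.float())   # (N, N)
    bias = -dist / tau                                    # nearer patches get a larger bias
    return torch.softmax(scores + bias, dim=-1)
```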
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions needed for universal approximation.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- The NLP Task Effectiveness of Long-Range Transformers [38.46467445144777]
Transformer models cannot easily scale to long sequences due to their $O(N^2)$ time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention of long-range transformers has advantages in content selection and query-guided decoding, but it comes with previously unrecognized drawbacks.
arXiv Detail & Related papers (2022-02-16T04:39:35Z)
- Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
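The Toeplitz observation above is the part that can be illustrated compactly: multiplying a Toeplitz relative-position matrix by a set of vectors can be done in $O(n \log n)$ via an FFT over a circulant embedding. This is only the FFT trick, not the paper's full kernelized attention; the function name and argument layout are assumptions.

```python
import torch

def toeplitz_matmul_fft(c_pos: torch.Tensor, c_neg: torch.Tensor,
                        x: torch.Tensor) -> torch.Tensor:
    """Multiply a Toeplitz matrix T (T[i, j] = c[i - j]) by x via FFT,
    without forming the n x n matrix explicitly.

    c_pos: (n,)   values c_0, c_1, ..., c_{n-1}    (offsets i - j >= 0)
    c_neg: (n-1,) values c_{-(n-1)}, ..., c_{-1}   (offsets i - j < 0)
    x:     (n, d) right-hand side
    """
    n, d = x.shape
    # First column of the circulant embedding of T (length 2n - 1).
    col = torch.cat([c_pos, c_neg])
    x_pad = torch.cat([x, x.new_zeros(n - 1, d)], dim=0)
    # Circular convolution via FFT; keep the first n rows.
    y = torch.fft.ifft(torch.fft.fft(col)[:, None] *
                       torch.fft.fft(x_pad, dim=0), dim=0).real
    return y[:n]

# Quick check against the explicit Toeplitz matrix.
n, d = 6, 3
c = torch.randn(2 * n - 1)   # c[k + n - 1] holds c_k for k in [-(n-1), n-1]
x = torch.randn(n, d)
T = torch.stack([torch.stack([c[i - j + n - 1] for j in range(n)]) for i in range(n)])
fast = toeplitz_matmul_fft(c[n - 1:], c[:n - 1], x)
assert torch.allclose(T @ x, fast, atol=1e-5)
```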
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys.
We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
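A minimal single-head sketch of the chunked top-$k$ idea described above: scores are computed chunk by chunk, only each query's top-$k$ key scores are kept, and the softmax is taken over those scores alone. The function name, chunk size, and the single-head simplification are assumptions.

```python
import torch
import torch.nn.functional as F

def topk_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   top_k: int = 32, chunk: int = 256) -> torch.Tensor:
    """Approximate softmax attention by keeping only each query's top-k
    key scores, processing queries in chunks to bound peak memory.

    q, k, v: (seq_len, d) single-head tensors for brevity.
    """
    d = q.shape[-1]
    top_k = min(top_k, k.shape[0])
    outputs = []
    for start in range(0, q.shape[0], chunk):
        q_chunk = q[start:start + chunk]                   # (c, d)
        scores = q_chunk @ k.t() / d ** 0.5                # (c, n_keys)
        top_scores, top_idx = scores.topk(top_k, dim=-1)   # (c, top_k)
        probs = F.softmax(top_scores, dim=-1)              # softmax over top-k only
        top_v = v[top_idx]                                 # (c, top_k, d)
        outputs.append(torch.einsum('ck,ckd->cd', probs, top_v))
    return torch.cat(outputs, dim=0)
```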
arXiv Detail & Related papers (2021-06-13T02:30:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.