DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention
- URL: http://arxiv.org/abs/2211.16368v1
- Date: Thu, 24 Nov 2022 03:06:36 GMT
- Title: DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention
- Authors: Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang
- Abstract summary: We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
- Score: 53.02648818164273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many studies have been conducted to improve the efficiency of Transformer
from quadratic to linear. Among them, the low-rank-based methods aim to learn
projection matrices that compress the sequence length. However, once learned,
these projection matrices are fixed, so every sequence is compressed with the
same coefficients for tokens at a given position. Adopting such
input-invariant projections ignores the fact that the most informative part of
a sequence varies from sequence to sequence, thus failing to preserve the most
useful information that lies in varied positions. In addition, previous
efficient Transformers focus only on the influence of the sequence length while
neglecting the effect of the hidden state dimension. To address the aforementioned
problems, we present an efficient yet effective attention mechanism, namely the
Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length
by input-sensitive dynamic projection matrices and achieves linear time and
space complexity by jointly optimizing the sequence length and hidden state
dimension while maintaining state-of-the-art performance. Specifically, we
first theoretically demonstrate that the sequence length can be compressed
non-destructively from a novel perspective of information theory, with
compression matrices dynamically determined by the input sequence. Furthermore,
we show that the hidden state dimension can be approximated by extending the
Johnson-Lindenstrauss lemma, optimizing the attention in bilinear form.
Theoretical analysis shows that DBA is proficient in capturing high-order
relations in cross-attention problems. Experiments over tasks with diverse
sequence length conditions show that DBA achieves state-of-the-art performance
compared with various strong baselines, with lower memory consumption and higher speed.
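As a rough illustration of the idea (not the paper's exact formulation), the NumPy sketch below compresses the key/value sequence with a projection computed from the input itself and optionally reduces the hidden dimension with a random Johnson-Lindenstrauss-style projection. The shapes, the softmax-based construction of the dynamic projection, and all names (`dynamic_low_rank_attention`, `Wp`, `r_dim`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_low_rank_attention(X, Wq, Wk, Wv, Wp, r_dim=None, rng=None):
    """Sketch of attention with an input-dependent low-rank projection along
    the sequence axis, in the spirit of DBA (shapes and the exact construction
    of the dynamic projection are assumptions, not the paper's formulation).

    X          : (n, d) input sequence
    Wq, Wk, Wv : (d, d) query/key/value weights
    Wp         : (d, k) weights producing the dynamic length-compression matrix
    r_dim      : optional reduced hidden dimension (Johnson-Lindenstrauss style)
    """
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # (n, d) each

    # Dynamic projection: the weights over sequence positions depend on X,
    # so each input sequence is compressed differently (unlike a fixed,
    # input-invariant projection).
    P = softmax((X @ Wp).T, axis=-1)              # (k, n)
    K_c, V_c = P @ K, P @ V                       # compressed to (k, d)

    # Optional hidden-dimension reduction via a random JL-style projection.
    if r_dim is not None:
        rng = np.random.default_rng(0) if rng is None else rng
        R = rng.standard_normal((d, r_dim)) / np.sqrt(r_dim)
        Q, K_c = Q @ R, K_c @ R                   # (n, r), (k, r)

    A = softmax(Q @ K_c.T / np.sqrt(Q.shape[-1])) # (n, k), linear in n
    return A @ V_c                                # (n, d)

# Toy usage: cost grows as O(n * k) in the sequence length n, not O(n^2).
rng = np.random.default_rng(0)
n, d, k = 512, 64, 32
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wp = (0.1 * rng.standard_normal(s) for s in [(d, d)] * 3 + [(d, k)])
out = dynamic_low_rank_attention(X, Wq, Wk, Wv, Wp, r_dim=16, rng=rng)
print(out.shape)  # (512, 64)
```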
Related papers
- LOCAL: Learning with Orientation Matrix to Infer Causal Structure from Time Series Data [13.390666123493409]
LOCAL is a highly efficient, easy-to-implement, and constraint-free method for recovering dynamic causal structures.
ACML generates causal masks using learnable priority vectors and the Gumbel-Sigmoid function.
DGPL transforms causal learning into decomposed matrix products, capturing the dynamic causal structure of high-dimensional data.
arXiv Detail & Related papers (2024-10-25T10:48:41Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to alleviate the lack of long-range dependency.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
arXiv Detail & Related papers (2023-10-09T17:05:25Z) - Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting--Full Version [50.43914511877446]
We propose a triangular, variable-specific attention to ensure high efficiency and accuracy.
We show that Triformer outperforms state-of-the-art methods w.r.t. both accuracy and efficiency.
arXiv Detail & Related papers (2022-04-28T20:41:49Z) - Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Existing methods such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively (a minimal fixed-projection sketch appears after this list).
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z) - Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z) - Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [153.49014114484424]
Bilinear pooling achieves great success in fine-grained visual recognition (FGVC).
Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features.
We propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation.
arXiv Detail & Related papers (2020-03-30T08:40:35Z)
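For contrast with DBA's dynamic projections, here is a minimal sketch of the fixed, input-invariant low-rank projection used by Linformer-style methods mentioned in the sketching entry above. The `(k, n)` matrix `E` and the function name are illustrative assumptions, not the cited papers' exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fixed_projection_attention(Q, K, V, E):
    """Linformer-style attention with an input-invariant projection.

    Q, K, V : (n, d) query/key/value matrices
    E       : (k, n) projection learned once and then frozen, so every
              sequence is compressed with the same position-wise weights
              (the input-invariance that DBA argues against).
    """
    K_c, V_c = E @ K, E @ V                       # (k, d) each
    A = softmax(Q @ K_c.T / np.sqrt(Q.shape[-1])) # (n, k) instead of (n, n)
    return A @ V_c                                # (n, d)
```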
This list is automatically generated from the titles and abstracts of the papers in this site.