DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention
- URL: http://arxiv.org/abs/2211.16368v1
- Date: Thu, 24 Nov 2022 03:06:36 GMT
- Title: DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention
- Authors: Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang
- Abstract summary: We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
- Score: 53.02648818164273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many studies have been conducted to improve the efficiency of Transformer
from quadratic to linear. Among them, the low-rank-based methods aim to learn
projection matrices that compress the sequence length. However, once learned,
these projection matrices are fixed, so every sequence is compressed with the
same coefficients for tokens at a given position. Adopting such
input-invariant projections ignores the fact that the most informative part of
a sequence varies from sequence to sequence, thus failing to preserve the most
useful information that lies in varied positions. In addition, previous
efficient Transformers focus only on the influence of the sequence length while
neglecting the effect of the hidden state dimension. To address the aforementioned
problems, we present an efficient yet effective attention mechanism, namely the
Dynamic Bilinear Low-Rank Attention (DBA), which compresses the sequence length
by input-sensitive dynamic projection matrices and achieves linear time and
space complexity by jointly optimizing the sequence length and hidden state
dimension while maintaining state-of-the-art performance. Specifically, we
first theoretically demonstrate that the sequence length can be compressed
non-destructively from a novel perspective of information theory, with
compression matrices dynamically determined by the input sequence. Furthermore,
we show that the hidden state dimension can be approximated by extending the
Johnson-Lindenstrauss lemma, optimizing the attention in bilinear form.
Theoretical analysis shows that DBA is proficient in capturing high-order
relations in cross-attention problems. Experiments over tasks with diverse
sequence length conditions show that DBA achieves state-of-the-art performance
compared with various strong baselines, with lower memory consumption and higher speed.
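As a rough illustration of the idea (not the paper's exact formulation), the NumPy sketch below compresses the key/value sequence with a projection computed from the input itself and optionally reduces the hidden dimension with a random Johnson-Lindenstrauss-style projection. The shapes, the softmax-based construction of the dynamic projection, and all names (`dynamic_low_rank_attention`, `Wp`, `r_dim`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_low_rank_attention(X, Wq, Wk, Wv, Wp, r_dim=None, rng=None):
    """Sketch of attention with an input-dependent low-rank projection along
    the sequence axis, in the spirit of DBA (shapes and the exact construction
    of the dynamic projection are assumptions, not the paper's formulation).

    X          : (n, d) input sequence
    Wq, Wk, Wv : (d, d) query/key/value weights
    Wp         : (d, k) weights producing the dynamic length-compression matrix
    r_dim      : optional reduced hidden dimension (Johnson-Lindenstrauss style)
    """
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # (n, d) each

    # Dynamic projection: the weights over sequence positions depend on X,
    # so each input sequence is compressed differently (unlike a fixed,
    # input-invariant projection).
    P = softmax((X @ Wp).T, axis=-1)              # (k, n)
    K_c, V_c = P @ K, P @ V                       # compressed to (k, d)

    # Optional hidden-dimension reduction via a random JL-style projection.
    if r_dim is not None:
        rng = np.random.default_rng(0) if rng is None else rng
        R = rng.standard_normal((d, r_dim)) / np.sqrt(r_dim)
        Q, K_c = Q @ R, K_c @ R                   # (n, r), (k, r)

    A = softmax(Q @ K_c.T / np.sqrt(Q.shape[-1])) # (n, k), linear in n
    return A @ V_c                                # (n, d)

# Toy usage: cost grows as O(n * k) in the sequence length n, not O(n^2).
rng = np.random.default_rng(0)
n, d, k = 512, 64, 32
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wp = (0.1 * rng.standard_normal(s) for s in [(d, d)] * 3 + [(d, k)])
out = dynamic_low_rank_attention(X, Wq, Wk, Wv, Wp, r_dim=16, rng=rng)
print(out.shape)  # (512, 64)
```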
Related papers
- LOCAL: Learning with Orientation Matrix to Infer Causal Structure from Time Series Data [13.390666123493409]
LOCAL is a highly efficient, easy-to-implement, and constraint-free method for recovering dynamic causal structures.
ACML generates causal masks using learnable priority vectors and the Gumbel-Sigmoid function.
DGPL transforms causal learning into decomposed matrix products, capturing the dynamic causal structure of high-dimensional data.
arXiv Detail & Related papers (2024-10-25T10:48:41Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to alleviate the lack of long-range dependency.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
arXiv Detail & Related papers (2023-10-09T17:05:25Z) - Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting--Full Version [50.43914511877446]
We propose a triangular, variable-specific attention to ensure high efficiency and accuracy.
We show that Triformer outperforms state-of-the-art methods w.r.t. both accuracy and efficiency.
arXiv Detail & Related papers (2022-04-28T20:41:49Z) - Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Existing methods such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively (a minimal fixed-projection sketch appears after this list).
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z) - Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformer to increase the prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z) - Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [153.49014114484424]
Bilinear pooling achieves great success in fine-grained visual recognition (FGVC).
Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features.
We propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation.
arXiv Detail & Related papers (2020-03-30T08:40:35Z)
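For contrast with DBA's dynamic projections, here is a minimal sketch of the fixed, input-invariant low-rank projection used by Linformer-style methods mentioned in the sketching entry above. The `(k, n)` matrix `E` and the function name are illustrative assumptions, not the cited papers' exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fixed_projection_attention(Q, K, V, E):
    """Linformer-style attention with an input-invariant projection.

    Q, K, V : (n, d) query/key/value matrices
    E       : (k, n) projection learned once and then frozen, so every
              sequence is compressed with the same position-wise weights
              (the input-invariance that DBA argues against).
    """
    K_c, V_c = E @ K, E @ V                       # (k, d) each
    A = softmax(Q @ K_c.T / np.sqrt(Q.shape[-1])) # (n, k) instead of (n, n)
    return A @ V_c                                # (n, d)
```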
This list is automatically generated from the titles and abstracts of the papers in this site.