Adaptive Multi-Resolution Attention with Linear Complexity
- URL: http://arxiv.org/abs/2108.04962v1
- Date: Tue, 10 Aug 2021 23:17:16 GMT
- Title: Adaptive Multi-Resolution Attention with Linear Complexity
- Authors: Yao Zhang, Yunpu Ma, Thomas Seidl, Volker Tresp
- Abstract summary: We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
- Score: 18.64163036371161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have improved the state-of-the-art across numerous tasks in
sequence modeling. Besides the quadratic computational and memory complexity
w.r.t. the sequence length, the self-attention mechanism only processes
information at a single scale, i.e., all attention heads operate at the same
resolution, which limits the power of the Transformer. To remedy this,
we propose a novel and efficient structure named Adaptive Multi-Resolution
Attention (AdaMRA for short), which scales linearly with sequence length in
terms of both time and space. Specifically, we leverage a multi-resolution multi-head
attention mechanism, enabling attention heads to capture long-range contextual
information in a coarse-to-fine fashion. Moreover, to capture the potential
relations between query representation and clues of different attention
granularities, we leave the decision of which resolution of attention to use to
query, which further improves the model's capacity compared to vanilla
Transformer. In an effort to reduce complexity, we adopt kernel attention
without degrading the performance. Extensive experiments on several benchmarks
demonstrate the effectiveness and efficiency of our model by achieving a
state-of-the-art performance-efficiency-memory trade-off. To facilitate AdaMRA
utilization by the scientific community, the code implementation will be made
publicly available.
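The mechanism can be pictured with a minimal PyTorch sketch. It assumes an ELU+1 feature map for the kernel attention, average pooling to produce coarser key/value resolutions, and a learned softmax router through which each query weights the resolutions; these are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_map(x):
    # Positive feature map commonly used for kernel/linear attention.
    return F.elu(x) + 1

class MultiResolutionKernelAttention(nn.Module):
    """Illustrative sketch: each head attends over keys/values pooled at a
    different resolution, and each query softly selects a resolution."""

    def __init__(self, dim, num_heads=4, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        assert num_heads == len(pool_sizes) and dim % num_heads == 0
        self.h, self.d, self.pool_sizes = num_heads, dim // num_heads, pool_sizes
        self.qkv = nn.Linear(dim, 3 * dim)
        self.router = nn.Linear(dim, num_heads)   # query-dependent resolution choice
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = feature_map(q.view(B, N, self.h, self.d))
        head_outs = []
        for i, p in enumerate(self.pool_sizes):
            ki = k.view(B, N, self.h, self.d)[:, :, i]            # (B, N, d)
            vi = v.view(B, N, self.h, self.d)[:, :, i]
            if p > 1:                                             # coarsen keys/values
                ki = F.avg_pool1d(ki.transpose(1, 2), p, ceil_mode=True).transpose(1, 2)
                vi = F.avg_pool1d(vi.transpose(1, 2), p, ceil_mode=True).transpose(1, 2)
            ki = feature_map(ki)
            kv = torch.einsum("bnd,bne->bde", ki, vi)             # (B, d, d) summary, linear in N
            z = 1.0 / (torch.einsum("bnd,bd->bn", q[:, :, i], ki.sum(1)) + 1e-6)
            head_outs.append(torch.einsum("bnd,bde,bn->bne", q[:, :, i], kv, z))
        heads = torch.stack(head_outs, dim=2)                     # (B, N, h, d)
        gate = self.router(x).softmax(-1).unsqueeze(-1)           # query picks its resolution
        return self.out((heads * gate).reshape(B, N, -1))
```

Because every head keeps only a d-by-d key/value summary, time and memory in this sketch grow linearly with the sequence length.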
Related papers
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
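The scoring-plus-top-k idea can be sketched as below. The sketch simplifies in two ways: it uses a hard top-k (the actual SPARSEK operator is a differentiable relaxation) and shares one selection across all queries rather than selecting per query; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class TopKSparseAttention(nn.Module):
    """Minimal sketch: a scoring network ranks key/value pairs and queries
    attend only to the k highest-scoring ones, so cost is O(N*k), not O(N^2)."""

    def __init__(self, dim, k=32):
        super().__init__()
        self.k = k
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        self.scorer = nn.Linear(dim, 1)                            # scoring network over KV pairs

    def forward(self, x):                                          # x: (B, N, dim)
        B, N, D = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = self.scorer(x).squeeze(-1)                        # (B, N)
        idx = scores.topk(min(self.k, N), dim=-1).indices          # hard top-k selection
        k_sel = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        v_sel = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        attn = torch.softmax(q @ k_sel.transpose(1, 2) / D ** 0.5, dim=-1)
        return attn @ v_sel
```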
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
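A rough sketch of the combination, assuming a short and a long depthwise convolution plus an ELU+1 kernelised attention; kernel sizes and the exact layering are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortLongConvLinearAttention(nn.Module):
    """Sketch: depthwise short/long convolutions capture local and global
    structure, then a kernelised (linear) attention mixes the result."""

    def __init__(self, dim, short_kernel=3, long_kernel=63):
        super().__init__()
        self.short = nn.Conv1d(dim, dim, short_kernel, padding=short_kernel // 2, groups=dim)
        self.long = nn.Conv1d(dim, dim, long_kernel, padding=long_kernel // 2, groups=dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                         # x: (B, N, dim)
        c = x.transpose(1, 2)
        x = x + (self.short(c) + self.long(c)).transpose(1, 2)    # convolution branch
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                         # positive feature map
        kv = torch.einsum("bnd,bne->bde", k, v)                   # O(N) key/value summary
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))
```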
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependencies.
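A toy version of the idea, replacing hard vector quantization (with straight-through gradients) by a soft assignment for brevity; the codebook size and projections are illustrative.

```python
import torch
import torch.nn as nn

class CodebookAttention(nn.Module):
    """Sketch: summarise the sequence into a length-fixed set of code vectors,
    then let every position attend over that fixed-size memory."""

    def __init__(self, dim, codebook_size=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):                                          # x: (B, N, dim)
        B, N, D = x.shape
        # Soft token-to-code assignment (the paper uses hard VQ).
        assign = torch.softmax(-torch.cdist(x, self.codebook.expand(B, -1, -1)), dim=-1)
        weight = assign / (assign.sum(dim=1, keepdim=True) + 1e-6)
        memory = weight.transpose(1, 2) @ x                        # (B, C, dim) fixed-length summary
        # Attention over the codebook-sized memory: O(N*C) instead of O(N^2).
        attn = torch.softmax(self.q_proj(x) @ self.k_proj(memory).transpose(1, 2) / D ** 0.5, dim=-1)
        return attn @ self.v_proj(memory)
```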
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search [49.81353382211113]
We address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs.
We develop a multi-target multi-branch supernet method, which fully utilizes the advantages of high-resolution features.
We present a series of models obtained via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers.
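The basic supernet choice between a light-weight convolution and a self-attention branch can be illustrated with a DARTS-style mixed block; HyCTAS's actual search space, multi-branch layout, and multi-objective search are much richer than this sketch.

```python
import torch
import torch.nn as nn

class SearchableHybridBlock(nn.Module):
    """Sketch of a supernet cell mixing a depthwise-separable convolution and a
    self-attention branch with learnable architecture weights."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # light-weight depthwise conv
            nn.Conv2d(dim, dim, 1),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(2))            # architecture weights

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        conv_out = self.conv(x)
        tokens = x.flatten(2).transpose(1, 2)                # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_out = attn_out.transpose(1, 2).view(B, C, H, W)
        w = torch.softmax(self.alpha, dim=0)
        return w[0] * conv_out + w[1] * attn_out             # argmax over w after search
```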
arXiv Detail & Related papers (2024-03-15T15:47:54Z) - Multi-Hierarchical Surrogate Learning for Structural Dynamical Crash Simulations Using Graph Convolutional Neural Networks [5.582881461692378]
We propose a multi-hierarchical framework for the structured creation of a series of surrogate models of a kart frame.
For multiscale phenomena, macroscale features are captured on a coarse surrogate, whereas microscale effects are resolved by finer ones.
We train a graph-convolutional neural network-based surrogate that learns parameter-dependent low-dimensional latent dynamics on the coarsest representation.
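A minimal sketch of such a surrogate, assuming a dense adjacency-matrix graph convolution, a pooled latent state, and an MLP for the parameter-dependent latent dynamics; all dimensions and the single-vector decoder are made up for illustration.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Dense graph convolution: aggregate neighbours with a (normalised)
    adjacency matrix and apply a linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                       # x: (N, in_dim), adj: (N, N)
        return torch.relu(self.lin(adj @ x))

class CoarseSurrogate(nn.Module):
    """Sketch: encode the coarse-mesh state, advance a low-dimensional latent
    conditioned on simulation parameters, and decode back to nodal values."""

    def __init__(self, node_dim=3, latent_dim=16, param_dim=4, hidden=64):
        super().__init__()
        self.enc = GraphConv(node_dim, hidden)
        self.to_latent = nn.Linear(hidden, latent_dim)
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.dec = nn.Linear(latent_dim, node_dim)

    def forward(self, x, adj, params):               # params: (param_dim,)
        z = self.to_latent(self.enc(x, adj).mean(dim=0))        # pooled latent state
        z_next = z + self.dynamics(torch.cat([z, params]))      # parameter-dependent step
        return self.dec(z_next).expand_as(x)                    # toy per-node decoder
```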
arXiv Detail & Related papers (2024-02-14T15:22:59Z) - Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
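The feature-wise, lagged mechanism can be sketched as follows, with keys circularly shifted by a few lag values and a D-by-D cross-covariance score per lag; the paper's sub-series aggregation and normalisation details are omitted, and the lag set here is arbitrary.

```python
import torch
import torch.nn as nn

class CorrelatedAttention(nn.Module):
    """Simplified feature-wise (cross-covariance) attention with lags."""

    def __init__(self, dim, lags=(0, 1, 2, 4)):
        super().__init__()
        self.lags = lags
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):                                        # x: (B, T, D) multivariate series
        B, T, D = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        outs = []
        for lag in self.lags:
            k_lag = torch.roll(k, shifts=lag, dims=1)            # lagged keys
            cov = torch.einsum("btd,bte->bde", q, k_lag) / T     # (B, D, D) cross-covariance
            attn = torch.softmax(cov, dim=-1)
            outs.append(torch.einsum("bde,bte->btd", attn, v))   # mix value channels
        return torch.stack(outs).mean(0)                         # aggregate over lags
```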
arXiv Detail & Related papers (2023-11-20T17:35:44Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
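A sketch of linear attention with a "focusing" feature map that raises non-negative features to a power and rescales them to preserve the norm; the depthwise convolution that restores feature diversity is included only as a simple 1D stand-in, so details differ from the original module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def focused_map(x, p=3):
    # Sharpen the feature distribution while keeping the kernel trick applicable.
    x = F.relu(x) + 1e-6
    xp = x ** p
    return xp * (x.norm(dim=-1, keepdim=True) / (xp.norm(dim=-1, keepdim=True) + 1e-6))

class FocusedLinearAttention(nn.Module):
    """Sketch of linear attention with a focused feature map and a local
    depthwise-convolution branch."""

    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.dwc = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                         # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = focused_map(q), focused_map(k)
        kv = torch.einsum("bnd,bne->bde", k, v)                   # O(N) in sequence length
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        y = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        y = y + self.dwc(v.transpose(1, 2)).transpose(1, 2)       # local detail branch
        return self.out(y)
```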
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
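A bare-bones sketch of the idea, with a per-channel exponential moving average feeding single-head attention and a learned gate merging the result with the input; Mega's multi-dimensional damped EMA and normalisation details are simplified away.

```python
import torch
import torch.nn as nn

class EMAGatedAttention(nn.Module):
    """Sketch: EMA-smoothed queries/keys, single-head attention, gated merge."""

    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.rand(dim))               # EMA decay per channel
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                                        # x: (B, T, dim)
        B, T, D = x.shape
        a = torch.sigmoid(self.alpha)
        ema, state = [], torch.zeros(B, D, device=x.device)
        for t in range(T):                                       # h_t = a*x_t + (1-a)*h_{t-1}
            state = a * x[:, t] + (1 - a) * state
            ema.append(state)
        ema = torch.stack(ema, dim=1)
        q, k, v = self.q_proj(ema), self.k_proj(ema), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)   # single head
        g = torch.sigmoid(self.gate(x))
        return g * (attn @ v) + (1 - g) * x                      # gated merge with residual
```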
arXiv Detail & Related papers (2022-09-21T20:52:17Z) - Rethinking Attention Mechanism in Time Series Classification [6.014777261874646]
We improve the efficiency and performance of the attention mechanism by proposing our flexible multi-head linear attention (FMLA).
We propose a simple but effective mask mechanism that helps reduce the noise influence in time series and decrease the redundancy of the proposed FMLA.
We conduct extensive experiments on 85 UCR2018 datasets to compare our algorithm with 11 well-known ones, and the results show that our algorithm achieves comparable performance in terms of top-1 accuracy.
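The general combination of a learned mask with kernelised multi-head attention can be sketched as below; this only conveys the idea, and FMLA's flexible head design differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinearAttention(nn.Module):
    """Sketch: a per-position mask down-weights noisy time steps before
    kernelised multi-head (linear) attention is applied."""

    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.h, self.d = heads, dim // heads
        self.mask = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                        # x: (B, T, dim)
        B, T, _ = x.shape
        x = x * self.mask(x)                                     # suppress noisy time steps
        q, k, v = (t.view(B, T, self.h, self.d) for t in self.qkv(x).chunk(3, -1))
        q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature map
        kv = torch.einsum("bthd,bthe->bhde", k, v)               # per-head O(T) summary
        z = 1.0 / (torch.einsum("bthd,bhd->bth", q, k.sum(1)) + 1e-6)
        y = torch.einsum("bthd,bhde,bth->bthe", q, kv, z)
        return self.out(y.reshape(B, T, -1))
```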
arXiv Detail & Related papers (2022-07-14T07:15:06Z) - A Practical Survey on Faster and Lighter Transformers [0.9176056742068811]
The Transformer is a model solely based on the attention mechanism that is able to relate any two positions of the input sequence.
It has improved the state-of-the-art across numerous sequence modelling tasks.
However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length.
arXiv Detail & Related papers (2021-03-26T17:54:47Z)