Exploring Attention Map Reuse for Efficient Transformer Neural Networks
- URL: http://arxiv.org/abs/2301.12444v1
- Date: Sun, 29 Jan 2023 13:38:45 GMT
- Title: Exploring Attention Map Reuse for Efficient Transformer Neural Networks
- Authors: Kyuhong Shim, Jungwook Choi, Wonyong Sung
- Abstract summary: Transformer-based deep neural networks have achieved great success in various sequence applications.
The key module is self-attention (SA), which extracts features from the entire sequence regardless of the distance between positions.
Recently, attention map reuse, which groups multiple SA layers to share one attention map, has been proposed and achieved significant speedup for speech recognition models.
- Score: 18.335207404178547
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer-based deep neural networks have achieved great success in various
sequence applications due to their powerful ability to model long-range
dependency. The key module of Transformer is self-attention (SA) which extracts
features from the entire sequence regardless of the distance between positions.
Although SA helps Transformer perform particularly well on long-range tasks,
SA requires quadratic computation and memory complexity with the input sequence
length. Recently, attention map reuse, which groups multiple SA layers to share
one attention map, has been proposed and achieved significant speedup for
speech recognition models. In this paper, we provide a comprehensive study on
attention map reuse focusing on its ability to accelerate inference. We compare
the method with other SA compression techniques and conduct a breakdown
analysis of its advantages for a long sequence. We demonstrate the
effectiveness of attention map reuse by measuring the latency on both CPU and
GPU platforms.
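To make the reuse idea concrete, below is a minimal sketch of grouped attention map sharing, assuming the simplest policy in which the first self-attention layer of a group computes the attention map and the remaining layers reuse it while still computing their own value projections. Class names, the grouping policy, and all sizes are illustrative, not the paper's implementation.

```python
# Minimal sketch of attention map reuse (single head, batch omitted), assuming
# the first layer of a group computes the attention map and the remaining
# layers in the group reuse it, computing only new value projections.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReuseSelfAttention(nn.Module):
    def __init__(self, dim, compute_map=True):
        super().__init__()
        self.compute_map = compute_map          # False -> reuse a cached map
        self.scale = dim ** -0.5
        if compute_map:
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, attn=None):
        # x: (seq_len, dim); attn must be supplied when compute_map is False
        if self.compute_map:
            scores = self.q(x) @ self.k(x).transpose(0, 1) * self.scale
            attn = F.softmax(scores, dim=-1)    # (seq_len, seq_len), cached for the group
        # Reusing `attn` skips the quadratic QK^T product and softmax for this layer.
        return self.out(attn @ self.v(x)), attn


# A group of 3 layers sharing one attention map.
layers = nn.ModuleList(
    [ReuseSelfAttention(64, compute_map=(i == 0)) for i in range(3)]
)
x, attn = torch.randn(128, 64), None
for layer in layers:
    x, attn = layer(x, attn)
```

Since the reused layers skip the score computation and the softmax, the quadratic part of the cost is paid once per group rather than once per layer, which is where the reported speedup comes from.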
Related papers
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [21.808835887740543]
We propose SageAttention, a highly efficient and accurate quantization method for attention.
Our approach incurs almost no end-to-end metrics loss across diverse models.
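A rough sketch of the general 8-bit attention recipe implied above: quantize Q and K to INT8, accumulate their product at higher precision, and dequantize before the softmax and the PV product. This is a hedged illustration of the technique's shape, not SageAttention's actual kernel, which includes further accuracy-preserving refinements.

```python
# Generic 8-bit attention sketch (not SageAttention's exact algorithm).
import torch


def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns quantized values and scale."""
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def int8_attention(q, k, v):
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    # A real kernel would run this matmul on INT8 tensor cores with int32
    # accumulation; here it is emulated in float for portability.
    scores = (q8.float() @ k8.float().T) * (sq * sk) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v      # softmax and PV stay in float


q, k, v = (torch.randn(128, 64) for _ in range(3))
out = int8_attention(q, k, v)
```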
arXiv Detail & Related papers (2024-10-03T10:25:23Z)
- CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting [50.23240107430597]
We design a special Transformer, i.e., Channel Aligned Robust Blend Transformer (CARD for short), that addresses key shortcomings of CI type Transformer in time series forecasting.
First, CARD introduces a channel-aligned attention structure that allows it to capture both temporal correlations among signals and dynamical dependence among multiple variables over time.
Second, in order to efficiently utilize the multi-scale knowledge, we design a token blend module to generate tokens with different resolutions.
Third, we introduce a robust loss function for time series forecasting to alleviate the potential overfitting issue.
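As a loose illustration of the multi-resolution aspect of the token blend module above, the hypothetical snippet below merges adjacent tokens by average pooling to obtain coarser token sets alongside the original one; the real module is more elaborate, and the pooling choice here is purely an assumption.

```python
# Hedged illustration of building multi-resolution token sets from a series.
import torch
import torch.nn.functional as F


def blend_tokens(x, scales=(1, 2, 4)):
    # x: (batch, seq_len, dim) -> list of (batch, seq_len / s, dim) token sets
    out = []
    for s in scales:
        if s == 1:
            out.append(x)
        else:
            pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
            out.append(pooled.transpose(1, 2))
    return out


tokens = blend_tokens(torch.randn(8, 96, 16))
print([t.shape for t in tokens])  # three resolutions of the same series
```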
arXiv Detail & Related papers (2023-05-20T05:16:31Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving classification capacity on multivariate time series classification tasks.
It offers three merits: (1) it learns hierarchical multi-scale representations from time series data, (2) it inherits the strengths of both transformers and convolutional networks, and (3) it tackles the efficiency challenges incurred by the self-attention mechanism.
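A hedged sketch of how merits (1) and (2) can fit together: each stage shortens the sequence with a strided convolution before applying a Transformer encoder layer, which also reduces the quadratic self-attention cost. Stage widths, strides, and head counts below are arbitrary illustrative choices, not FormerTime's configuration.

```python
# Illustrative hierarchical conv + Transformer encoder for time series.
import torch
import torch.nn as nn


class HierarchicalStage(nn.Module):
    def __init__(self, in_dim, out_dim, stride):
        super().__init__()
        self.down = nn.Conv1d(in_dim, out_dim, kernel_size=3, stride=stride, padding=1)
        self.encoder = nn.TransformerEncoderLayer(out_dim, nhead=4, batch_first=True)

    def forward(self, x):                        # x: (batch, seq_len, in_dim)
        x = self.down(x.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x)                   # (batch, seq_len / stride, out_dim)


model = nn.Sequential(HierarchicalStage(8, 32, 2), HierarchicalStage(32, 64, 2))
features = model(torch.randn(4, 256, 8))         # sequence length: 256 -> 128 -> 64
```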
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
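A simplified sketch of the paired-attention idea: one branch attends across spatial positions (tokens) and the other across channels, with the two outputs fused. The actual EPA block additionally shares weights between the branches and projects keys/values to keep complexity low; those details are omitted here, and all module names are illustrative.

```python
# Simplified spatial + channel "paired attention" sketch (not the EPA block itself).
import torch
import torch.nn as nn


class PairedAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_q = nn.Linear(dim, dim)
        self.channel_k = nn.Linear(dim, dim)
        self.channel_v = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                                # x: (batch, tokens, dim)
        sp, _ = self.spatial(x, x, x)                    # attention across tokens
        q, k, v = self.channel_q(x), self.channel_k(x), self.channel_v(x)
        attn = torch.softmax((q.transpose(1, 2) @ k) / x.shape[1] ** 0.5, dim=-1)
        ch = (attn @ v.transpose(1, 2)).transpose(1, 2)  # attention across channels
        return self.fuse(torch.cat([sp, ch], dim=-1))


out = PairedAttention(32)(torch.randn(2, 64, 32))        # (2, 64, 32)
```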
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
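A toy version of attention diffusion to illustrate the mechanism: one-hop attention is computed only on a sparse neighborhood mask and then propagated for several hops, so distant tokens interact within a single layer. The random mask, hop count, and decay factor below are assumptions for illustration; the paper's expander-graph construction and weighting differ.

```python
# Toy multi-hop attention diffusion over a sparse mask.
import torch


def diffusion_attention(q, k, v, mask, hops=3, gamma=0.5):
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    a = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)  # one-hop sparse attention
    out, a_power, weight = torch.zeros_like(v), a, 1.0
    for _ in range(hops):                     # roughly sum of decayed powers of A applied to V
        out = out + weight * (a_power @ v)
        a_power, weight = a_power @ a, weight * gamma
    return out


n, d = 64, 32
mask = torch.rand(n, n) < 0.1                 # sparse random "graph" edges
mask |= torch.eye(n, dtype=torch.bool)        # keep self-loops so every row is defined
q, k, v = (torch.randn(n, d) for _ in range(3))
out = diffusion_attention(q, k, v, mask)
```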
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences [16.332650428422443]
We propose SALO to enable hybrid sparse attention mechanisms for long sequences.
SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator.
We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations.
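SALO itself is a hardware design, but the hybrid sparse pattern its data scheduler maps onto the accelerator can be pictured as a boolean mask combining a sliding window with a few global tokens. The window size and number of global tokens below are arbitrary illustrative values.

```python
# Boolean mask for a hybrid sparse attention pattern (sliding window + global tokens).
import torch


def hybrid_sparse_mask(seq_len, window=8, num_global=2):
    idx = torch.arange(seq_len)
    mask = (idx[:, None] - idx[None, :]).abs() <= window   # sliding-window band
    mask[:num_global, :] = True                             # global tokens attend everywhere
    mask[:, :num_global] = True                             # and are attended by everyone
    return mask


mask = hybrid_sparse_mask(128)
print(mask.float().mean())   # fraction of score entries actually computed
```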
arXiv Detail & Related papers (2022-06-29T12:01:19Z)
- Efficient Long-Range Attention Network for Image Super-resolution [25.51377161557467]
We propose an efficient long-range attention network (ELAN) for image super-resolution (SR).
We first employ shift convolution (shift-conv) to effectively extract the image's local structural information while maintaining the same level of complexity as a 1x1 convolution.
A highly efficient long-range attention block (ELAB) is then built by simply cascading two shift-conv layers with a group-wise multi-scale self-attention (GMSA) module.
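A small sketch of a shift convolution as described above: channel groups are shifted by one pixel in different directions and then mixed by a 1x1 convolution, so spatial context is gathered at roughly the cost of a 1x1 conv. The group split and shift directions are one common choice, not necessarily ELAN's exact configuration.

```python
# Shift convolution sketch: shift channel groups, then mix with a 1x1 conv.
import torch
import torch.nn as nn


class ShiftConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, channels, H, W)
        g = x.shape[1] // 5
        x = x.clone()
        # torch.roll wraps around; a real shift op would zero-fill the border.
        x[:, 0*g:1*g] = torch.roll(x[:, 0*g:1*g], shifts=1, dims=2)    # down
        x[:, 1*g:2*g] = torch.roll(x[:, 1*g:2*g], shifts=-1, dims=2)   # up
        x[:, 2*g:3*g] = torch.roll(x[:, 2*g:3*g], shifts=1, dims=3)    # right
        x[:, 3*g:4*g] = torch.roll(x[:, 3*g:4*g], shifts=-1, dims=3)   # left
        return self.conv1x1(x)                 # remaining channels stay unshifted


out = ShiftConv(20)(torch.randn(1, 20, 32, 32))
```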
arXiv Detail & Related papers (2022-03-13T16:17:48Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
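A hedged sketch of the long-short combination: each query attends to a local window (short-term attention) and to a small set of landmark tokens produced by dynamically projecting the whole sequence (long-range attention). The fusion here simply concatenates the two key/value sets, whereas the paper normalizes and combines them more carefully; all sizes are illustrative.

```python
# Illustrative long-short attention: windowed local attention + projected landmarks.
import torch
import torch.nn as nn


class LongShortAttention(nn.Module):
    def __init__(self, dim, window=8, landmarks=16):
        super().__init__()
        self.window = window
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, landmarks)   # dynamic projection weights

    def forward(self, x):                       # x: (seq_len, dim)
        n, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        p = torch.softmax(self.proj(x), dim=0)            # (n, landmarks)
        k_long, v_long = p.T @ k, p.T @ v                  # compress sequence to landmarks
        idx = torch.arange(n)
        local = (idx[:, None] - idx[None, :]).abs() <= self.window
        scores = torch.cat([
            (q @ k.T).masked_fill(~local, float("-inf")),  # short-term, windowed
            q @ k_long.T,                                  # long-range, linear in n
        ], dim=-1) / d ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return attn @ torch.cat([v, v_long], dim=0)


out = LongShortAttention(32)(torch.randn(128, 32))
```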
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- SA-Net: Shuffle Attention for Deep Convolutional Neural Networks [0.0]
We propose an efficient Shuffle Attention (SA) module for deep convolutional networks.
The proposed SA module is efficient yet effective: against a ResNet50 backbone, SA adds only 300 parameters (vs. the backbone's 25.56M) and 2.76e-3 GFLOPs (vs. 4.12 GFLOPs).
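A much-simplified sketch of the grouped attention-plus-shuffle idea: channels are gated by a lightweight signal computed from global average pooling, and a channel shuffle then mixes information across groups. The real SA module pairs a channel branch with a spatial branch inside each group; this stripped-down version only conveys why the parameter overhead stays tiny.

```python
# Stripped-down channel gating + channel shuffle (not the full SA module).
import torch
import torch.nn as nn


class ShuffleAttentionLite(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        self.gate = nn.Parameter(torch.zeros(1, channels, 1, 1))  # per-channel scale
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        pooled = x.mean(dim=(2, 3), keepdim=True)        # global average pooling
        x = x * torch.sigmoid(self.gate * pooled + self.bias)
        # Channel shuffle: (B, groups, C // groups, H, W) -> swap -> flatten back.
        x = x.view(b, self.groups, c // self.groups, h, w).transpose(1, 2)
        return x.reshape(b, c, h, w)


out = ShuffleAttentionLite(64)(torch.randn(2, 64, 16, 16))
```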
arXiv Detail & Related papers (2021-01-30T15:23:17Z)
- SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection [51.376723069962]
We present Sparse Adaptive Connection (SAC), a method for accelerating and structuring self-attention.
In SAC, the input sequence is regarded as a graph, and attention operations are performed only between linked nodes.
We show that SAC is competitive with state-of-the-art models while significantly reducing memory cost.
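A sketch of graph-restricted attention in this spirit: attention weights are kept only for pairs of nodes that share an edge. In SAC the links are constructed adaptively rather than fixed; here the edge list is simply given as input, and all sizes are illustrative.

```python
# Attention restricted to the links of a given graph (edges supplied, not learned).
import torch


def sparse_connection_attention(q, k, v, edges):
    # edges: (2, E) tensor of (source, target) node indices
    n = q.shape[0]
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[edges[0], edges[1]] = True
    mask |= torch.eye(n, dtype=torch.bool)               # always keep self-edges
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    attn = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    return attn @ v


n, d = 32, 16
edges = torch.randint(0, n, (2, 64))                      # random illustrative links
q, k, v = (torch.randn(n, d) for _ in range(3))
out = sparse_connection_attention(q, k, v, edges)
```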
arXiv Detail & Related papers (2020-03-22T07:58:44Z)