Sparse Attention Acceleration with Synergistic In-Memory Pruning and
On-Chip Recomputation
- URL: http://arxiv.org/abs/2209.00606v1
- Date: Thu, 1 Sep 2022 17:18:19 GMT
- Title: Sparse Attention Acceleration with Synergistic In-Memory Pruning and
On-Chip Recomputation
- Authors: Amir Yazdanbakhsh, Ashkan Moradifirouzabadi, Zheng Li, Mingu Kang
- Abstract summary: The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite favorable performance, calculating these pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
- Score: 6.303594714446706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As its core computation, a self-attention mechanism gauges pairwise
correlations across the entire input sequence. Despite favorable performance,
calculating pairwise correlations is prohibitively costly. While recent work
has shown the benefits of runtime pruning of elements with low attention
scores, the quadratic complexity of self-attention mechanisms and their on-chip
memory capacity demands are overlooked. This work addresses these constraints
by architecting an accelerator, called SPRINT, which leverages the inherent
parallelism of ReRAM crossbar arrays to compute attention scores in an
approximate manner. Our design prunes the low attention scores using a
lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to
fetch only a small subset of relevant data to on-chip memory. To mitigate
potential negative repercussions for model accuracy, SPRINT re-computes the
attention scores for the few fetched entries in the digital domain. The combined in-memory
pruning and on-chip recompute of the relevant attention scores enables SPRINT
to transform quadratic complexity to a merely linear one. In addition, we
identify and leverage a dynamic spatial locality between the adjacent attention
operations even after pruning, which eliminates costly yet redundant data
fetches. We evaluate our proposed technique on a wide range of state-of-the-art
transformer models. On average, SPRINT yields a 7.5x speedup and a 19.6x energy
reduction with a total of 16KB of on-chip memory, while remaining virtually on
par with the accuracy of the baseline models (0.36% degradation on average).
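As a rough illustration of the flow the abstract describes (approximate in-memory scoring, analog-style thresholding, then exact on-chip recomputation over only the surviving entries), below is a minimal NumPy sketch. The quantization stand-in for the analog ReRAM computation, the threshold value, and all function names are assumptions made for illustration, not SPRINT's actual circuit-level design.

```python
import numpy as np

def quantize(x, bits=4):
    """Quantize-dequantize stand-in for the low-precision analog ReRAM computation."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def sprint_like_attention(q, K, V, threshold=0.0, bits=4):
    """Hypothetical prune-then-recompute attention for a single query vector q.

    1) Approximate scores are computed at low precision (emulating the ReRAM
       crossbar), 2) keys whose approximate score falls below `threshold` are
       pruned so only a small subset is fetched on chip, and 3) the scores of
       the surviving keys are recomputed exactly at full ("digital") precision.
    """
    d = q.shape[-1]

    # 1) Cheap approximate scores for every key, emulating the in-memory computation.
    approx = (quantize(q, bits) @ quantize(K, bits).T) / np.sqrt(d)

    # 2) In-memory pruning: fetch only keys whose approximate score clears the threshold.
    keep = np.flatnonzero(approx >= threshold)
    if keep.size == 0:                       # guard: always keep the best-scoring key
        keep = np.array([approx.argmax()])

    # 3) Exact on-chip recomputation over the fetched subset (linear in |keep|).
    exact = (q @ K[keep].T) / np.sqrt(d)
    w = np.exp(exact - exact.max())
    w /= w.sum()
    return w @ V[keep]

# Toy usage: 128 keys, 64-dim head; only keys clearing the threshold are fetched.
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K, V = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
out = sprint_like_attention(q, K, V, threshold=0.0)   # shape (64,)
```

Across a whole sequence, the exact on-chip work then scales with the number of surviving keys per query rather than with the full sequence length, which is the quadratic-to-linear reduction the abstract refers to.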
Related papers
- ELASTIC: Efficient Linear Attention for Sequential Interest Compression [5.689306819772134]
State-of-the-art sequential recommendation models rely heavily on the transformer's attention mechanism.
We propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression.
We conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders.
arXiv Detail & Related papers (2024-08-18T06:41:46Z) - Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference
of Deep Learning [5.41530201129053]
This paper proposes a novel memorization-based inference (MBI) that is compute free and only requires lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low dimensionality of the glimpse, our inference procedure stores key-value pairs comprising the glimpse location, patch vector, etc. in a table.
Computation is obviated during inference by reading key-value pairs out of the table, yielding compute-free inference by memorization.
arXiv Detail & Related papers (2023-07-14T21:01:59Z) - Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural
Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z) - Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves 6-22% higher ROUGE-1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture (a generic sketch of kernelized linear attention appears after this list).
arXiv Detail & Related papers (2022-11-08T08:14:11Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data be stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Channelized Axial Attention for Semantic Segmentation [70.14921019774793]
We propose the Channelized Axial Attention (CAA) to seamlessly integrate channel attention and axial attention with reduced computational complexity.
Our CAA not only requires far fewer computational resources than other dual-attention models such as DANet, but also outperforms the state-of-the-art ResNet-101-based segmentation models on all tested datasets.
arXiv Detail & Related papers (2021-01-19T03:08:03Z)