AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with
Transformers
- URL: http://arxiv.org/abs/2302.14705v2
- Date: Mon, 1 May 2023 16:21:21 GMT
- Title: AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with
Transformers
- Authors: Shikhar Tuli and Niraj K. Jha
- Abstract summary: Self-attention-based transformer models have achieved tremendous success in the domain of natural language processing.
Previous works directly operate on large matrices involved in the attention operation, which limits hardware utilization.
We propose a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead.
- Score: 6.0093441900032465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-attention-based transformer models have achieved tremendous success in
the domain of natural language processing. Despite their efficacy, accelerating
the transformer is challenging due to its quadratic computational complexity
and large activation sizes. Existing transformer accelerators attempt to prune
its tokens to reduce memory access, albeit with high compute overheads.
Moreover, previous works directly operate on large matrices involved in the
attention operation, which limits hardware utilization. In order to address
these challenges, this work proposes a novel dynamic inference scheme,
DynaTran, which prunes activations at runtime with low overhead, substantially
reducing the number of ineffectual operations. This improves the throughput of
transformer inference. We further propose tiling the matrices in transformer
operations along with diverse dataflows to improve data reuse, thus enabling
higher energy efficiency. To effectively implement these methods, we propose
AccelTran, a novel accelerator architecture for transformers. Extensive
experiments with different models and benchmarks demonstrate that DynaTran
achieves higher accuracy than the state-of-the-art top-k hardware-aware pruning
strategy while attaining up to 1.2$\times$ higher sparsity. One of our proposed
accelerators, AccelTran-Edge, achieves 330K$\times$ higher throughput with
93K$\times$ lower energy requirement when compared to a Raspberry Pi device. On
the other hand, AccelTran-Server achieves 5.73$\times$ higher throughput and
3.69$\times$ lower energy consumption compared to the state-of-the-art
transformer co-processor, Energon. The simulation source code is available at
https://github.com/jha-lab/acceltran.
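The abstract does not spell out DynaTran's pruning criterion, so the snippet below is only a hypothetical sketch of the general idea: pruning activations at runtime with a cheap magnitude threshold, contrasted with the top-k baseline the paper compares against. Function names and the threshold value are illustrative assumptions, not taken from the paper or its repository.

```python
import numpy as np

def threshold_prune(acts: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """Zero out activations whose magnitude is below a threshold.
    One comparison per element, so the runtime overhead is low;
    tau is a hypothetical tunable parameter, not a value from the paper."""
    return np.where(np.abs(acts) >= tau, acts, 0.0)

def topk_prune(acts: np.ndarray, k: int) -> np.ndarray:
    """Baseline: keep only the k largest-magnitude values in each row
    (the 'top-k hardware-aware pruning' the abstract compares against).
    Needs a partial sort per row, hence higher compute overhead."""
    out = np.zeros_like(acts)
    idx = np.argpartition(-np.abs(acts), k - 1, axis=-1)[..., :k]
    np.put_along_axis(out, idx, np.take_along_axis(acts, idx, axis=-1), axis=-1)
    return out

x = np.random.randn(4, 16).astype(np.float32)   # toy activation tile
print("threshold sparsity:", float(np.mean(threshold_prune(x) == 0)))
print("top-k sparsity:    ", float(np.mean(topk_prune(x, k=8) == 0)))
```

A threshold test costs one comparison per value, while top-k requires a partial sort per row; under this reading, thresholding is the lower-overhead option, and its sparsity is data-dependent rather than fixed at k.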
Related papers
- Accelerator-driven Data Arrangement to Minimize Transformers Run-time on
Multi-core Architectures [5.46396577345121]
The increasing complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption.
We propose a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, which effectively minimizes off-chip data access.
Our approach achieves up to a 2.8x speedup when executing inference with state-of-the-art transformers.
arXiv Detail & Related papers (2023-12-20T13:01:25Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
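For readers unfamiliar with the pattern, here is a minimal, generic illustration of N:M structured sparsity (not the paper's IDP training procedure or STA hardware): in every group of M consecutive weights, at most N remain non-zero, which hardware can exploit with a small fixed-size index per group.

```python
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Enforce N:M structured sparsity along the last axis: within every
    group of m consecutive weights, keep the n largest-magnitude values
    and zero the rest. Generic sketch for illustration only."""
    w = weights.reshape(-1, m)                                   # blocks of m
    keep = np.argpartition(-np.abs(w), n - 1, axis=1)[:, :n]     # n largest per block
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return np.where(mask, w, 0.0).reshape(weights.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_24 = nm_prune(w, n=2, m=4)                                     # the common 2:4 pattern
assert (np.count_nonzero(w_24.reshape(-1, 4), axis=1) <= 2).all()
```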
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention [5.495006023171481]
Transformer models have revolutionized Natural Language Processing (NLP) and also show promising performance on Computer Vision (CV) tasks.
We propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention.
We demonstrate that Energon achieves $161\times$ and $8.4\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction.
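As a rough, generic illustration of dynamic sparse attention (not Energon's specific filtering pipeline, which this summary does not detail): attention scores are first estimated cheaply, only the most promising keys are kept for each query, and exact attention is then computed on that small subset.

```python
import numpy as np

def dynamic_sparse_attention(q, k, v, keep):
    """Generic sketch: estimate scores cheaply (low precision), keep the
    `keep` highest-scoring keys per query, then run exact softmax attention
    on that subset only. Illustrative; not Energon's actual algorithm."""
    approx = (q.astype(np.float16) @ k.astype(np.float16).T).astype(np.float32)
    top = np.argpartition(-approx, keep - 1, axis=-1)[:, :keep]   # selected key ids
    out = np.zeros_like(q)
    scale = 1.0 / np.sqrt(q.shape[-1])
    for i, ids in enumerate(top):                                 # exact attention on subset
        s = (q[i] @ k[ids].T) * scale
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ v[ids]
    return out

q, k, v = (np.random.randn(16, 64).astype(np.float32),
           np.random.randn(128, 64).astype(np.float32),
           np.random.randn(128, 64).astype(np.float32))
y = dynamic_sparse_attention(q, k, v, keep=32)   # each query attends to 32 of 128 keys
```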
arXiv Detail & Related papers (2021-10-18T13:42:43Z)
- Augmented Shortcuts for Vision Transformers [49.70151144700589]
We study the relationship between shortcuts and feature diversity in vision transformer models.
We present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts.
Experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method.
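Schematically, writing $f(\cdot)$ for a block's main branch and $T_i(\cdot)$ for the inserted learnable paths (notation assumed here for illustration, not taken from the paper), the augmented block computes $y = x + f(x) + \sum_{i=1}^{m} T_i(x)$ in place of the usual residual $y = x + f(x)$, so each shortcut carries additional trainable transformations alongside the identity path.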
arXiv Detail & Related papers (2021-06-30T09:48:30Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
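The computational core of this result is the classical fact that a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the Toeplitz matrix in a circulant one and using the FFT; the sketch below demonstrates that building block on its own (a generic numerical illustration, not the paper's kernelized-attention implementation).

```python
import numpy as np

def toeplitz_matvec(c: np.ndarray, r: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply a Toeplitz matrix T by a vector x in O(n log n).

    T is defined by its first column c and first row r (with r[0] == c[0]),
    e.g. a relative-position bias that depends only on i - j. T is embedded
    into a 2n x 2n circulant matrix, whose action is a circular convolution
    and therefore diagonal in the Fourier domain.
    """
    n = len(x)
    circ = np.concatenate([c, [0.0], r[:0:-1]])          # first column of the embedding
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# Sanity check against the dense product.
n = 8
c = np.random.randn(n)                                    # T[i, 0]
r = np.concatenate([[c[0]], np.random.randn(n - 1)])      # T[0, j]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = np.random.randn(n)
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```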
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Easy and Efficient Transformer: Scalable Inference Solution For large NLP model [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
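As a generic sketch of the cascade idea (the stand-in scoring functions below are illustrative; in the paper the rankers are built from intermediate layers of a single transformer, so early stages reuse shallow computation): each stage scores the surviving candidates and discards the lowest-scoring ones, so only a small fraction ever reaches the most expensive ranker.

```python
from typing import Callable, List, Sequence

def cascade_rank(candidates: Sequence[str],
                 rankers: Sequence[Callable[[str], float]],
                 keep_fraction: float = 0.5) -> List[str]:
    """Score candidates with a cascade of increasingly expensive rankers,
    pruning the lowest-scoring ones after every stage but the last.
    Generic illustration of cascaded ranking, not the paper's exact scheme."""
    survivors = list(candidates)
    for stage, score in enumerate(rankers):
        survivors.sort(key=score, reverse=True)
        if stage < len(rankers) - 1:
            survivors = survivors[:max(1, int(len(survivors) * keep_fraction))]
    return survivors

# Toy usage: a cheap length-based scorer followed by a (pretend) expensive one.
cheap = lambda s: -abs(len(s) - 40)
expensive = lambda s: float(sum(map(ord, s)) % 97)   # stand-in for a full transformer score
docs = [f"candidate answer {i} " * (i + 1) for i in range(10)]
print(cascade_rank(docs, [cheap, expensive]))
```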
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
- Transformer on a Diet [81.09119185568296]
Transformer has been widely used thanks to its ability to capture sequence information in an efficient way.
Recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness.
We explore three carefully-designed light Transformer architectures to figure out whether the Transformer with less computations could produce competitive results.
arXiv Detail & Related papers (2020-02-14T18:41:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.