SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers
- URL: http://arxiv.org/abs/2304.03986v2
- Date: Tue, 25 Apr 2023 10:29:58 GMT
- Title: SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers
- Authors: Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, and Muhammad Shafique
- Abstract summary: Quantized Transformers' compute-intensive operations pose enormous challenges for their deployment in resource-constrained EdgeAI / tinyML devices.
We propose SwiftTron, an efficient specialized hardware accelerator designed for Quantized Transformers.
Our accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64 mW of power and occupying an area of 273 mm^2.
- Score: 11.631442682756203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers' compute-intensive operations pose enormous challenges for their
deployment in resource-constrained EdgeAI / tinyML devices. As an established
neural network compression technique, quantization reduces the hardware
computational and memory resources. In particular, fixed-point quantization is
desirable to ease the computations using lightweight blocks, like adders and
multipliers, of the underlying hardware. However, deploying fully-quantized
Transformers on existing general-purpose hardware, generic AI accelerators, or
specialized architectures for Transformers with floating-point units might be
infeasible and/or inefficient.
To address this, we propose SwiftTron, an efficient specialized hardware
accelerator designed for Quantized Transformers. SwiftTron supports the
execution of different types of Transformers' operations (like Attention,
Softmax, GELU, and Layer Normalization) and accounts for diverse scaling
factors to perform correct computations. We synthesize the complete SwiftTron
architecture in a 65 nm CMOS technology using the ASIC design flow. Our
accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64
mW of power and occupying an area of 273 mm^2. To ease reproducibility, the
RTL of our SwiftTron architecture is released at
https://github.com/albertomarchisio/SwiftTron.
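The "diverse scaling factors" mentioned in the abstract are the crux of integer-only inference: every tensor carries its own fixed-point scale, and each matmul's wide accumulator has to be rescaled before it feeds the next operator (Attention, Softmax, GELU, or Layer Normalization). The following sketch is not the SwiftTron RTL; it is a minimal software model of that arithmetic, and the helper names (quantize, int_linear) and the per-tensor symmetric scheme are assumptions made here for illustration.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric fixed-point quantization: integer codes plus one scale factor."""
    scale = np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q.astype(np.int32), scale

def int_linear(q_x, s_x, q_w, s_w, s_out, n_bits=8):
    """Integer matmul followed by requantization: the int32 accumulator is
    rescaled by (s_x * s_w / s_out), the scaling-factor bookkeeping an
    integer-only accelerator must track between layers."""
    acc = q_x @ q_w  # wide integer accumulation (adders and multipliers only)
    requant = np.round(acc * (s_x * s_w / s_out))
    return np.clip(requant, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1).astype(np.int32)

# Toy usage: one quantized projection, as in an attention head's Q/K/V matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 16)).astype(np.float32)
q_x, s_x = quantize(x)
q_w, s_w = quantize(w)
s_out = np.max(np.abs(x @ w)) / 127.0   # output scale; calibrated offline in practice
y_q = int_linear(q_x, s_x, q_w, s_w, s_out)
print(np.max(np.abs(y_q * s_out - x @ w)))  # small error of the fixed-point path
```

In hardware, the floating-point rescaling step is itself typically replaced by an integer multiplier plus a right shift (a dyadic approximation of the combined scale), which keeps the whole datapath in fixed point.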
Related papers
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive on parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z)
- SparseSwin: Swin Transformer with Sparse Transformer Block [1.7243216387069678]
This paper aims to reduce the number of parameters and, in turn, make the transformer more efficient.
We present the Sparse Transformer (SparTa) Block, a modified transformer block with the addition of a sparse token converter.
The proposed SparseSwin model outperforms other state-of-the-art models in image classification, with accuracies of 86.96%, 97.43%, and 85.35% on the ImageNet100, CIFAR10, and CIFAR100 datasets, respectively.
arXiv Detail & Related papers (2023-09-11T04:03:43Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
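The N:M pattern referenced in the entry above constrains every group of M consecutive weights to contain at most N non-zeros, which is what lets an architecture like STA skip zero operands in a structured, hardware-friendly way. The snippet below is a generic magnitude-based illustration of that constraint, not the paper's IDP training scheme; the helper name prune_n_m is introduced here for illustration.

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Magnitude-based N:M pruning: in every contiguous group of m weights along
    the last axis, keep the n largest-magnitude entries and zero out the rest.
    Assumes the last dimension is divisible by m."""
    w = weights.reshape(-1, m)                       # one row per group of m weights
    keep = np.argsort(np.abs(w), axis=1)[:, m - n:]  # indices of the n largest magnitudes
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (w * mask).reshape(weights.shape)

w = np.random.default_rng(0).standard_normal((8, 16))
w_sparse = prune_n_m(w)                              # 2:4 sparsity, i.e. 50% zeros
print((w_sparse.reshape(-1, 4) != 0).sum(axis=1))    # exactly 2 non-zeros per group
```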
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Transformer Acceleration with Dynamic Sparse Attention [20.758709319088865]
We propose the Dynamic Sparse Attention (DSA) that can efficiently exploit the dynamic sparsity in the attention of Transformers.
Our approach can achieve better trade-offs between accuracy and model complexity.
arXiv Detail & Related papers (2021-10-21T17:31:57Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
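The entry above rests on a standard linear-algebra fact: a Toeplitz matrix (one constant per diagonal, which is exactly the structure a relative-position bias produces) can be embedded in a circulant matrix, so a matrix-vector product costs O(n log n) via the FFT instead of O(n^2). The sketch below shows only that generic Toeplitz-times-vector trick; it is not the paper's kernelized-attention code, and the function name toeplitz_matvec_fft is introduced here for illustration.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply a Toeplitz matrix by a vector in O(n log n) via circulant embedding.
    c: first column (length n), r: first row (length n, with r[0] == c[0]), x: length n."""
    n = len(c)
    # First column of a 2n x 2n circulant matrix that contains T in its top-left block.
    circ_col = np.concatenate([c, [0.0], r[1:][::-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(circ_col) * np.fft.fft(x_pad))  # circular convolution
    return y[:n].real

# Check against the explicit O(n^2) product.
n = 6
rng = np.random.default_rng(0)
c = rng.standard_normal(n)                                # first column of the bias matrix (offsets i - j >= 0)
r = np.concatenate([[c[0]], rng.standard_normal(n - 1)])  # first row (offsets i - j <= 0), sharing the diagonal entry
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = rng.standard_normal(n)
print(np.allclose(T @ x, toeplitz_matvec_fft(c, r, x)))   # True
```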