SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers
- URL: http://arxiv.org/abs/2304.03986v2
- Date: Tue, 25 Apr 2023 10:29:58 GMT
- Title: SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers
- Authors: Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, and Muhammad Shafique
- Abstract summary: Quantized Transformers' compute-intensive operations pose enormous challenges for their deployment in resource-constrained EdgeAI / tinyML devices.
We propose SwiftTron, an efficient specialized hardware accelerator designed for Quantized Transformers.
Our accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64 mW of power and occupying an area of 273 mm^2.
- Score: 11.631442682756203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers' compute-intensive operations pose enormous challenges for their
deployment in resource-constrained EdgeAI / tinyML devices. As an established
neural network compression technique, quantization reduces the hardware
computational and memory resources. In particular, fixed-point quantization is
desirable to ease the computations using lightweight blocks, like adders and
multipliers, of the underlying hardware. However, deploying fully-quantized
Transformers on existing general-purpose hardware, generic AI accelerators, or
specialized architectures for Transformers with floating-point units might be
infeasible and/or inefficient.
To address this, we propose SwiftTron, an efficient specialized hardware
accelerator designed for Quantized Transformers. SwiftTron supports the
execution of different types of Transformers' operations (like Attention,
Softmax, GELU, and Layer Normalization) and accounts for diverse scaling
factors to perform correct computations. We synthesize the complete SwiftTron
architecture in a 65 nm CMOS technology using the ASIC design flow. Our
accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64
mW of power and occupying an area of 273 mm^2. To ease reproducibility, the
RTL of our SwiftTron architecture is released at
https://github.com/albertomarchisio/SwiftTron.
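The "diverse scaling factors" mentioned in the abstract are the crux of integer-only inference: every tensor carries its own fixed-point scale, and each matmul's wide accumulator has to be rescaled before it feeds the next operator (Attention, Softmax, GELU, or Layer Normalization). The following sketch is not the SwiftTron RTL; it is a minimal software model of that arithmetic, and the helper names (quantize, int_linear) and the per-tensor symmetric scheme are assumptions made here for illustration.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric fixed-point quantization: integer codes plus one scale factor."""
    scale = np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q.astype(np.int32), scale

def int_linear(q_x, s_x, q_w, s_w, s_out, n_bits=8):
    """Integer matmul followed by requantization: the int32 accumulator is
    rescaled by (s_x * s_w / s_out), the scaling-factor bookkeeping an
    integer-only accelerator must track between layers."""
    acc = q_x @ q_w  # wide integer accumulation (adders and multipliers only)
    requant = np.round(acc * (s_x * s_w / s_out))
    return np.clip(requant, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1).astype(np.int32)

# Toy usage: one quantized projection, as in an attention head's Q/K/V matmul.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 16)).astype(np.float32)
q_x, s_x = quantize(x)
q_w, s_w = quantize(w)
s_out = np.max(np.abs(x @ w)) / 127.0   # output scale; calibrated offline in practice
y_q = int_linear(q_x, s_x, q_w, s_w, s_out)
print(np.max(np.abs(y_q * s_out - x @ w)))  # small error of the fixed-point path
```

In hardware, the floating-point rescaling step is itself typically replaced by an integer multiplier plus a right shift (a dyadic approximation of the combined scale), which keeps the whole datapath in fixed point.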
Related papers
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive on parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z)
- SparseSwin: Swin Transformer with Sparse Transformer Block [1.7243216387069678]
This paper aims to reduce the number of parameters and, in turn, make the transformer more efficient.
We present the Sparse Transformer (SparTa) Block, a modified transformer block with the addition of a sparse token converter.
The proposed SparseSwin model outperforms other state-of-the-art models in image classification, with accuracies of 86.96%, 97.43%, and 85.35% on the ImageNet100, CIFAR10, and CIFAR100 datasets, respectively.
arXiv Detail & Related papers (2023-09-11T04:03:43Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
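The N:M pattern referenced in the entry above constrains every group of M consecutive weights to contain at most N non-zeros, which is what lets an architecture like STA skip zero operands in a structured, hardware-friendly way. The snippet below is a generic magnitude-based illustration of that constraint, not the paper's IDP training scheme; the helper name prune_n_m is introduced here for illustration.

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Magnitude-based N:M pruning: in every contiguous group of m weights along
    the last axis, keep the n largest-magnitude entries and zero out the rest.
    Assumes the last dimension is divisible by m."""
    w = weights.reshape(-1, m)                       # one row per group of m weights
    keep = np.argsort(np.abs(w), axis=1)[:, m - n:]  # indices of the n largest magnitudes
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (w * mask).reshape(weights.shape)

w = np.random.default_rng(0).standard_normal((8, 16))
w_sparse = prune_n_m(w)                              # 2:4 sparsity, i.e. 50% zeros
print((w_sparse.reshape(-1, 4) != 0).sum(axis=1))    # exactly 2 non-zeros per group
```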
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- Transformer Acceleration with Dynamic Sparse Attention [20.758709319088865]
We propose the Dynamic Sparse Attention (DSA) that can efficiently exploit the dynamic sparsity in the attention of Transformers.
Our approach can achieve better trade-offs between accuracy and model complexity.
arXiv Detail & Related papers (2021-10-21T17:31:57Z)
- On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
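The entry above rests on a standard linear-algebra fact: a Toeplitz matrix (one constant per diagonal, which is exactly the structure a relative-position bias produces) can be embedded in a circulant matrix, so a matrix-vector product costs O(n log n) via the FFT instead of O(n^2). The sketch below shows only that generic Toeplitz-times-vector trick; it is not the paper's kernelized-attention code, and the function name toeplitz_matvec_fft is introduced here for illustration.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply a Toeplitz matrix by a vector in O(n log n) via circulant embedding.
    c: first column (length n), r: first row (length n, with r[0] == c[0]), x: length n."""
    n = len(c)
    # First column of a 2n x 2n circulant matrix that contains T in its top-left block.
    circ_col = np.concatenate([c, [0.0], r[1:][::-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(circ_col) * np.fft.fft(x_pad))  # circular convolution
    return y[:n].real

# Check against the explicit O(n^2) product.
n = 6
rng = np.random.default_rng(0)
c = rng.standard_normal(n)                                # first column of the bias matrix (offsets i - j >= 0)
r = np.concatenate([[c[0]], rng.standard_normal(n - 1)])  # first row (offsets i - j <= 0), sharing the diagonal entry
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
x = rng.standard_normal(n)
print(np.allclose(T @ x, toeplitz_matvec_fft(c, r, x)))   # True
```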