An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse
Transformers
- URL: http://arxiv.org/abs/2208.06118v1
- Date: Fri, 12 Aug 2022 04:51:49 GMT
- Title: An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse
Transformers
- Authors: Chao Fang, Aojun Zhou, Zhongfeng Wang
- Abstract summary: We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% over other methods, with high training efficiency.
- Score: 11.811907838840712
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Transformer has been an indispensable staple in deep learning.
However, for real-life applications, it is very challenging to deploy
efficient Transformers due to the immense number of parameters and operations
in these models. To relieve this burden, exploiting sparsity is an effective
approach to accelerating Transformers. The newly emerging Ampere GPUs leverage
a 2:4 sparsity pattern to achieve model acceleration, but this fixed pattern
can hardly meet the diverse algorithm and hardware constraints encountered
when deploying models. By contrast, we propose an algorithm-hardware
co-optimized framework to flexibly and efficiently accelerate Transformers by
utilizing general N:M sparsity patterns. (1) From the algorithm perspective,
we propose a sparsity inheritance mechanism along with an inherited dynamic
pruning (IDP) method to rapidly obtain a series of N:M sparse candidate
Transformers. A model compression scheme is further proposed to significantly
reduce the storage requirement for deployment. (2) From the hardware
perspective, we present a flexible and efficient hardware architecture, namely
STA, to achieve significant speedup when deploying N:M sparse Transformers.
STA features not only a computing engine that unifies both sparse-dense and
dense-dense matrix multiplications with high computational efficiency, but
also a scalable softmax module that eliminates the latency of intermediate
off-chip data communication. Experimental results show that, compared to other
methods, N:M sparse Transformers generated using IDP achieve an average
accuracy improvement of 6.7% with high training efficiency. Moreover, STA
achieves 14.47x and 11.33x speedup over an Intel i9-9900X CPU and an NVIDIA
RTX 2080 Ti GPU, respectively, and performs inference 2.00-19.47x faster than
state-of-the-art FPGA-based Transformer accelerators.
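To make the N:M pattern concrete, here is a minimal sketch of magnitude-based
N:M structured pruning, i.e. the generic sparsity pattern the framework
targets (not the paper's IDP procedure); the function and variable names are
hypothetical.

```python
# Minimal sketch: magnitude-based N:M structured pruning of a dense weight
# matrix. Illustrative only; not the paper's IDP algorithm.
import numpy as np

def nm_sparse_mask(weight: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights along the last dimension (2:4 is the fixed case Ampere supports)."""
    rows, cols = weight.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = np.abs(weight).reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    prune_idx = np.argsort(groups, axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=weight.dtype)
    np.put_along_axis(mask, prune_idx, 0.0, axis=-1)
    return mask.reshape(rows, cols)

# Example: prune a toy weight matrix to 2:4 sparsity.
w = np.random.randn(4, 8).astype(np.float32)
w_sparse = w * nm_sparse_mask(w, n=2, m=4)  # exactly 2 nonzeros per group of 4
```

Because every group of m weights keeps exactly n nonzeros, only those n values
and their in-group positions need to be stored, which is the kind of regular
structure that compact N:M storage formats and sparse-dense compute engines
such as STA are built around.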
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
- Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment [3.391499691517567]
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices.
We propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization.
Experimental results show our co-design achieves up to 2.14-49.37x throughput gains and 3.72-88.53x better energy efficiency over state-of-the-art Transformer accelerators.
arXiv Detail & Related papers (2024-07-16T12:36:10Z)
- Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures [5.46396577345121]
The complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption.
We propose a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, which effectively minimizes off-chip data access.
Our approach can achieve up to a 2.8x speed increase when executing inferences employing state-of-the-art transformers.
arXiv Detail & Related papers (2023-12-20T13:01:25Z)
- ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers [13.177523799771635]
Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks.
The efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies.
We propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems.
arXiv Detail & Related papers (2023-07-07T10:05:38Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design has very small accuracy loss and achieves 80.2x and 2.6x speedup compared to CPU and GPU implementations, respectively.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We present a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); see the Toeplitz/FFT sketch after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- FTRANS: Energy-Efficient Acceleration of Transformers using FPGA [11.032972017827248]
We propose an efficient acceleration framework, Ftrans, for transformer-based large scale language representations.
Our framework significantly reduces the model size of NLP models by up to 16 times.
Our FPGA design achieves 27.07x and 81x improvement in performance and energy efficiency compared to CPU, and up to 8.80x improvement in energy efficiency compared to GPU.
arXiv Detail & Related papers (2020-07-16T18:58:31Z)
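As referenced in the kernelized attention with RPE entry above, the key
observation is that relative positional encoding forms a Toeplitz matrix,
which can be applied to a vector in O(n log n) time via the FFT. The sketch
below illustrates the standard circulant-embedding trick behind that claim;
it is an independent illustration, not code from that paper.

```python
# Minimal sketch: multiply a Toeplitz matrix by a vector in O(n log n)
# by embedding it in a circulant matrix and using the FFT.
import numpy as np
from scipy.linalg import toeplitz  # only used to verify against a dense matrix

def toeplitz_matvec_fft(first_col: np.ndarray, first_row: np.ndarray,
                        x: np.ndarray) -> np.ndarray:
    """Compute T @ x, where T is the Toeplitz matrix with the given first
    column and first row (first_col[0] == first_row[0])."""
    n = len(x)
    # First column of a 2n x 2n circulant matrix whose top-left n x n block
    # is T: [first_col, 0, reversed tail of first_row].
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    x_padded = np.concatenate([x, np.zeros(n)])
    # A circulant matvec is a circular convolution, i.e. a pointwise product
    # in the Fourier domain.
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_padded))
    return y[:n].real

# Verify against an explicit dense Toeplitz matrix.
n = 8
col, row = np.random.randn(n), np.random.randn(n)
row[0] = col[0]
x = np.random.randn(n)
print(np.allclose(toeplitz(col, row) @ x, toeplitz_matvec_fft(col, row, x)))
```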