Energon: Towards Efficient Acceleration of Transformers Using Dynamic
Sparse Attention
- URL: http://arxiv.org/abs/2110.09310v1
- Date: Mon, 18 Oct 2021 13:42:43 GMT
- Title: Energon: Towards Efficient Acceleration of Transformers Using Dynamic
Sparse Attention
- Authors: Zhe Zhou and Junlin Liu and Zhenyu Gu and Guangyu Sun
- Abstract summary: Transformer models have revolutionized Natural Language Processing (NLP) and also show promising performance on Computer Vision (CV) tasks.
We propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention.
We demonstrate that Energon achieves $161\times$ and $8.4\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction compared with CPU and GPU baselines.
- Score: 5.495006023171481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, transformer models have revolutionized Natural Language
Processing (NLP) and also show promising performance on Computer Vision (CV)
tasks. Despite their effectiveness, transformers' attention operations are hard
to accelerate due to complicated data movement and quadratic computational
complexity, prohibiting the real-time inference on resource-constrained
edge-computing platforms.
To tackle this challenge, we propose Energon, an algorithm-architecture
co-design approach that accelerates various transformers using dynamic sparse
attention. With the observation that attention results only depend on a few
important query-key pairs, we propose a multi-round filtering algorithm to
dynamically identify such pairs at runtime. We adopt low bitwidth in each
filtering round and only use high-precision tensors in the attention stage to
reduce overall complexity. By this means, we significantly mitigate the
computational cost with negligible accuracy loss. To enable such an algorithm
with lower latency and better energy-efficiency, we also propose an Energon
co-processor architecture. Elaborated pipelines and specialized optimizations
jointly boost the performance and reduce power consumption. Extensive
experiments on both NLP and CV benchmarks demonstrate that Energon achieves
$161\times$ and $8.4\times$ geo-mean speedup and up to $10^4\times$ and
$10^3\times$ energy reduction compared with Intel Xeon 5220 CPU and NVIDIA V100
GPU. Compared to state-of-the-art attention accelerators SpAtten and $A^3$,
Energon also achieves $1.7\times, 1.25\times$ speedup and $1.6 \times,
1.5\times $ higher energy efficiency.
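The multi-round filtering idea can be illustrated with a short sketch: candidate keys are scored at low precision over successive rounds, and only the surviving query-key pairs reach the full-precision attention stage. The code below is a minimal NumPy illustration of that concept, not the authors' implementation or the Energon co-processor pipeline; the bitwidths, keep ratios, and function names are assumptions chosen for clarity.

```python
import numpy as np

def fake_quant(x, bits):
    """Uniform symmetric quantization, used here only to mimic low-bitwidth scoring."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def filtered_attention(q, k, v, rounds=((4, 0.5), (8, 0.25))):
    """q: (d,), k and v: (n, d). Each (bits, keep_ratio) round discards unimportant keys."""
    n, d = k.shape
    candidates = np.arange(n)
    for bits, keep_ratio in rounds:
        # low-bitwidth scores are cheap and are only used to rank candidates
        scores = fake_quant(k[candidates], bits) @ fake_quant(q, bits)
        keep = max(1, int(len(candidates) * keep_ratio))
        candidates = candidates[np.argsort(scores)[-keep:]]
    # full-precision softmax attention over the few surviving query-key pairs
    scores = k[candidates] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v[candidates]

# usage: one query attending over 128 keys of dimension 64
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=64), rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
out = filtered_attention(q, k, v)  # approximates full attention with far fewer high-precision ops
```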
Related papers
- P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer [8.22044535304182]
Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive.
To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors.
We propose P$^2$-ViT, the first Power-of-Two (PoT) post-training quantization and acceleration framework.
arXiv Detail & Related papers (2024-05-30T10:26:36Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with
Transformers [6.0093441900032465]
Self-attention-based transformer models have achieved tremendous success in the domain of natural language processing.
Previous works directly operate on large matrices involved in the attention operation, which limits hardware utilization.
We propose a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead.
arXiv Detail & Related papers (2023-02-28T16:17:23Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and
Accelerator Co-Design [42.46121663652989]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
arXiv Detail & Related papers (2022-10-18T04:07:23Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with the communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse
Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency (a minimal N:M pruning example is sketched after this list).
arXiv Detail & Related papers (2022-08-12T04:51:49Z) - A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA
Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design incurs very small accuracy loss and achieves $80.2\times$ and $2.6\times$ speedup compared to CPU and GPU implementations.
arXiv Detail & Related papers (2022-08-07T05:48:38Z) - Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a Toeplitz matrix-vector product via FFT is sketched after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
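For the N:M sparsity pattern mentioned in the algorithm-hardware co-optimized framework entry above, the sketch below shows plain magnitude-based N:M pruning: in every group of M consecutive weights, only the N largest-magnitude entries are kept. It is a generic illustration of that pattern, not the paper's IDP training scheme or the STA architecture; the 2:4 setting and names are assumptions.

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights along the last axis and zero the rest (the N:M sparsity pattern)."""
    assert w.shape[-1] % m == 0, "last dimension must be a multiple of m"
    groups = w.reshape(-1, m)
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]  # smallest-magnitude indices
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

# usage: 2:4 sparsity keeps at most 2 non-zeros in every group of 4 weights
w = np.random.default_rng(0).normal(size=(8, 16))
w_sparse = nm_prune(w, n=2, m=4)
assert ((w_sparse.reshape(-1, 4) != 0).sum(axis=1) <= 2).all()
```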
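The Toeplitz observation in the kernelized-attention entry above also admits a compact worked example: a relative-positional bias of the form b[i, j] = b[i - j] defines a Toeplitz matrix, so its product with a length-n vector costs O(n log n) via circulant embedding and the FFT instead of O(n^2). The code below is a generic Toeplitz matrix-vector product in NumPy, not the paper's kernelized-attention implementation; the function name and sizes are illustrative.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Compute T @ x in O(n log n), where T is Toeplitz with the given first
    column and first row (first_col[0] == first_row[0])."""
    n = len(x)
    # first column of the 2n x 2n circulant matrix that embeds T
    circ = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# check against the dense O(n^2) product
n = 8
rng = np.random.default_rng(0)
c, r, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
r[0] = c[0]  # Toeplitz consistency: first row and first column share the corner entry
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec(c, r, x))
```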