ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and
Accelerator Co-Design
- URL: http://arxiv.org/abs/2210.09573v1
- Date: Tue, 18 Oct 2022 04:07:23 GMT
- Title: ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and
Accelerator Co-Design
- Authors: Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan
Zhang, Chaojian Li, Baopu Li, Yingyan Lin
- Abstract summary: Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
- Score: 42.46121663652989
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance on
various vision tasks. However, ViTs' self-attention module is still arguably a
major bottleneck, limiting their achievable hardware efficiency. Meanwhile,
existing accelerators dedicated to NLP Transformers are not optimal for ViTs.
This is because there is a large difference between ViTs and NLP Transformers:
ViTs have a relatively fixed number of input tokens, whose attention maps can
be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers
need to handle input sequences of varying numbers of tokens and rely on
on-the-fly predictions of dynamic sparse attention patterns for each input to
achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated
algorithm and accelerator co-design framework dubbed ViTCoD for accelerating
ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the
attention maps to have either denser or sparser fixed patterns for regularizing
two levels of workloads without hurting the accuracy, largely reducing the
attention computations while leaving room for alleviating the remaining
dominant data movements; on top of that, we further integrate a lightweight and
learnable auto-encoder module to enable trading the dominant high-cost data
movements for lower-cost computations. On the hardware level, we develop a
dedicated accelerator to simultaneously coordinate the enforced denser/sparser
workloads and encoder/decoder engines for boosted hardware utilization.
Extensive experiments and ablation studies validate that ViTCoD largely reduces
the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x,
86.0x, 10.1x, and 6.8x over general computing platforms (CPUs, EdgeGPUs, and GPUs)
and prior-art Transformer accelerators (SpAtten and Sanger), respectively, under an
attention sparsity of 90%.
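
To make the algorithm-level idea concrete, the following is a minimal NumPy sketch, not the authors' implementation, of how an attention map averaged over a calibration set could be pruned to a fixed ~90%-sparse pattern and then polarized into a denser and a sparser workload; the function name, default ratios, and the row-grouping heuristic are illustrative assumptions.

```python
import numpy as np

def polarize_attention_mask(avg_attn, keep_ratio=0.1, dense_token_ratio=0.2):
    """Derive a fixed sparse attention mask and split it into two workloads.

    avg_attn:          (N, N) attention map averaged over a calibration set.
    keep_ratio:        fraction of entries kept, e.g. 0.1 for 90% sparsity.
    dense_token_ratio: fraction of query rows routed to the "denser" engine.
    """
    n = avg_attn.shape[0]

    # Keep only the globally strongest entries; because ViTs process a fixed
    # number of tokens, this mask can be computed offline and reused per input.
    k = max(1, int(keep_ratio * n * n))
    thresh = np.partition(avg_attn.ravel(), -k)[-k]
    mask = avg_attn >= thresh

    # Polarize: rows with many surviving entries form the regular, denser
    # workload; the remaining rows form the irregular, sparser workload.
    kept_per_row = mask.sum(axis=1)
    n_dense = max(1, int(dense_token_ratio * n))
    dense_rows = np.argsort(-kept_per_row)[:n_dense]

    dense_mask = np.zeros_like(mask)
    dense_mask[dense_rows] = mask[dense_rows]
    sparse_mask = mask & ~dense_mask
    return dense_mask, sparse_mask
```

Under this reading, the two masks correspond to the two regularized workloads that the paper's accelerator schedules onto its denser and sparser engines; the auto-encoder step that trades data movement for computation is omitted here.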
Related papers
- An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT [5.141764719319689]
We propose an FPGA-based accelerator for EfficientViT to advance the hardware efficiency frontier of ViTs.
Specifically, we design a reconfigurable architecture to efficiently support various operation types, including lightweight convolutions and attention.
Experimental results show that our accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200 MHz.
arXiv Detail & Related papers (2024-03-29T15:20:33Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
- ViTA: A Vision Transformer Inference Accelerator for Edge Applications [4.3469216446051995]
Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks.
They are compute-heavy and difficult to deploy in resource-constrained edge devices.
We propose ViTA - a hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices.
arXiv Detail & Related papers (2023-02-17T19:35:36Z)
- HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers [35.92244135055901]
HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%~8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4% computation reduction for various widely used ViTs.
arXiv Detail & Related papers (2022-11-15T13:00:43Z)
- ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention [23.874485033096917]
Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications.
We propose a first-of-its-kind algorithm-hardware co-designed framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs.
ViTALiTy unifies both low-rank and sparse components of the attention in ViTs.
arXiv Detail & Related papers (2022-11-09T18:58:21Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers [11.811907838840712]
We propose an algorithm-hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns.
We present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers.
Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency.
arXiv Detail & Related papers (2022-08-12T04:51:49Z)
- Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization [78.18328503396057]
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks.
This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization.
arXiv Detail & Related papers (2022-08-10T05:54:46Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT), as sketched below.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
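
As an illustration of the Toeplitz observation in the last entry above, here is a small NumPy sketch, not the paper's kernelized-attention code, showing how multiplication by a Toeplitz matrix built from relative-position coefficients reduces to FFTs; the array layout and function name are assumptions, and real-valued inputs are assumed.

```python
import numpy as np

def toeplitz_matvec_fft(r, v):
    """Multiply a Toeplitz matrix T, with T[i, j] = r[i - j], by a vector v.

    r: relative-position coefficients for offsets -(n-1) .. n-1, stored as an
       array of length 2n - 1 with offset k located at index k + n - 1.
    v: real-valued vector of length n.
    Costs O(n log n) via FFT instead of the O(n^2) dense product.
    """
    n = v.shape[0]
    assert r.shape[0] == 2 * n - 1

    # Embed T in a circulant matrix whose first column is
    # [r_0, r_1, ..., r_{n-1}, r_{-(n-1)}, ..., r_{-1}].
    c = np.concatenate([r[n - 1:], r[:n - 1]])
    v_padded = np.concatenate([v, np.zeros(n - 1)])

    # Circulant matrix-vector product == circular convolution == pointwise
    # product in the Fourier domain.
    out = np.fft.ifft(np.fft.fft(c) * np.fft.fft(v_padded)).real
    return out[:n]
```

Applied column by column to the (kernelized) value matrix, this replaces an O(n^2) product with an O(n log n) one, which is where the claimed acceleration comes from.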