HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers
- URL: http://arxiv.org/abs/2211.08110v1
- Date: Tue, 15 Nov 2022 13:00:43 GMT
- Title: HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers
- Authors: Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun
Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, Yanzhi Wang
- Abstract summary: HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%$\sim$8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4%$\sim$65.3% computation reduction for various widely used ViTs.
- Score: 35.92244135055901
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While vision transformers (ViTs) have continuously achieved new milestones in
the field of computer vision, their sophisticated network architectures with
high computation and memory costs have impeded their deployment on
resource-limited edge devices. In this paper, we propose a hardware-efficient
image-adaptive token pruning framework called HeatViT for efficient yet
accurate ViT acceleration on embedded FPGAs. By analyzing the inherent
computational patterns in ViTs, we first design an effective attention-based
multi-head token selector, which can be progressively inserted before
transformer blocks to dynamically identify and consolidate the non-informative
tokens from input images. Moreover, we implement the token selector on hardware
by adding miniature control logic to heavily reuse existing hardware components
built for the backbone ViT. To improve the hardware efficiency, we further
employ 8-bit fixed-point quantization, and propose polynomial approximations
with regularization effect on quantization error for the frequently used
nonlinear functions in ViTs. Finally, we propose a latency-aware multi-stage
training strategy to determine the transformer blocks for inserting token
selectors and optimize the desired (average) pruning rates for inserted token
selectors, in order to improve both the model accuracy and inference latency on
hardware. Compared to existing ViT pruning studies, under the similar
computation cost, HeatViT can achieve 0.7%$\sim$8.9% higher accuracy; while
under the similar model accuracy, HeatViT can achieve more than
28.4%$\sim$65.3% computation reduction, for various widely used ViTs, including
DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset.
Compared to the baseline hardware accelerator, our implementations of HeatViT
on the Xilinx ZCU102 FPGA achieve 3.46$\times$$\sim$4.89$\times$ speedup.
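To make the token-selector idea concrete, below is a minimal PyTorch-style sketch (not the authors' implementation): a scoring module that keeps the highest-scoring patch tokens and consolidates the pruned ones into a single package token before the next transformer block. The class name, single scoring head, and fixed keep rate are illustrative assumptions; HeatViT uses an attention-based multi-head design and learns the pruning rates.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Hypothetical sketch of an image-adaptive token selector.

    Scores patch tokens, keeps the top-k, and consolidates the rest into one
    "package" token so no information is discarded outright. Assumes
    keep_rate < 1 so at least one token is pruned.
    """

    def __init__(self, dim: int, keep_rate: float = 0.7):
        super().__init__()
        self.keep_rate = keep_rate
        # Small MLP scoring head (a single head, for brevity).
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + N, dim); index 0 is the class token, which is never pruned.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score(patches).squeeze(-1)              # (B, N)
        num_keep = max(1, int(self.keep_rate * scores.shape[1]))
        keep_idx = scores.topk(num_keep, dim=1).indices       # (B, K)

        batch_idx = torch.arange(x.shape[0], device=x.device).unsqueeze(1)
        kept = patches[batch_idx, keep_idx]                   # (B, K, dim)

        # Consolidate the pruned (non-informative) tokens into a single token,
        # weighted by their softmaxed importance scores.
        pruned = torch.ones_like(scores, dtype=torch.bool)
        pruned[batch_idx, keep_idx] = False
        pruned_scores = scores.masked_fill(~pruned, float("-inf"))
        weights = pruned_scores.softmax(dim=1).unsqueeze(-1)  # zero on kept tokens
        package = (weights * patches).sum(dim=1, keepdim=True)

        return torch.cat([cls_tok, kept, package], dim=1)
```

A selector like this would be inserted before chosen transformer blocks; the paper's latency-aware multi-stage training additionally decides which blocks receive selectors and what average pruning rate each one should target.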
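For the 8-bit fixed-point quantization mentioned in the abstract, a generic symmetric quantizer with a power-of-two scale looks roughly like the sketch below. The scale-selection rule and per-tensor granularity are common defaults assumed here, not necessarily HeatViT's exact calibration scheme.

```python
import numpy as np


def quantize_int8_fixed_point(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 8-bit fixed-point quantization with a power-of-two scale.

    Generic illustration of the idea; HeatViT's calibration may differ.
    """
    max_abs = float(np.max(np.abs(x)))
    # Smallest power-of-two scale whose int8 range [-128, 127] covers max_abs.
    exp = int(np.ceil(np.log2(max_abs / 127.0))) if max_abs > 0 else 0
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8_fixed_point(w)
    err = float(np.max(np.abs(dequantize(q, s) - w)))
    print(f"scale = {s:.6g}, max abs reconstruction error = {err:.6f}")
```

A power-of-two scale keeps rescaling on the FPGA down to bit shifts rather than multiplications, which is why fixed-point formats are preferred over floating-point scaling factors in this setting.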
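The polynomial approximation of ViT nonlinearities can be illustrated with a plain least-squares fit of GELU over the bounded range that 8-bit activations occupy. The degree, range, and fitting method below are assumptions for illustration only; the paper proposes specific approximations with a regularization effect on quantization error.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    # Standard tanh-based GELU, used here as the reference function.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


# Under 8-bit quantization, activations are clipped to a bounded range,
# so the polynomial only needs to be accurate on that range.
xs = np.linspace(-4.0, 4.0, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=6)   # illustrative degree
poly_gelu = np.poly1d(coeffs)

max_err = float(np.max(np.abs(poly_gelu(xs) - gelu(xs))))
print(f"degree-6 polynomial, max |error| on [-4, 4]: {max_err:.4f}")
```

Evaluating a fixed-degree polynomial needs only multiplies and adds, so it maps onto the same DSP resources as the backbone's matrix multiplications instead of requiring a dedicated erf or exponential unit.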
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer [8.22044535304182]
Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive.
To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors.
We propose P$^2$-ViT, the first Power-of-Two (PoT) post-training quantization and acceleration framework.
arXiv Detail & Related papers (2024-05-30T10:26:36Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design [42.46121663652989]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
arXiv Detail & Related papers (2022-10-18T04:07:23Z)
- Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization [78.18328503396057]
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks.
This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization.
arXiv Detail & Related papers (2022-08-10T05:54:46Z)
- VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs).
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z)
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flattened and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-12-27T20:15:25Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)