HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers
- URL: http://arxiv.org/abs/2211.08110v1
- Date: Tue, 15 Nov 2022 13:00:43 GMT
- Title: HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers
- Authors: Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun
Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, Yanzhi Wang
- Abstract summary: HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%$\sim$8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4%$\sim$65.3% computation reduction for various widely used ViTs.
- Score: 35.92244135055901
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While vision transformers (ViTs) have continuously achieved new milestones in
the field of computer vision, their sophisticated network architectures with
high computation and memory costs have impeded their deployment on
resource-limited edge devices. In this paper, we propose a hardware-efficient
image-adaptive token pruning framework called HeatViT for efficient yet
accurate ViT acceleration on embedded FPGAs. By analyzing the inherent
computational patterns in ViTs, we first design an effective attention-based
multi-head token selector, which can be progressively inserted before
transformer blocks to dynamically identify and consolidate the non-informative
tokens from input images. Moreover, we implement the token selector on hardware
by adding miniature control logic to heavily reuse existing hardware components
built for the backbone ViT. To improve the hardware efficiency, we further
employ 8-bit fixed-point quantization, and propose polynomial approximations
with regularization effect on quantization error for the frequently used
nonlinear functions in ViTs. Finally, we propose a latency-aware multi-stage
training strategy to determine the transformer blocks for inserting token
selectors and optimize the desired (average) pruning rates for inserted token
selectors, in order to improve both the model accuracy and inference latency on
hardware. Compared to existing ViT pruning studies, under the similar
computation cost, HeatViT can achieve 0.7%$\sim$8.9% higher accuracy; while
under the similar model accuracy, HeatViT can achieve more than
28.4%$\sim$65.3% computation reduction, for various widely used ViTs, including
DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset.
Compared to the baseline hardware accelerator, our implementations of HeatViT
on the Xilinx ZCU102 FPGA achieve 3.46$\times$$\sim$4.89$\times$ speedup.
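To make the token-selector idea concrete, below is a minimal PyTorch-style sketch (not the authors' implementation): a scoring module that keeps the highest-scoring patch tokens and consolidates the pruned ones into a single package token before the next transformer block. The class name, single scoring head, and fixed keep rate are illustrative assumptions; HeatViT uses an attention-based multi-head design and learns the pruning rates.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Hypothetical sketch of an image-adaptive token selector.

    Scores patch tokens, keeps the top-k, and consolidates the rest into one
    "package" token so no information is discarded outright. Assumes
    keep_rate < 1 so at least one token is pruned.
    """

    def __init__(self, dim: int, keep_rate: float = 0.7):
        super().__init__()
        self.keep_rate = keep_rate
        # Small MLP scoring head (a single head, for brevity).
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + N, dim); index 0 is the class token, which is never pruned.
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score(patches).squeeze(-1)              # (B, N)
        num_keep = max(1, int(self.keep_rate * scores.shape[1]))
        keep_idx = scores.topk(num_keep, dim=1).indices       # (B, K)

        batch_idx = torch.arange(x.shape[0], device=x.device).unsqueeze(1)
        kept = patches[batch_idx, keep_idx]                   # (B, K, dim)

        # Consolidate the pruned (non-informative) tokens into a single token,
        # weighted by their softmaxed importance scores.
        pruned = torch.ones_like(scores, dtype=torch.bool)
        pruned[batch_idx, keep_idx] = False
        pruned_scores = scores.masked_fill(~pruned, float("-inf"))
        weights = pruned_scores.softmax(dim=1).unsqueeze(-1)  # zero on kept tokens
        package = (weights * patches).sum(dim=1, keepdim=True)

        return torch.cat([cls_tok, kept, package], dim=1)
```

A selector like this would be inserted before chosen transformer blocks; the paper's latency-aware multi-stage training additionally decides which blocks receive selectors and what average pruning rate each one should target.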
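For the 8-bit fixed-point quantization mentioned in the abstract, a generic symmetric quantizer with a power-of-two scale looks roughly like the sketch below. The scale-selection rule and per-tensor granularity are common defaults assumed here, not necessarily HeatViT's exact calibration scheme.

```python
import numpy as np


def quantize_int8_fixed_point(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 8-bit fixed-point quantization with a power-of-two scale.

    Generic illustration of the idea; HeatViT's calibration may differ.
    """
    max_abs = float(np.max(np.abs(x)))
    # Smallest power-of-two scale whose int8 range [-128, 127] covers max_abs.
    exp = int(np.ceil(np.log2(max_abs / 127.0))) if max_abs > 0 else 0
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8_fixed_point(w)
    err = float(np.max(np.abs(dequantize(q, s) - w)))
    print(f"scale = {s:.6g}, max abs reconstruction error = {err:.6f}")
```

A power-of-two scale keeps rescaling on the FPGA down to bit shifts rather than multiplications, which is why fixed-point formats are preferred over floating-point scaling factors in this setting.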
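The polynomial approximation of ViT nonlinearities can be illustrated with a plain least-squares fit of GELU over the bounded range that 8-bit activations occupy. The degree, range, and fitting method below are assumptions for illustration only; the paper proposes specific approximations with a regularization effect on quantization error.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    # Standard tanh-based GELU, used here as the reference function.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


# Under 8-bit quantization, activations are clipped to a bounded range,
# so the polynomial only needs to be accurate on that range.
xs = np.linspace(-4.0, 4.0, 2001)
coeffs = np.polyfit(xs, gelu(xs), deg=6)   # illustrative degree
poly_gelu = np.poly1d(coeffs)

max_err = float(np.max(np.abs(poly_gelu(xs) - gelu(xs))))
print(f"degree-6 polynomial, max |error| on [-4, 4]: {max_err:.4f}")
```

Evaluating a fixed-degree polynomial needs only multiplies and adds, so it maps onto the same DSP resources as the backbone's matrix multiplications instead of requiring a dedicated erf or exponential unit.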
Related papers
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer [8.22044535304182]
Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive.
To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors.
We propose P$^2$-ViT, the first Power-of-Two (PoT) post-training quantization and acceleration framework.
arXiv Detail & Related papers (2024-05-30T10:26:36Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design [42.46121663652989]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
arXiv Detail & Related papers (2022-10-18T04:07:23Z)
- Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization [78.18328503396057]
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks.
This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization.
arXiv Detail & Related papers (2022-08-10T05:54:46Z)
- VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs).
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z)
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flattened and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-12-27T20:15:25Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)