P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer
- URL: http://arxiv.org/abs/2405.19915v1
- Date: Thu, 30 May 2024 10:26:36 GMT
- Title: P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer
- Authors: Huihong Shi, Xin Cheng, Wendong Mao, Zhongfeng Wang,
- Abstract summary: Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive.
To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors.
We propose emphP$2$-ViT, the first underlinePower-of-Two (PoT) underlinepost-training quantization and acceleration framework.
- Score: 8.22044535304182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield non-negligible re-quantization overhead, limiting ViTs' hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose \emph{P$^2$-ViT}, the first \underline{P}ower-of-Two (PoT) \underline{p}ost-training quantization and acceleration framework to accelerate fully quantized ViTs. Specifically, {as for quantization,} we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the re-quantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency trade-offs. {In terms of hardware,} we develop {a dedicated chunk-based accelerator} featuring multiple tailored sub-processors to individually handle ViTs' different types of operations, alleviating reconfigurable overhead. Additionally, we design {a tailored row-stationary dataflow} to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P$^2$-ViT's effectiveness. {Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared to the counterpart with floating-point scaling factors. Besides, we achieve up to $\mathbf{10.1\times}$ speedup and $\mathbf{36.8\times}$ energy saving over GPU's Turing Tensor Cores, and up to $\mathbf{1.84\times}$ higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at \url{https://github.com/shihuihong214/P2-ViT}.
Related papers
- M$^2$-ViT: Accelerating Hybrid Vision Transformers with Two-Level Mixed Quantization [3.9784270129141377]
We present M$2$-ViT to accelerate Convolution-Transformer hybrid efficient ViTs with two-level mixed quantization.
Specifically, we introduce a hardware-friendly two-level mixed quantization (M$2$Q) strategy, characterized by both mixed quantization precision and mixed quantization schemes.
arXiv Detail & Related papers (2024-10-10T11:16:57Z) - Accelerating PoT Quantization on Edge Devices [0.9558392439655012]
Non-uniform quantization, such as power-of-two (PoT) quantization, matches data distributions better than uniform quantization.
Existing pipelines for accelerating PoT-quantized Deep Neural Networks on edge devices are not open-source.
We propose PoTAcc, an open-source pipeline for end-to-end acceleration of PoT-quantized DNNs on resource-constrained edge devices.
arXiv Detail & Related papers (2024-09-30T15:33:47Z) - Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs)
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z) - AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm AdaLog (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z) - Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer [5.141764719319689]
Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks.
However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization.
We propose Trio-ViT, which eliminates the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly.
arXiv Detail & Related papers (2024-05-06T21:57:35Z) - I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of
Post-Training ViTs Quantization [63.07712842509526]
We introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion.
I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
arXiv Detail & Related papers (2023-11-16T13:07:47Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed $textbfShiftAddViT$, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
arXiv Detail & Related papers (2023-06-10T13:53:41Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE)
MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers [35.92244135055901]
HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%$sim$8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4%$sim computation reduction, for various widely used ViTs.
arXiv Detail & Related papers (2022-11-15T13:00:43Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT)
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.