Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer
- URL: http://arxiv.org/abs/2405.03882v3
- Date: Mon, 30 Sep 2024 07:01:57 GMT
- Title: Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer
- Authors: Huihong Shi, Haikuo Shao, Wendong Mao, Zhongfeng Wang,
- Abstract summary: Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks.
However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization.
We propose Trio-ViT, which eliminates the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly.
- Score: 5.141764719319689
- License:
- Abstract: Motivated by the huge success of Transformers in the field of natural language processing (NLP), Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods, such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly {Softmax}, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to challenges associated with \textit{standard ViTs}, we focus our attention towards the quantization and acceleration for \textit{efficient ViTs}, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a {tailored post-training quantization engine} taking the unique activation distributions of Softmax-free efficient ViTs into full consideration, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently prove the effectiveness of our Trio-ViT framework. {Particularly, we can gain up to $\uparrow$$\mathbf{3.6}\times$, $\uparrow$$\mathbf{5.0}\times$, and $\uparrow$$\mathbf{7.3}\times$ FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as $\uparrow$$\mathbf{6.0}\times$, $\uparrow$$\mathbf{1.5}\times$, and $\uparrow$$\mathbf{2.1}\times$ DSP efficiency.} Codes are available at \url{https://github.com/shihuihong214/Trio-ViT}.
Related papers
- M$^2$-ViT: Accelerating Hybrid Vision Transformers with Two-Level Mixed Quantization [3.9784270129141377]
We present M$2$-ViT to accelerate Convolution-Transformer hybrid efficient ViTs with two-level mixed quantization.
Specifically, we introduce a hardware-friendly two-level mixed quantization (M$2$Q) strategy, characterized by both mixed quantization precision and mixed quantization schemes.
arXiv Detail & Related papers (2024-10-10T11:16:57Z) - P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer [8.22044535304182]
Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive.
To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors.
We propose emphP$2$-ViT, the first underlinePower-of-Two (PoT) underlinepost-training quantization and acceleration framework.
arXiv Detail & Related papers (2024-05-30T10:26:36Z) - CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and
Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z) - ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed $textbfShiftAddViT$, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
arXiv Detail & Related papers (2023-06-10T13:53:41Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE)
MoE achieves better accuracy and over 80% reduction computation but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision
Transformers [35.92244135055901]
HeatViT is an image-adaptive token pruning framework for vision transformers (ViTs) on embedded FPGAs.
HeatViT can achieve 0.7%$sim$8.9% higher accuracy compared to existing ViT pruning studies.
HeatViT can achieve more than 28.4%$sim computation reduction, for various widely used ViTs.
arXiv Detail & Related papers (2022-11-15T13:00:43Z) - ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and
Accelerator Co-Design [42.46121663652989]
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks.
However, ViTs' self-attention module is still arguably a major bottleneck.
We propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs.
arXiv Detail & Related papers (2022-10-18T04:07:23Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - HAT: Hardware-Aware Transformers for Efficient Natural Language
Processing [78.48577649266018]
Hardware-Aware Transformers (HAT) are designed to enable low-latency inference on resource-constrained hardware platforms.
We train a $textitSuperTransformer$ that covers all candidates in the design space, and efficiently produces many $textitSubTransformers$ with weight sharing.
Experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware.
arXiv Detail & Related papers (2020-05-28T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.