I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
- URL: http://arxiv.org/abs/2207.01405v4
- Date: Mon, 7 Aug 2023 03:11:49 GMT
- Title: I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
- Authors: Zhikai Li and Qingyi Gu
- Abstract summary: Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications.
These models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging.
We propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting.
- Score: 3.067607520161916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance on
various computer vision applications. However, these models have considerable
storage and computational overheads, making their deployment and efficient
inference on edge devices challenging. Quantization is a promising approach to
reducing model complexity, and the dyadic arithmetic pipeline can allow the
quantized models to perform efficient integer-only inference. Unfortunately,
dyadic arithmetic is based on the homogeneity condition in convolutional neural
networks, which is not applicable to the non-linear components in ViTs, making
integer-only inference of ViTs an open issue. In this paper, we propose I-ViT,
an integer-only quantization scheme for ViTs, to enable ViTs to perform the
entire computational graph of inference with integer arithmetic and
bit-shifting, and without any floating-point arithmetic. In I-ViT, linear
operations (e.g., MatMul and Dense) follow the integer-only pipeline with
dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and
LayerNorm) are approximated by the proposed light-weight integer-only
arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and
ShiftGELU, which are designed to use integer bit-shifting to approximate the
corresponding floating-point operations. We evaluate I-ViT on various benchmark
models and the results show that integer-only INT8 quantization achieves
comparable (or even slightly higher) accuracy to the full-precision (FP)
baseline. Furthermore, we utilize TVM for practical hardware deployment on the
GPU's integer arithmetic units, achieving 3.72$\sim$4.11$\times$ inference
speedup compared to the FP model. Code of both Pytorch and TVM is released at
https://github.com/zkkli/I-ViT.
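The dyadic pipeline and the shift-based non-linearities are the technical core of the abstract above. As a rough, hedged illustration (hypothetical helper names, NumPy for clarity, not the kernels released in the repository), the sketch below shows (a) rescaling an INT32 accumulator to INT8 with only an integer multiply and a rounding bit-shift, and (b) multiplying by a constant such as log2(e) using shifts alone, which is the flavor of approximation behind Shiftmax and ShiftGELU.

```python
import numpy as np

def to_dyadic(multiplier: float, shift: int = 15):
    """Offline step: approximate a positive float rescaling factor as b / 2**shift,
    so runtime rescaling needs only an integer multiply and a bit-shift."""
    b = int(round(multiplier * (1 << shift)))
    return b, shift

def dyadic_requantize(acc_int32: np.ndarray, b: int, c: int) -> np.ndarray:
    """Integer-only rescaling of an INT32 accumulator back to INT8:
    round(acc * b / 2**c), computed as (acc * b + 2**(c-1)) >> c."""
    prod = acc_int32.astype(np.int64) * b
    out = (prod + (1 << (c - 1))) >> c               # rounding right-shift
    return np.clip(out, -128, 127).astype(np.int8)

def shift_mul_log2e(x_int: np.ndarray) -> np.ndarray:
    """Shift-only approximation of x * log2(e), using log2(e) ~= 1 + 1/2 - 1/16;
    this kind of identity lets e**x be evaluated as a power of two."""
    return x_int + (x_int >> 1) - (x_int >> 4)

# Toy usage: requantize an INT32 accumulator with a scale ratio of 0.0123.
b, c = to_dyadic(0.0123)
acc = np.array([5000, -12000, 300], dtype=np.int32)
print(dyadic_requantize(acc, b, c))                  # INT8 outputs, no float ops at runtime
```

The dyadic pair (b, c) is derived offline from the quantization scales, so the deployed graph contains only integer multiplies, adds, and shifts, which is what lets the TVM kernels run entirely on the GPU's integer arithmetic units.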
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer [6.473688838974095]
We propose a new type of multiplication-reduced model, dubbed $\textbf{ShiftAddViT}$, to achieve end-to-end inference speedups on GPUs.
Experiments on various 2D/3D vision tasks consistently validate the effectiveness of our proposed ShiftAddViT.
arXiv Detail & Related papers (2023-06-10T13:53:41Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture
with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy with over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Integer Fine-tuning of Transformer-based Models [13.383066080742699]
We study the effect of various integer bit-widths to find the minimum required bit-width for integer fine-tuning of transformer-based models.
We show that 16-bit integer models match the floating-point baseline performance.
Further reducing the bit-width to 8 results in an average score drop of 1.7 points.
arXiv Detail & Related papers (2022-09-20T16:02:28Z) - LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. (A minimal sketch of vector-wise Int8 matrix multiplication appears after this list.)
arXiv Detail & Related papers (2022-08-15T17:08:50Z) - Is Integer Arithmetic Enough for Deep Learning Training? [2.9136421025415205]
Replacing floating-point arithmetic with low-bit integer arithmetic is a promising approach to reduce the energy consumption, memory footprint, and latency of deep learning models.
We propose a fully functional integer training pipeline including forward pass, back-propagation, and gradient descent.
Our experimental results show that our proposed method is effective in a wide variety of tasks such as classification (including vision transformers), object detection, and semantic segmentation.
arXiv Detail & Related papers (2022-07-18T22:36:57Z) - I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer-based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that I-BERT achieves similar (and slightly higher) accuracy compared to the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z) - NITI: Training Integer Neural Networks Using Integer-only Arithmetic [4.361357921751159]
We present NITI, an efficient deep neural network training framework that computes exclusively with integer arithmetic.
A proof-of-concept open-source software implementation of NITI that utilizes native 8-bit integer operations is presented.
NITI achieves negligible accuracy degradation on the MNIST and CIFAR10 datasets using 8-bit integer storage and computation.
arXiv Detail & Related papers (2020-09-28T07:41:36Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU, having found that the accuracy decline under quantization stems from activation quantization.
Our integer networks achieve performance equivalent to the corresponding floating-point networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
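Picking up the Bounded ReLU idea from the last entry above: a minimal sketch, assuming a fixed ReLU6-style clipping bound (the bound value and scale choice here are illustrative, not the paper's exact recipe). Clipping activations to a known bound fixes their dynamic range in advance, so they map to 8-bit integers with a constant scale and no runtime calibration.

```python
import numpy as np

BOUND = 6.0  # assumed clipping bound (ReLU6-style); the paper chooses/tunes its own

def bounded_relu(x: np.ndarray, bound: float = BOUND) -> np.ndarray:
    """Clip activations to [0, bound] so their range is known ahead of time."""
    return np.clip(x, 0.0, bound)

def quantize_activation(x: np.ndarray, bound: float = BOUND):
    """Uniform 8-bit quantization with a constant scale derived from the bound."""
    scale = bound / 255.0                          # fixed at compile time
    q = np.round(bounded_relu(x, bound) / scale).astype(np.uint8)
    return q, scale

x = np.array([-1.0, 0.5, 3.2, 9.0])
q, s = quantize_activation(x)
print(q, q * s)                                    # integer codes and dequantized values
```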
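Similarly, for the LLM.int8() entry above (referenced there): a minimal sketch of vector-wise Int8 matrix multiplication under assumed per-row/per-column absmax scaling; the paper's mixed-precision decomposition for outlier features is omitted.

```python
import numpy as np

def absmax_quantize(x: np.ndarray, axis: int):
    """Symmetric int8 quantization with one absmax scale per row (axis=1) or column (axis=0)."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)                # guard against all-zero vectors
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Int8 GEMM with int32 accumulation and a single float rescale at the end."""
    xq, sx = absmax_quantize(x, axis=1)            # one scale per activation row
    wq, sw = absmax_quantize(w, axis=0)            # one scale per weight column
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    return acc * (sx * sw)                         # outer product of scales -> (m, n)

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(int8_matmul(x, w) - x @ w)))   # small quantization error
```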
This list is automatically generated from the titles and abstracts of the papers on this site.