Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
- URL: http://arxiv.org/abs/2210.06707v1
- Date: Thu, 13 Oct 2022 04:00:29 GMT
- Title: Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
- Authors: Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, Guodong Guo
- Abstract summary: We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT)
Our method achieves much better performance than the prior arts.
- Score: 56.87383229709899
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pre-trained vision transformers (ViTs) have demonstrated
remarkable performance on various visual tasks, but suffer from expensive
computational and memory costs when deployed on resource-constrained devices.
Among the powerful compression approaches, quantization drastically reduces
computation and memory consumption through low-bit parameters and bit-wise
operations. However, low-bit ViTs remain largely unexplored and usually suffer
a significant performance drop compared with their real-valued counterparts.
In this work, through extensive empirical analysis, we first identify that the
bottleneck behind the severe performance drop is the information distortion of
the low-bit quantized self-attention map. We then develop an information
rectification module (IRM) and a distribution guided distillation (DGD) scheme
for fully quantized vision transformers (Q-ViT) to effectively eliminate such
distortion, leading to fully quantized ViTs. We evaluate our method on the
popular DeiT and Swin backbones. Extensive experimental results show that our
method achieves much better performance than the prior arts. For example, our
Q-ViT can theoretically accelerate ViT-S by 6.14x and achieves about 80.9%
Top-1 accuracy, even surpassing the full-precision counterpart by 1.0% on the
ImageNet dataset. Our code and models are available at
https://github.com/YanjingLi0202/Q-ViT
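To make the setting concrete, the following is a minimal PyTorch sketch of low-bit quantized self-attention with a simple distribution-correction step applied before quantization. It is not the authors' IRM/DGD implementation (see the repository above for that); the names (`QuantAttention`, `uniform_quantize`, `n_bits`) and the use of layer normalization as a stand-in for the learned rectification are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's implementation) of low-bit self-attention
# with a distribution-correction step before quantization. All names and the
# choice of layer normalization as the "rectification" step are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def uniform_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization with a straight-through estimator."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Straight-through: forward pass uses q, backward pass uses the identity.
    return x + (q - x).detach()


class QuantAttention(nn.Module):
    """Single-head self-attention with low-bit queries, keys, and values.

    Queries and keys are standardized before quantization so their
    distribution better matches the quantizer's range; this only illustrates
    the idea of correcting distortion in the quantized attention map.
    """

    def __init__(self, dim: int, n_bits: int = 4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.n_bits = n_bits
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Distribution correction before quantization (illustrative only).
        q = F.layer_norm(q, q.shape[-1:])
        k = F.layer_norm(k, k.shape[-1:])
        q, k, v = (uniform_quantize(t, self.n_bits) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)            # (batch, tokens, dim)
    out = QuantAttention(dim=64)(x)
    print(out.shape)                       # torch.Size([2, 16, 64])
```

The straight-through estimator keeps the quantizer differentiable, which is what allows this style of fully quantized model to be trained end to end rather than only quantized post hoc.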
Related papers
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision
Transformer [7.041718444626999]
We propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT)
Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset.
arXiv Detail & Related papers (2024-01-26T14:25:15Z) - BinaryViT: Towards Efficient and Accurate Binary Vision Transformers [4.339315098369913]
Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields.
As one of the most powerful compression methods, binarization reduces the computation of the neural network by quantizing the weights and activation values to ±1.
Existing binarization methods have demonstrated excellent performance on CNNs, but the full binarization of ViTs is still under-studied and suffers a significant performance drop.
arXiv Detail & Related papers (2023-05-24T05:06:59Z) - Bi-ViT: Pushing the Limit of Vision Transformer Quantization [38.24456467950003]
Vision transformer (ViT) quantization offers a promising prospect for deploying large pre-trained networks on resource-limited devices.
We introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses.
We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework.
arXiv Detail & Related papers (2023-05-21T05:24:43Z) - Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even increasing performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - TerViT: An Efficient Ternary Vision Transformer [21.348788407233265]
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, which is challenged by the large loss-surface gap between real-valued and ternary parameters.
arXiv Detail & Related papers (2022-01-20T08:29:19Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework, namely UP-ViTs, for the structural pruning of ViTs and their variants.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem for visual transformers by excavating redundant calculation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z) - When Vision Transformers Outperform ResNets without Pretraining or
Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)