Bi-ViT: Pushing the Limit of Vision Transformer Quantization
- URL: http://arxiv.org/abs/2305.12354v1
- Date: Sun, 21 May 2023 05:24:43 GMT
- Title: Bi-ViT: Pushing the Limit of Vision Transformer Quantization
- Authors: Yanjing Li, Sheng Xu, Mingbao Lin, Xianbin Cao, Chuanjian Liu, Xiao
Sun, Baochang Zhang
- Abstract summary: Vision transformer (ViT) quantization offers a promising prospect for deploying large pre-trained networks on resource-limited devices.
We introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses.
We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework.
- Score: 38.24456467950003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformer (ViT) quantization offers a promising prospect for
deploying large pre-trained networks on resource-limited devices.
Fully binarized ViTs (Bi-ViT), which push ViT quantization to its limit, remain
largely unexplored and still very challenging due to their unacceptable
performance. Through extensive empirical analyses, we identify that the severe
performance drop in ViT binarization is caused by attention distortion in
self-attention, which technically stems from gradient vanishing and ranking
disorder. To address these issues, we first introduce a learnable scaling
factor to reactivate the vanished gradients and illustrate its effectiveness
through theoretical and experimental analyses. We then propose a ranking-aware
distillation method to rectify the disordered ranking in a teacher-student
framework. Bi-ViT achieves significant improvements over popular DeiT and Swin
backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and
Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4%
respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms
of FLOPs compared with their real-valued counterparts on ImageNet.
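To make the two remedies above more concrete, here is a minimal PyTorch-style sketch, assuming a clipped straight-through estimator (STE) for the hard sign; the per-head parameter `alpha`, the centering step, and the pairwise hinge form of the ranking loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SignSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients only where the input lies inside [-1, 1].
        return grad_output * (x.abs() <= 1).float()


class ScaledBinaryAttention(nn.Module):
    """Binarize attention weights after rescaling them with a learnable
    per-head factor alpha (an illustrative stand-in for the learnable
    scaling factor described in the abstract)."""

    def __init__(self, num_heads: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, num_heads, 1, 1))

    def forward(self, attn_scores: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        attn = attn_scores.softmax(dim=-1)                 # real-valued, in [0, 1]
        centered = self.alpha * (attn - attn.mean(dim=-1, keepdim=True))
        # In the backward pass the STE multiplies the incoming gradient by
        # alpha, so a learned scale can directly amplify gradients that would
        # otherwise vanish through the hard sign.
        bin_attn = 0.5 * (SignSTE.apply(centered) + 1.0)   # values in {0, 1}
        return bin_attn @ v


def ranking_distillation_loss(student_attn: torch.Tensor,
                              teacher_attn: torch.Tensor,
                              margin: float = 0.0) -> torch.Tensor:
    """Pairwise hinge loss that pushes the student's attention to preserve
    the teacher's ranking of keys (a generic formulation, not necessarily
    the paper's exact ranking-aware distillation objective)."""
    s_diff = student_attn.unsqueeze(-1) - student_attn.unsqueeze(-2)
    t_diff = teacher_attn.unsqueeze(-1) - teacher_attn.unsqueeze(-2)
    # Penalize key pairs whose ordering in the student disagrees with the teacher.
    return torch.relu(margin - s_diff * torch.sign(t_diff)).mean()
```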
Related papers
- MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision
Transformer [7.041718444626999]
We propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT).
Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset.
arXiv Detail & Related papers (2024-01-26T14:25:15Z) - Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT)
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z) - BinaryViT: Towards Efficient and Accurate Binary Vision Transformers [4.339315098369913]
Vision Transformers (ViTs) have emerged as the fundamental architecture for most computer vision fields.
As one of the most powerful compression methods, binarization reduces the computation of the neural network by quantizing the weights and activation values to $\pm$1.
Existing binarization methods have demonstrated excellent performance on CNNs, but the full binarization of ViTs is still under-studied and suffers a significant performance drop.
arXiv Detail & Related papers (2023-05-24T05:06:59Z) - Towards Accurate Post-Training Quantization for Vision Transformer [48.779346466374406]
Existing post-training quantization methods still cause severe performance drops.
APQ-ViT surpasses the existing post-training quantization methods by convincing margins.
arXiv Detail & Related papers (2023-03-25T03:05:26Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Semi-supervised Vision Transformers at Scale [93.0621675558895]
We study semi-supervised learning (SSL) for vision transformers (ViT).
We propose a new SSL pipeline consisting of unsupervised or self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning.
Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting.
arXiv Detail & Related papers (2022-08-11T08:11:54Z) - The Principle of Diversity: Training Stronger Vision Transformers Calls
for Reducing All Levels of Redundancy [111.49944789602884]
This paper systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space.
We propose corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the model to capture more discriminative information.
arXiv Detail & Related papers (2022-03-12T04:48:12Z) - Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% while also achieving a 2.01x throughput improvement.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)