TerViT: An Efficient Ternary Vision Transformer
- URL: http://arxiv.org/abs/2201.08050v2
- Date: Fri, 21 Jan 2022 05:22:32 GMT
- Title: TerViT: An Efficient Ternary Vision Transformer
- Authors: Sheng Xu, Yanjing Li, Teli Ma, Bohan Zeng, Baochang Zhang, Peng Gao
and Jinhu Lv
- Abstract summary: Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a process challenged by the large loss-surface gap between real-valued and ternary parameters.
- Score: 21.348788407233265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have demonstrated great potential in various
visual tasks, but suffer from expensive computational and memory cost problems
when deployed on resource-constrained devices. In this paper, we introduce a
ternary vision transformer (TerViT) to ternarize the weights in ViTs, a
process challenged by the large loss-surface gap between real-valued and
ternary parameters. To address this issue, we introduce a progressive
training scheme that first trains an 8-bit transformer and then TerViT,
achieving better optimization than conventional methods. Furthermore, we
introduce channel-wise ternarization, which partitions each weight matrix
into channels, each with its own distribution and ternarization interval.
We apply our methods
to popular DeiT and Swin backbones, and extensive results show that we can
achieve competitive performance. For example, TerViT quantizes Swin-S to a
13.1 MB model while achieving above 79% Top-1 accuracy on the ImageNet dataset.
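As a rough illustration of the progressive training scheme described in the abstract (train an 8-bit model first, then switch to ternary weights), the sketch below fake-quantizes the weights of a toy linear classifier with a straight-through estimator and swaps the quantizer partway through training. The toy model, data, thresholds, and STE details are illustrative assumptions, not the authors' actual recipe.
```python
import torch
import torch.nn as nn

def fake_quant_8bit(w: torch.Tensor) -> torch.Tensor:
    # Symmetric 8-bit fake quantization with a straight-through gradient.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127) * scale
    return w + (q - w).detach()

def fake_quant_ternary(w: torch.Tensor) -> torch.Tensor:
    # Element-wise ternarization to {-alpha, 0, +alpha}, again with an STE gradient.
    delta = 0.7 * w.abs().mean()                      # assumed threshold heuristic
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    q = alpha * torch.sign(w) * mask
    return w + (q - w).detach()

class QuantLinear(nn.Module):
    """Linear layer whose weights are fake-quantized on the fly."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.quant = fake_quant_8bit                  # stage 1: 8-bit weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.quant(self.weight).t()

model = QuantLinear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))

for step in range(200):
    if step == 100:
        model.quant = fake_quant_ternary              # stage 2: switch to ternary weights
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```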
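Channel-wise ternarization, as described in the abstract, can be read as giving each output channel (row) of a weight matrix its own ternarization interval and scaling factor. The sketch below follows that reading; the 0.7 * mean(|w|) threshold and mean-magnitude scale are borrowed from ternary weight networks and are assumptions, not details confirmed by TerViT.
```python
import torch

def ternarize_channelwise(weight: torch.Tensor) -> torch.Tensor:
    """Ternarize a 2-D weight matrix one output channel (row) at a time."""
    ternary = torch.zeros_like(weight)
    for c in range(weight.shape[0]):
        w_c = weight[c]
        delta_c = 0.7 * w_c.abs().mean()               # per-channel threshold (assumed heuristic)
        mask = w_c.abs() > delta_c                     # positions kept as non-zero
        if mask.any():
            alpha_c = w_c[mask].abs().mean()           # per-channel scaling factor
        else:
            alpha_c = w_c.new_tensor(0.0)
        ternary[c] = alpha_c * torch.sign(w_c) * mask.float()  # values in {-alpha_c, 0, +alpha_c}
    return ternary

weight = torch.randn(4, 8)                             # toy weight matrix
print(ternarize_channelwise(weight))
```
This sketch only shows the per-channel structure; in practice the threshold and scale would be maintained per channel during training rather than computed once post hoc.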
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - AdaptFormer: Adapting Vision Transformers for Scalable Visual
Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformers, namely AdaptFormer.
It can adapt the pre-trained ViTs into many different image and video tasks efficiently.
It is able to increase the ViT's transferability without updating its original pre-trained parameters.
arXiv Detail & Related papers (2022-05-26T17:56:15Z) - Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even improving performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z) - TRT-ViT: TensorRT-oriented Vision Transformer [19.173764508139016]
A family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT.
Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers.
arXiv Detail & Related papers (2022-05-19T14:20:25Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
The vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - ConvNets vs. Transformers: Whose Visual Representations are More
Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)