TerViT: An Efficient Ternary Vision Transformer
- URL: http://arxiv.org/abs/2201.08050v2
- Date: Fri, 21 Jan 2022 05:22:32 GMT
- Title: TerViT: An Efficient Ternary Vision Transformer
- Authors: Sheng Xu, Yanjing Li, Teli Ma, Bohan Zeng, Baochang Zhang, Peng Gao
and Jinhu Lv
- Abstract summary: Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a process challenged by the large loss-surface gap between real-valued and ternary parameters.
- Score: 21.348788407233265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have demonstrated great potential in various
visual tasks, but suffer from expensive computational and memory cost problems
when deployed on resource-constrained devices. In this paper, we introduce a
ternary vision transformer (TerViT) to ternarize the weights in ViTs, a
process challenged by the large loss-surface gap between real-valued and
ternary parameters. To address this issue, we introduce a progressive
training scheme that first trains an 8-bit transformer and then TerViT,
achieving better optimization than conventional methods. Furthermore, we
introduce channel-wise ternarization, which partitions each weight matrix
into channels, each with its own distribution and ternarization interval.
We apply our methods
to popular DeiT and Swin backbones, and extensive results show that we can
achieve competitive performance. For example, TerViT quantizes Swin-S to a
13.1 MB model while achieving above 79% Top-1 accuracy on the ImageNet dataset.
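As a rough illustration of the progressive training scheme described in the abstract (train an 8-bit model first, then switch to ternary weights), the sketch below fake-quantizes the weights of a toy linear classifier with a straight-through estimator and swaps the quantizer partway through training. The toy model, data, thresholds, and STE details are illustrative assumptions, not the authors' actual recipe.
```python
import torch
import torch.nn as nn

def fake_quant_8bit(w: torch.Tensor) -> torch.Tensor:
    # Symmetric 8-bit fake quantization with a straight-through gradient.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-127, 127) * scale
    return w + (q - w).detach()

def fake_quant_ternary(w: torch.Tensor) -> torch.Tensor:
    # Element-wise ternarization to {-alpha, 0, +alpha}, again with an STE gradient.
    delta = 0.7 * w.abs().mean()                      # assumed threshold heuristic
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    q = alpha * torch.sign(w) * mask
    return w + (q - w).detach()

class QuantLinear(nn.Module):
    """Linear layer whose weights are fake-quantized on the fly."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.quant = fake_quant_8bit                  # stage 1: 8-bit weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.quant(self.weight).t()

model = QuantLinear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))

for step in range(200):
    if step == 100:
        model.quant = fake_quant_ternary              # stage 2: switch to ternary weights
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```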
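Channel-wise ternarization, as described in the abstract, can be read as giving each output channel (row) of a weight matrix its own ternarization interval and scaling factor. The sketch below follows that reading; the 0.7 * mean(|w|) threshold and mean-magnitude scale are borrowed from ternary weight networks and are assumptions, not details confirmed by TerViT.
```python
import torch

def ternarize_channelwise(weight: torch.Tensor) -> torch.Tensor:
    """Ternarize a 2-D weight matrix one output channel (row) at a time."""
    ternary = torch.zeros_like(weight)
    for c in range(weight.shape[0]):
        w_c = weight[c]
        delta_c = 0.7 * w_c.abs().mean()               # per-channel threshold (assumed heuristic)
        mask = w_c.abs() > delta_c                     # positions kept as non-zero
        if mask.any():
            alpha_c = w_c[mask].abs().mean()           # per-channel scaling factor
        else:
            alpha_c = w_c.new_tensor(0.0)
        ternary[c] = alpha_c * torch.sign(w_c) * mask.float()  # values in {-alpha_c, 0, +alpha_c}
    return ternary

weight = torch.randn(4, 8)                             # toy weight matrix
print(ternarize_channelwise(weight))
```
This sketch only shows the per-channel structure; in practice the threshold and scale would be maintained per channel during training rather than computed once post hoc.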
Related papers
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
arXiv Detail & Related papers (2023-10-06T21:45:05Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - AdaptFormer: Adapting Vision Transformers for Scalable Visual
Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformers, namely AdaptFormer.
It can adapt the pre-trained ViTs into many different image and video tasks efficiently.
It is able to increase the ViT's transferability without updating its original pre-trained parameters.
arXiv Detail & Related papers (2022-05-26T17:56:15Z) - Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even improving performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z) - TRT-ViT: TensorRT-oriented Vision Transformer [19.173764508139016]
A family of TensorRT-oriented Transformers is presented, abbreviated as TRT-ViT.
Extensive experiments demonstrate that TRT-ViT significantly outperforms existing ConvNets and vision Transformers.
arXiv Detail & Related papers (2022-05-19T14:20:25Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
The vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - ConvNets vs. Transformers: Whose Visual Representations are More
Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)