MiniViT: Compressing Vision Transformers with Weight Multiplexing
- URL: http://arxiv.org/abs/2204.07154v1
- Date: Thu, 14 Apr 2022 17:59:05 GMT
- Title: MiniViT: Compressing Vision Transformers with Weight Multiplexing
- Authors: Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong
Fu, Lu Yuan
- Abstract summary: Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability.
MiniViT is a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance.
- Score: 88.54212027516755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) models have recently drawn much attention in
computer vision due to their high model capability. However, ViT models suffer
from a huge number of parameters, restricting their applicability to devices with
limited memory. To alleviate this problem, we propose MiniViT, a new
compression framework, which achieves parameter reduction in vision
transformers while retaining the same performance. The central idea of MiniViT
is to multiplex the weights of consecutive transformer blocks. More
specifically, we make the weights shared across layers, while imposing a
transformation on the weights to increase diversity. Weight distillation over
self-attention is also applied to transfer knowledge from large-scale ViT
models to weight-multiplexed compact models. Comprehensive experiments
demonstrate the efficacy of MiniViT, showing that it can reduce the size of the
pre-trained Swin-B transformer by 48%, while achieving an increase of 1.0% in
Top-1 accuracy on ImageNet. Moreover, using a single layer of parameters,
MiniViT is able to compress DeiT-B by 9.7 times from 86M to 9M parameters,
without seriously compromising the performance. Finally, we verify the
transferability of MiniViT by reporting its performance on downstream
benchmarks. Code and models are available here.
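To make the weight-multiplexing idea concrete, below is a minimal PyTorch sketch: one transformer block's parameters are reused by every layer, with a small per-layer linear adapter standing in for the transformation that restores diversity across layers. The block structure, adapter design, and initialization are illustrative assumptions rather than MiniViT's actual implementation, and the self-attention weight distillation step is omitted.

```python
# Minimal sketch of weight multiplexing across transformer blocks (PyTorch).
# Assumptions (not from the paper's code): the shared block is a standard
# pre-norm ViT block, and the per-layer "transformation" is a near-identity
# linear adapter on the block output. MiniViT's actual transformations act
# on the attention/FFN weights themselves; this only illustrates sharing.
import torch
import torch.nn as nn


class SharedBlock(nn.Module):
    """One transformer block whose parameters are reused by every layer."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class MultiplexedEncoder(nn.Module):
    """Stack of `depth` layers that all reuse one SharedBlock, plus tiny
    per-layer adapters so the layers are not strictly identical."""

    def __init__(self, dim: int, num_heads: int, depth: int):
        super().__init__()
        self.shared = SharedBlock(dim, num_heads)       # parameters stored once
        self.adapters = nn.ModuleList(                  # cheap per-layer transforms
            nn.Linear(dim, dim) for _ in range(depth))
        for a in self.adapters:                         # start close to identity
            nn.init.eye_(a.weight)
            nn.init.zeros_(a.bias)

    def forward(self, x):
        for adapter in self.adapters:
            x = adapter(self.shared(x))
        return x


if __name__ == "__main__":
    enc = MultiplexedEncoder(dim=192, num_heads=3, depth=12)
    tokens = torch.randn(2, 197, 192)                   # (batch, tokens, dim)
    print(enc(tokens).shape)                            # torch.Size([2, 197, 192])
    print(sum(p.numel() for p in enc.parameters()))     # far fewer than 12 distinct blocks
```

Because all twelve layers reference the same SharedBlock, the parameter count stays close to that of a single block; each adapter adds only a small fraction of a full block's parameters per layer.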
Related papers
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT)
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- TinyViT: Fast Pretraining Distillation for Small Vision Transformers [88.54212027516755]
We propose TinyViT, a new family of tiny and efficient small vision transformers pretrained on large-scale datasets.
The central idea is to transfer knowledge from large pretrained models to small ones, while enabling small models to get the dividends of massive pretraining data.
arXiv Detail & Related papers (2022-07-21T17:59:56Z)
- Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models with even performance increase.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z)
- Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
arXiv Detail & Related papers (2022-01-24T16:42:56Z)
- TerViT: An Efficient Ternary Vision Transformer [21.348788407233265]
Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory cost problems when deployed on resource-constrained devices.
We introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, a process challenged by the large loss-surface gap between real-valued and ternary parameters.
arXiv Detail & Related papers (2022-01-20T08:29:19Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)