Vision Transformer Slimming: Multi-Dimension Searching in Continuous
Optimization Space
- URL: http://arxiv.org/abs/2201.00814v1
- Date: Mon, 3 Jan 2022 18:59:54 GMT
- Title: Vision Transformer Slimming: Multi-Dimension Searching in Continuous
Optimization Space
- Authors: Arnav Chavan and Zhiqiang Shen and Zhuang Liu and Zechun Liu and
Kwang-Ting Cheng and Eric Xing
- Abstract summary: We introduce a pure vision transformer slimming (ViT-Slim) framework that can search such a sub-structure across multiple dimensions.
Our method is based on a learnable and unified l1 sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions.
Our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by 0.6% on ImageNet.
- Score: 35.04846842178276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores the feasibility of finding an optimal sub-model from a
vision transformer and introduces a pure vision transformer slimming (ViT-Slim)
framework that can search such a sub-structure from the original model
end-to-end across multiple dimensions, including the input tokens, MHSA, and MLP
modules, while achieving state-of-the-art performance. Our method is based on a learnable
and unified l1 sparsity constraint with pre-defined factors to reflect the
global importance in the continuous searching space of different dimensions.
The searching process is highly efficient through a single-shot training
scheme: for instance, on DeiT-S, ViT-Slim takes only ~43 GPU hours for the
search, and the searched structure is flexible, with diverse dimensionalities
across modules. A budget threshold is then applied according to the
accuracy-FLOPs trade-off required on the target device, and a re-training
process is performed to obtain the final models. Extensive experiments show
that our ViT-Slim can compress up to 40% of
parameters and 40% FLOPs on various vision transformers while increasing the
accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our
searched models on several downstream datasets. Our source code will be
publicly available.
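To make the recipe above concrete, here is a minimal PyTorch-style sketch of the core idea: a learnable mask per MLP hidden unit trained with an L1 penalty (the search phase), followed by a budget threshold that decides how many units to keep before re-training. This is an illustrative sketch under assumptions, not the released ViT-Slim code; names such as SlimmableMLP, l1_weight, and keep_ratio are invented here, and the actual method also searches the input tokens and MHSA dimensions with pre-defined importance factors.

```python
# Illustrative sketch only (not the authors' implementation): a learnable soft
# mask over the MLP hidden dimension, trained with an L1 sparsity penalty, then
# thresholded against a budget to pick how many hidden units to keep.
import torch
import torch.nn as nn


class SlimmableMLP(nn.Module):
    """Transformer MLP block with a learnable importance mask on its hidden units."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()
        # One continuous mask value per hidden unit, initialised to 1 (keep everything).
        self.mask = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The mask scales each hidden unit; its learned magnitude acts as an importance score.
        return self.fc2(self.act(self.fc1(x)) * self.mask)

    def sparsity_loss(self) -> torch.Tensor:
        # L1 penalty drives unimportant hidden units toward zero during the search.
        return self.mask.abs().sum()


def slim(mlp: SlimmableMLP, keep_ratio: float) -> int:
    """Budget threshold: keep only the top `keep_ratio` fraction of hidden units."""
    with torch.no_grad():
        k = max(1, int(keep_ratio * mlp.mask.numel()))
        threshold = mlp.mask.abs().topk(k).values.min()
        kept = int((mlp.mask.abs() >= threshold).sum())
    return kept  # the searched hidden width; the model is then re-trained at this width


if __name__ == "__main__":
    mlp = SlimmableMLP(dim=384, hidden_dim=1536)  # DeiT-S-like sizes, for illustration
    x = torch.randn(8, 197, 384)                  # (batch, tokens, dim)
    task_loss = mlp(x).pow(2).mean()              # placeholder for the real training loss
    l1_weight = 1e-4                              # assumed value for the sparsity factor
    loss = task_loss + l1_weight * mlp.sparsity_loss()
    loss.backward()
    print("hidden units kept at a 60% budget:", slim(mlp, keep_ratio=0.6))
```

In the full method, such masks are attached to every searchable dimension, and a single global threshold over the learned importances selects the sub-structure that meets the FLOPs budget before re-training.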
Related papers
- ED-ViT: Splitting Vision Transformer for Distributed Inference on Edge Devices [13.533267828812455]
We propose a novel Vision Transformer splitting framework, ED-ViT, to execute complex models across multiple edge devices efficiently.
Specifically, we partition Vision Transformer models into several sub-models, where each sub-model is tailored to handle a specific subset of data classes.
We conduct extensive experiments on five datasets with three model structures, demonstrating that our approach significantly reduces inference latency on edge devices.
arXiv Detail & Related papers (2024-10-15T14:38:14Z)
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z)
- Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering [75.86788916930377]
We propose a bilaterally slimmable Transformer (BST) that can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and requires 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
The design is also validated on a recent transformer-based image recognition model (ViT), showing consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.