Slicing Vision Transformer for Flexible Inference
- URL: http://arxiv.org/abs/2412.04786v1
- Date: Fri, 06 Dec 2024 05:31:42 GMT
- Title: Slicing Vision Transformer for Flexible Inference
- Authors: Yitian Zhang, Huseyin Coskun, Xu Ma, Huan Wang, Ke Ma, Xi Chen, Derek Hao Hu, Yun Fu
- Abstract summary: We propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs.
Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
- Score: 79.35046907288518
- Abstract: Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit environments with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically sub-networks of a larger ViT with different widths. Thus, we propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViTs to vary in width. Concretely, Scala activates several subnets during training, introduces Isolated Activation to disentangle the smallest sub-network from other subnets, and leverages Scale Coordination to ensure each sub-network receives simplified, steady, and accurate learning objectives. Comprehensive empirical validations on different tasks demonstrate that with only one-shot training, Scala learns slimmable representation without modifying the original ViT structure and matches the performance of Separate Training. Compared with the prior art, Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
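The abstract's observation that smaller ViTs are width-wise sub-networks of a larger ViT can be illustrated with a minimal sketch. It assumes, purely for illustration, that a narrower layer is obtained by keeping the leading rows and columns of a dense layer's weight matrix. The names `slice_linear` and `linear` are hypothetical and are not taken from the paper; Isolated Activation and Scale Coordination are not modeled here.

```python
# Hypothetical sketch of width slicing for one dense layer.
# Assumption: a sub-network reuses the leading channels of the full layer.
# This is not the authors' implementation.

def slice_linear(weight, bias, width_ratio):
    """Keep the leading fraction of output rows and input columns."""
    out_keep = max(1, int(len(weight) * width_ratio))
    in_keep = max(1, int(len(weight[0]) * width_ratio))
    sub_weight = [row[:in_keep] for row in weight[:out_keep]]
    sub_bias = bias[:out_keep]
    return sub_weight, sub_bias


def linear(x, weight, bias):
    """Compute y = W x + b for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Full-width layer: 4 inputs -> 4 outputs.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
b = [0.5, 0.5, 0.5, 0.5]
x = [1.0, 1.0, 1.0, 1.0]

full_out = linear(x, W, b)

# Half-width sub-network shares the top-left 2x2 block of the same weights.
W_half, b_half = slice_linear(W, b, 0.5)
half_out = linear(x[:2], W_half, b_half)

print(full_out)
print(half_out)
```

Because the sliced layer shares storage with the full layer in this sketch, one set of parameters can serve several widths, which is the property the abstract describes as flexible inference.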
Related papers
- Applying ViT in Generalized Few-shot Semantic Segmentation [0.0]
This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework.
We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models.
We demonstrate the great potential of large pretrained ViT-based models on the GFSS task, and expect further improvement on testing benchmarks.
arXiv Detail & Related papers (2024-08-27T11:04:53Z)
- Merging Vision Transformers from Different Tasks and Domains [46.40701388197936]
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model merging works focus on either CNNs or NLP models, leaving the ViTs merging research untouched.
arXiv Detail & Related papers (2023-12-25T09:32:28Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in few-shot learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised ViTs provide new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
arXiv Detail & Related papers (2021-04-29T12:28:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.