TVT: Training-Free Vision Transformer Search on Tiny Datasets
- URL: http://arxiv.org/abs/2311.14337v1
- Date: Fri, 24 Nov 2023 08:24:31 GMT
- Title: TVT: Training-Free Vision Transformer Search on Tiny Datasets
- Authors: Zimian Wei, Hengyue Pan, Lujun Li, Peijie Dong, Zhiliang Tian, Xin Niu, Dongsheng Li
- Abstract summary: Training-free Vision Transformer (ViT) architecture search is presented to find a better ViT with zero-cost proxies.
Our TVT searches for the best ViT for distilling with ConvNet teachers via our teacher-aware metric and student-capability metric.
- Score: 32.1204216324339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training-free Vision Transformer (ViT) architecture search is presented to
find a better ViT with zero-cost proxies. While ViTs achieve significant
distillation gains from CNN teacher models on small datasets, the current
zero-cost proxies in ViTs do not generalize well to the distillation training
paradigm according to our experimental observations. In this paper, for the
first time, we investigate how to search in a training-free manner with the
help of teacher models and devise an effective Training-free ViT (TVT) search
framework. First, we observe that the similarity of attention maps between the
ViT student and the ConvNet teacher notably affects distillation accuracy. Thus, we present a
teacher-aware metric conditioned on the feature attention relations between
teacher and student. Additionally, TVT employs the L2-Norm of the student's
weights as the student-capability metric to improve ranking consistency.
Finally, TVT searches for the best ViT for distilling with ConvNet teachers via
our teacher-aware metric and student-capability metric, resulting in impressive
gains in efficiency and effectiveness. Extensive experiments on various tiny
datasets and search spaces show that our TVT outperforms state-of-the-art
training-free search methods. The code will be released.
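To make the two proxies more concrete, here is a minimal sketch, assuming the teacher-aware metric is a cosine similarity between spatial attention maps derived from student (ViT) and teacher (ConvNet) features, and the student-capability metric is the L2-norm of the student's weights. The helper names, the attention-map construction, and the trade-off weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def spatial_attention(feat: torch.Tensor) -> torch.Tensor:
    """Collapse a feature map (B, C, H, W) into a normalized spatial attention map (B, H*W).
    ViT token features are assumed to have been reshaped into this spatial layout."""
    attn = feat.pow(2).mean(dim=1).flatten(1)  # per-location channel energy
    return F.normalize(attn, dim=1)


def teacher_aware_score(student_feats, teacher_feats) -> float:
    """Average cosine similarity between student (ViT) and teacher (ConvNet) attention maps."""
    sims = []
    for s, t in zip(student_feats, teacher_feats):
        if s.shape[-2:] != t.shape[-2:]:
            # hypothetical handling of mismatched resolutions
            t = F.interpolate(t, size=s.shape[-2:], mode="bilinear", align_corners=False)
        sims.append((spatial_attention(s) * spatial_attention(t)).sum(dim=1).mean())
    return torch.stack(sims).mean().item()


def student_capability_score(student: torch.nn.Module) -> float:
    """L2-norm of the student's weights, used as a capability proxy."""
    return torch.norm(torch.cat([p.detach().flatten() for p in student.parameters()])).item()


def tvt_proxy(student, student_feats, teacher_feats, alpha: float = 1.0) -> float:
    """Combine the two metrics; alpha is a hypothetical trade-off weight."""
    return teacher_aware_score(student_feats, teacher_feats) + alpha * student_capability_score(student)
```

Under this reading, a training-free search would score each randomly sampled candidate ViT on a single mini-batch with this proxy and keep the highest-ranked architecture for distillation; no candidate is ever trained.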
Related papers
- DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets [30.178427266135756]
Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks.
ViT requires a large amount of data for pre-training.
We introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets.
arXiv Detail & Related papers (2024-04-03T17:58:21Z)
- Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation [12.177329445930276]
We propose a novel CNN-to-ViT KD framework, dubbed C2VKD.
We first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations.
We then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes.
arXiv Detail & Related papers (2023-10-11T07:45:37Z)
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference.
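As a rough illustration of that conversion step (not the paper's actual code), one hypothetical way to fold identically shaped experts back into a single FFN is to average their parameters:

```python
import copy
import torch
import torch.nn as nn


def average_experts_into_ffn(experts: list[nn.Module]) -> nn.Module:
    """Merge a bank of identically shaped expert FFNs into one FFN by
    parameter averaging (a hypothetical reading of the conversion step)."""
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            param.copy_(stacked.mean(dim=0))  # average each weight/bias across experts
    return merged
```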
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Training-free Transformer Architecture Search [89.88412583106741]
Vision Transformer (ViT) has achieved remarkable success in several computer vision tasks.
Current Transformer Architecture Search (TAS) methods are time-consuming, and existing zero-cost proxies designed for CNNs do not generalize well to the ViT search space.
In this paper, for the first time, we investigate how to conduct TAS in a training-free manner and devise an effective training-free TAS scheme.
arXiv Detail & Related papers (2022-03-23T06:06:54Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised training provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
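For context, the label-free self-distillation objective this summary refers to can be sketched roughly as a cross-entropy between the teacher's centered, sharpened output distribution and the student's distribution; the temperatures and centering term below are illustrative defaults, not necessarily the exact settings used.

```python
import torch
import torch.nn.functional as F


def self_distillation_loss(student_out: torch.Tensor,
                           teacher_out: torch.Tensor,
                           center: torch.Tensor,
                           tau_student: float = 0.1,
                           tau_teacher: float = 0.04) -> torch.Tensor:
    """Cross-entropy between the teacher's centered, sharpened distribution
    and the student's distribution; no ground-truth labels are used."""
    t = F.softmax((teacher_out - center) / tau_teacher, dim=-1).detach()
    s = F.log_softmax(student_out / tau_student, dim=-1)
    return -(t * s).sum(dim=-1).mean()
```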
arXiv Detail & Related papers (2021-04-29T12:28:51Z)