Co-advise: Cross Inductive Bias Distillation
- URL: http://arxiv.org/abs/2106.12378v1
- Date: Wed, 23 Jun 2021 13:19:59 GMT
- Title: Co-advise: Cross Inductive Bias Distillation
- Authors: Sucheng Ren, Zhengqi Gao, Tianyu Hua, Zihui Xue, Yonglong Tian,
Shengfeng He, Hang Zhao
- Abstract summary: We propose a novel distillation-based method to train vision transformers.
We introduce lightweight teachers with different architectural inductive biases to co-advise the student transformer.
Our vision transformers (termed CivT) outperform all previous transformers of the same architecture on ImageNet.
- Score: 39.61426495884721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have recently been adopted from the natural language
processing community as a promising substitute for convolution-based neural
networks in visual learning tasks. However, their advantage degrades when the
amount of training data is insufficient (e.g., ImageNet). To make them
practically useful, we propose a novel distillation-based method to train
vision transformers. Unlike previous works, which provide only heavy
convolution-based teachers, we introduce lightweight teachers with different
architectural inductive biases (e.g., convolution and involution) to co-advise
the student transformer. The key observation is that teachers with different
inductive biases attain different knowledge even though they are trained on
the same dataset, and this diverse knowledge compounds and boosts the
student's performance during distillation. Equipped with this cross inductive
bias distillation method, our vision transformers (termed CivT) outperform all
previous transformers of the same architecture on ImageNet.
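At training time, the co-advising idea reduces to a distillation loss that mixes the ground-truth signal with soft targets from two lightweight teachers of different inductive biases. Below is a minimal sketch of such a loss; the per-teacher student outputs, the temperature `tau`, and the weight `alpha` are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of a cross-inductive-bias distillation loss (not the paper's
# exact recipe): the student is assumed to emit one prediction for the class
# token and one per teacher; tau and alpha are placeholder hyperparameters.
import torch
import torch.nn.functional as F

def co_advise_loss(student_cls_logits, student_conv_logits, student_inv_logits,
                   conv_teacher_logits, inv_teacher_logits, labels,
                   tau=1.0, alpha=0.5):
    # Supervised cross-entropy on the class-token prediction.
    ce = F.cross_entropy(student_cls_logits, labels)

    # KL divergence to the convolution-based teacher's soft targets.
    kl_conv = F.kl_div(
        F.log_softmax(student_conv_logits / tau, dim=-1),
        F.softmax(conv_teacher_logits / tau, dim=-1),
        reduction="batchmean") * tau ** 2

    # KL divergence to the involution-based teacher's soft targets.
    kl_inv = F.kl_div(
        F.log_softmax(student_inv_logits / tau, dim=-1),
        F.softmax(inv_teacher_logits / tau, dim=-1),
        reduction="batchmean") * tau ** 2

    # The knowledge carried by the two inductive biases is simply averaged
    # here; other weightings between the teachers are possible.
    return alpha * ce + (1 - alpha) * 0.5 * (kl_conv + kl_inv)
```

Averaging the two KL terms is one simple way to let both inductive biases advise the student; the abstract does not specify how the teachers' signals are weighted.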
Related papers
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
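A minimal sketch of the weight-copy step this summary describes, assuming both teacher and student are ViT-style models that expose their transformer blocks as a `blocks` ModuleList; the stride of 2 and the function name are illustrative, not the paper's exact procedure.

```python
# Hypothetical weight-copy initialization: every `stride`-th teacher block is
# copied into a shallower student before LoRA-based few-shot distillation.
import torch.nn as nn

def copy_intermittent_blocks(teacher: nn.Module, student: nn.Module, stride: int = 2) -> None:
    teacher_blocks = list(teacher.blocks)[::stride]  # assumes a `.blocks` ModuleList
    assert len(teacher_blocks) == len(student.blocks), \
        "student depth must match the number of copied teacher blocks"
    for t_block, s_block in zip(teacher_blocks, student.blocks):
        # Copy all parameters and buffers of the selected teacher block.
        s_block.load_state_dict(t_block.state_dict())
```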
arXiv Detail & Related papers (2024-04-14T18:57:38Z)
- Distilling Inductive Bias: Knowledge Distillation Beyond Model Compression [6.508088032296086]
Vision Transformers (ViTs) offer the tantalizing prospect of unified information processing across visual and textual domains.
We introduce an innovative ensemble-based distillation approach that distills inductive biases from complementary lightweight teacher models.
Our proposed framework also involves precomputing and storing the logits (the unnormalized predictions of the model) in advance.
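A minimal sketch of precomputing and caching such logits so the teachers never have to be re-run during student training; the dataloader format, device, and file path are assumptions for illustration only.

```python
# Hypothetical one-off pass that caches a teacher's raw logits to disk.
import torch

@torch.no_grad()
def precompute_logits(teacher, dataloader, device="cuda", path="teacher_logits.pt"):
    teacher.eval().to(device)
    cached = []
    for images, _ in dataloader:
        # Store unnormalized predictions; a temperature-scaled softmax can be
        # applied later inside the distillation loss.
        cached.append(teacher(images.to(device)).cpu())
    torch.save(torch.cat(cached), path)
```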
arXiv Detail & Related papers (2023-09-30T13:21:29Z)
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization of the recent Hyena layer to multiple axes.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Cross-Architecture Knowledge Distillation [32.689574589575244]
It is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN).
To this end, a novel cross-architecture knowledge distillation method is proposed.
The proposed method outperforms 14 state-of-the-art methods on both small-scale and large-scale datasets.
arXiv Detail & Related papers (2022-07-12T02:50:48Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and their divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Training data-efficient image transformers & distillation through attention [93.22667339525832]
We produce a competitive convolution-free transformer by training on ImageNet only.
Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1%.
arXiv Detail & Related papers (2020-12-23T18:42:10Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.