Semi-supervised Vision Transformers at Scale
- URL: http://arxiv.org/abs/2208.05688v1
- Date: Thu, 11 Aug 2022 08:11:54 GMT
- Title: Semi-supervised Vision Transformers at Scale
- Authors: Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide
Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto
- Abstract summary: We study semi-supervised learning (SSL) for vision transformers (ViT).
We propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning.
Our proposed method, dubbed Semi-ViT, achieves performance comparable to or better than its CNN counterparts in the semi-supervised classification setting.
- Score: 93.0621675558895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study semi-supervised learning (SSL) for vision transformers (ViT), an
under-explored topic despite the wide adoption of ViT architectures in
different tasks. To tackle this problem, we propose a new SSL pipeline,
consisting of first un/self-supervised pre-training, followed by supervised
fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised
fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher
framework instead of the popular FixMatch, since the former is more stable and
delivers higher accuracy for semi-supervised vision transformers. In addition,
we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled
samples and their pseudo labels for improved regularization, which is important
for training ViTs with weak inductive bias. Our proposed method, dubbed
Semi-ViT, achieves performance comparable to or better than its CNN counterparts
in the semi-supervised classification setting. Semi-ViT also enjoys the
scalability benefits of ViTs and can be readily scaled up to large-size models
with increasing accuracy. For example, Semi-ViT-Huge achieves an impressive
80% top-1 accuracy on ImageNet using only 1% of the labels, which is comparable
to Inception-v4 trained with 100% of the ImageNet labels.
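The EMA-Teacher stage described above can be illustrated with a minimal sketch. This is a generic EMA-Teacher recipe under stated assumptions, not the paper's exact implementation: the hyper-parameter names (ema_decay, conf_threshold) and the confidence-based masking of pseudo labels are assumptions for illustration.

```python
# Minimal sketch of an EMA-Teacher step for semi-supervised fine-tuning.
# Hyper-parameters (ema_decay, conf_threshold) and the confidence mask are
# illustrative assumptions, not the paper's exact recipe.
import copy

import torch
import torch.nn.functional as F


def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """The teacher starts as a frozen copy of the fine-tuned student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher


@torch.no_grad()
def update_teacher(student, teacher, ema_decay: float = 0.999) -> None:
    """Teacher weights track an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)


def semi_supervised_step(student, teacher, optimizer,
                         x_labeled, y_labeled, x_unlabeled,
                         conf_threshold: float = 0.7):
    # 1) The teacher produces pseudo labels for the unlabeled batch.
    with torch.no_grad():
        probs = F.softmax(teacher(x_unlabeled), dim=-1)
        conf, pseudo_y = probs.max(dim=-1)
        mask = (conf >= conf_threshold).float()  # keep only confident predictions

    # 2) The student is trained on labeled data plus confident pseudo labels.
    sup_loss = F.cross_entropy(student(x_labeled), y_labeled)
    unsup_loss = (F.cross_entropy(student(x_unlabeled), pseudo_y,
                                  reduction="none") * mask).mean()
    loss = sup_loss + unsup_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) The teacher is updated as an EMA of the student (no gradient step).
    update_teacher(student, teacher)
    return loss.item()
```

In the pipeline described in the abstract, this step would run only after un/self-supervised pre-training and supervised fine-tuning have produced the initial student weights.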
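The pseudo mixup idea, interpolating unlabeled samples together with their pseudo labels, can likewise be sketched. The abstract does not spell out the probabilistic weighting, so the Beta-sampled mixing coefficient and the use of soft teacher probabilities below are assumptions, shown only as one plausible instantiation.

```python
# Hedged sketch of mixup over unlabeled samples and their pseudo labels;
# the Beta(alpha, alpha) mixing and soft targets are illustrative assumptions.
import torch
import torch.nn.functional as F


def pseudo_mixup(x_unlabeled: torch.Tensor, pseudo_probs: torch.Tensor,
                 alpha: float = 0.8):
    """Interpolate unlabeled images (B, C, H, W) and their soft pseudo labels
    (B, num_classes) with a shared Beta-sampled coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x_unlabeled.size(0))

    x_mix = lam * x_unlabeled + (1.0 - lam) * x_unlabeled[perm]
    y_mix = lam * pseudo_probs + (1.0 - lam) * pseudo_probs[perm]
    return x_mix, y_mix


def soft_target_loss(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against the mixed (soft) pseudo labels."""
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

In a training step, y_mix would replace the hard pseudo labels in the unsupervised term, e.g. soft_target_loss(student(x_mix), y_mix).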
Related papers
- Reviving Shift Equivariance in Vision Transformers [12.720600348466498]
We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models.
Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shift.
arXiv Detail & Related papers (2023-06-13T00:13:11Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates fine-tuning ViTs in few-shot learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer [4.2141621237414615]
Self-supervised representation learning (SSL) methods provide an effective label-free initial condition for fine-tuning downstream tasks.
We re-interpret SSL fine-tuning through the lens of Bayesian continual learning and consider regularization via the Elastic Weight Consolidation (EWC) framework; a hedged sketch of the EWC penalty appears after this list.
We demonstrate that self-regularization against an initial SSL backbone improves worst sub-group performance in Waterbirds by 5% and Celeb-A by 2%.
arXiv Detail & Related papers (2022-10-28T19:00:25Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple but effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy [111.49944789602884]
This paper systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space.
We propose corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the model to capture more discriminative information.
arXiv Detail & Related papers (2022-03-12T04:48:12Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled to be deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity; a hedged sketch of this head-mixing idea appears after this list.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
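For the Elastic Weight Consolidation entry above, the EWC penalty that the summary refers to can be sketched as follows. Regularizing fine-tuning toward the initial self-supervised backbone comes from the summary; the diagonal Fisher weighting and the lambda_ewc coefficient are the standard EWC choices, assumed here for illustration.

```python
# Hedged sketch of the standard EWC penalty: pull fine-tuned weights toward a
# snapshot of the SSL backbone, weighted by a diagonal Fisher estimate.
import torch


def ewc_penalty(model: torch.nn.Module,
                ssl_params: dict,    # {name: tensor} snapshot of the SSL backbone
                fisher_diag: dict,   # {name: tensor} diagonal Fisher estimates
                lambda_ewc: float = 1.0):
    """Computes lambda/2 * sum_i F_i * (theta_i - theta*_i)^2."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in ssl_params:
            penalty = penalty + (fisher_diag[name] * (p - ssl_params[name]) ** 2).sum()
    return 0.5 * lambda_ewc * penalty


# Illustrative usage: total_loss = task_loss + ewc_penalty(model, ssl_params, fisher_diag)
```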
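For the DeepViT entry, the Re-attention idea of re-generating attention maps can be sketched as a learnable mixing of per-head attention maps. The module below follows the general description only; the exact placement of the mixing matrix and the normalization are assumptions.

```python
# Hedged sketch of a Re-attention style block: per-head attention maps are
# recombined with a learnable H x H matrix before being applied to the values.
# Layout and normalization details are assumptions, not the paper's exact code.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(num_heads))  # learnable head-mixing matrix
        self.norm = nn.BatchNorm2d(num_heads)            # normalize the mixed maps
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)

        # Re-attention: mix attention maps across heads to increase diversity.
        attn = torch.einsum("hg,bgnm->bhnm", self.theta, attn)
        attn = self.norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```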