Improving Vision Transformers for Incremental Learning
- URL: http://arxiv.org/abs/2112.06103v1
- Date: Sun, 12 Dec 2021 00:12:33 GMT
- Title: Improving Vision Transformers for Incremental Learning
- Authors: Pei Yu, Yinpeng Chen, Ying Jin, Zicheng Liu
- Abstract summary: This paper studies using Vision Transformers (ViT) in class incremental learning.
ViT converges very slowly when the number of classes is small.
More bias towards new classes is observed in ViT than in CNN-based models.
- Score: 17.276384689286168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies using Vision Transformers (ViT) in class incremental
learning. Surprisingly, naively applying ViT to replace convolutional neural networks
(CNNs) results in performance degradation. Our analysis reveals three issues with
naively using ViT: (a) ViT converges very slowly when the number of classes is small,
(b) ViT exhibits more bias towards new classes than CNN-based models, and (c) the proper
learning rate of ViT is too low to learn a good classifier. Based on this analysis, we
show these issues can be addressed simply with existing techniques: a convolutional stem,
balanced finetuning to correct bias, and a higher learning rate for the classifier. Our
simple solution, named ViTIL (ViT for Incremental Learning), achieves a new
state-of-the-art for all three class incremental learning setups by a clear margin,
providing a strong baseline for the research community. For instance, on ImageNet-1000,
ViTIL achieves 69.20% top-1 accuracy under the protocol of 500 initial classes with 5
incremental steps (100 new classes each), outperforming LUCIR+DDE by 1.69%. Under the
more challenging protocol of 10 incremental steps (100 new classes each), our method
outperforms PODNet by 7.27% (65.13% vs. 57.86%).
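To make the three fixes concrete, the sketch below shows a toy PyTorch version of each: a convolutional stem in place of the single large-stride patch embedding, a separate (roughly 10x higher) learning rate for the classifier head, and balanced finetuning noted as a training-schedule step. This is a minimal illustration rather than the authors' released code; the stand-in backbone, layer sizes, and hyperparameters are assumptions.

```python
# Minimal sketch (not the authors' released code) of the three fixes, assuming
# a small stand-in ViT; all sizes, depths, and learning rates are illustrative.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Fix (a): a stack of stride-2 3x3 convolutions (overall stride 16)
    replacing the single large-stride patch-embedding convolution."""
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        chans = [in_chans, 48, 96, 192, embed_dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # (B, 3, H, W)
        x = self.stem(x)                       # (B, C, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence

class TinyViT(nn.Module):
    """Stand-in backbone: conv stem + a few transformer blocks + linear head."""
    def __init__(self, embed_dim=384, depth=2, num_classes=600):
        super().__init__()
        self.stem = ConvStem(embed_dim=embed_dim)
        block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=6,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.blocks(self.stem(x))
        return self.head(tokens.mean(dim=1))   # mean-pooled tokens -> class logits

model = TinyViT(num_classes=600)               # e.g. 500 base classes + 100 new ones
# Fix (c): a higher learning rate for the classifier than for the backbone.
optimizer = torch.optim.SGD(
    [{"params": [p for n, p in model.named_parameters()
                 if not n.startswith("head")], "lr": 0.01},
     {"params": model.head.parameters(), "lr": 0.1}],  # ~10x higher for the head
    momentum=0.9, weight_decay=1e-4)
# Fix (b), balanced finetuning (sketch): after training on each incremental batch
# of new classes, briefly finetune on a class-balanced exemplar set with a small
# learning rate so the classifier is not biased toward the new classes.
```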
Related papers
- Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery [0.0]
Vision Transformers (ViT) have brought a new wave of research in the field of computer vision.
This paper compares three key factors in using (or not using) ViT for semantic segmentation of aerial images.
We show that a novel combined weighted loss function significantly boosts the CNN model's performance compared to transfer learning with ViT.
arXiv Detail & Related papers (2024-11-14T00:18:04Z) - Parameter Efficient Fine-tuning of Self-supervised ViTs without Catastrophic Forgetting [0.5249805590164901]
Post-pre-training and fine-tuning on new tasks can significantly degrade the model's original general abilities.
Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains.
Our experiments reveal that applying either Block Expansion or LoRA to self-supervised pre-trained ViTs surpasses fully fine-tuned ViTs in new domains.
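As a concrete illustration of the kind of parameter-efficient fine-tuning discussed here, the sketch below wraps a frozen pre-trained linear layer with a LoRA-style low-rank update; it is a generic example rather than the paper's implementation, and the rank, scaling, and choice of layer to wrap are assumptions.

```python
# Generic LoRA sketch for a frozen pre-trained linear layer (not the paper's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep pre-trained weights frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a trainable low-rank update: W x + scale * B A x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

# Example: wrap the qkv projection of one attention block in a pre-trained ViT.
qkv = nn.Linear(384, 3 * 384)        # stand-in for a pre-trained projection
qkv_with_lora = LoRALinear(qkv, rank=8)
x = torch.randn(2, 197, 384)         # (batch, tokens, dim)
out = qkv_with_lora(x)               # only lora_A / lora_B receive gradients
```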
arXiv Detail & Related papers (2024-04-26T08:35:46Z) - Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z) - Better plain ViT baselines for ImageNet-1k [100.80574771242937]
It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel on data at the ImageNet-1k scale.
This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models.
arXiv Detail & Related papers (2022-05-03T15:54:44Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
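At the level of detail given in this summary, the idea can be sketched roughly as follows: split an image into low- and high-frequency parts with an FFT mask and perturb only the high-frequency part in an adversarial direction. The sketch below illustrates that general idea and is not the authors' exact HAT procedure; the mask radius and step size are assumptions.

```python
# Rough sketch: split an image into low/high-frequency parts with an FFT mask and
# perturb only the high-frequency part in an adversarial (FGSM-style) direction.
# Illustrative only; not the exact HAT algorithm.
import torch
import torch.nn.functional as F

def split_frequencies(img, radius=16):
    """img: (B, C, H, W). Returns (low_freq, high_freq) with low + high == img."""
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.arange(H, device=img.device, dtype=torch.float32),
        torch.arange(W, device=img.device, dtype=torch.float32),
        indexing="ij")
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    low_mask = (dist <= radius).to(img.dtype)          # keep only low frequencies
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    return low, img - low

def augment_high_freq(model, img, label, eps=2 / 255):
    """One adversarial step applied only to the high-frequency component."""
    low, high = split_frequencies(img)
    high = high.clone().requires_grad_(True)
    loss = F.cross_entropy(model(low + high), label)
    grad, = torch.autograd.grad(loss, high)
    return (low + high + eps * grad.sign()).detach()   # adversarially boosted HF content
```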
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose UP-ViTs, a unified framework for structural pruning of ViT and its variants.
Our method prunes all components of a ViT while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z) - Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces no new components; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, but it relies on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
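The self-distillation objective can be condensed into a few lines, as in the simplified sketch below; the temperatures, teacher momentum, and centering rate shown are illustrative values rather than the official implementation.

```python
# Simplified DINO-style self-distillation: loss, momentum teacher, and output centering.
import torch

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and the student."""
    t = torch.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # stop-gradient
    log_s = torch.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def momentum_update(student, teacher, m=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

@torch.no_grad()
def update_center(center, teacher_out, rate=0.9):
    """Running estimate of the teacher output mean, used to avoid collapse."""
    return center * rate + teacher_out.mean(dim=0) * (1 - rate)
```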
arXiv Detail & Related papers (2021-04-29T12:28:51Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
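As described, re-attention mixes the attention maps of the different heads with a small learnable head-to-head matrix before they are applied to the values. The sketch below illustrates that step; the head count, dimensions, and normalization choice are assumptions rather than the paper's exact configuration.

```python
# Sketch of a re-attention step: attention maps from all heads are mixed by a
# learnable head-to-head matrix before being applied to the values.
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.h = num_heads
        self.d = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(num_heads))  # learnable head-mixing matrix
        self.norm = nn.InstanceNorm2d(num_heads)         # normalization choice is an assumption
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5  # (B, H, N, N) per-head maps
        attn = attn.softmax(dim=-1)
        attn = torch.einsum("hk,bknm->bhnm", self.theta, attn)  # mix across heads
        attn = self.norm(attn)                           # re-normalize the mixed maps
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

x = torch.randn(2, 197, 384)
print(ReAttention()(x).shape)                            # torch.Size([2, 197, 384])
```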
arXiv Detail & Related papers (2021-03-22T14:32:07Z) - Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network [6.938261599173859]
We show how to improve the accuracy and robustness of basic CNN models.
Our proposed assembled ResNet-50 shows improvements in top-1 accuracy from 76.3% to 82.78%, mCE from 76.0% to 48.9% and mFR from 57.7% to 32.3%.
Our approach achieved 1st place in the iFood Competition Fine-Grained Visual Recognition at CVPR 2019.
arXiv Detail & Related papers (2020-01-17T12:42:08Z)