Improving Vision Transformers for Incremental Learning
- URL: http://arxiv.org/abs/2112.06103v1
- Date: Sun, 12 Dec 2021 00:12:33 GMT
- Title: Improving Vision Transformers for Incremental Learning
- Authors: Pei Yu, Yinpeng Chen, Ying Jin, Zicheng Liu
- Abstract summary: This paper studies using Vision Transformers (ViT) in class incremental learning.
ViT converges very slowly when the number of classes is small.
More bias towards new classes is observed in ViT than in CNN-based models.
- Score: 17.276384689286168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies using Vision Transformers (ViT) in class incremental
learning. Surprisingly, naively applying ViT to replace convolutional neural networks
(CNNs) results in performance degradation. Our analysis reveals three issues with
naively using ViT: (a) ViT converges very slowly when the number of classes is small,
(b) ViT exhibits more bias towards new classes than CNN-based models, and (c) the proper
learning rate of ViT is too low to learn a good classifier. Based on this analysis, we
show these issues can be addressed simply with existing techniques: a convolutional stem,
balanced finetuning to correct bias, and a higher learning rate for the classifier. Our
simple solution, named ViTIL (ViT for Incremental Learning), achieves a new
state-of-the-art for all three class incremental learning setups by a clear margin,
providing a strong baseline for the research community. For instance, on ImageNet-1000,
ViTIL achieves 69.20% top-1 accuracy under the protocol of 500 initial classes with 5
incremental steps (100 new classes each), outperforming LUCIR+DDE by 1.69%. Under the
more challenging protocol of 10 incremental steps (100 new classes each), our method
outperforms PODNet by 7.27% (65.13% vs. 57.86%).
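To make the three fixes concrete, the sketch below shows a toy PyTorch version of each: a convolutional stem in place of the single large-stride patch embedding, a separate (roughly 10x higher) learning rate for the classifier head, and balanced finetuning noted as a training-schedule step. This is a minimal illustration rather than the authors' released code; the stand-in backbone, layer sizes, and hyperparameters are assumptions.

```python
# Minimal sketch (not the authors' released code) of the three fixes, assuming
# a small stand-in ViT; all sizes, depths, and learning rates are illustrative.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Fix (a): a stack of stride-2 3x3 convolutions (overall stride 16)
    replacing the single large-stride patch-embedding convolution."""
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        chans = [in_chans, 48, 96, 192, embed_dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # (B, 3, H, W)
        x = self.stem(x)                       # (B, C, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence

class TinyViT(nn.Module):
    """Stand-in backbone: conv stem + a few transformer blocks + linear head."""
    def __init__(self, embed_dim=384, depth=2, num_classes=600):
        super().__init__()
        self.stem = ConvStem(embed_dim=embed_dim)
        block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=6,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.blocks(self.stem(x))
        return self.head(tokens.mean(dim=1))   # mean-pooled tokens -> class logits

model = TinyViT(num_classes=600)               # e.g. 500 base classes + 100 new ones
# Fix (c): a higher learning rate for the classifier than for the backbone.
optimizer = torch.optim.SGD(
    [{"params": [p for n, p in model.named_parameters()
                 if not n.startswith("head")], "lr": 0.01},
     {"params": model.head.parameters(), "lr": 0.1}],  # ~10x higher for the head
    momentum=0.9, weight_decay=1e-4)
# Fix (b), balanced finetuning (sketch): after training on each incremental batch
# of new classes, briefly finetune on a class-balanced exemplar set with a small
# learning rate so the classifier is not biased toward the new classes.
```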
Related papers
- Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery [0.0]
Vision Transformers (ViT) have brought a new wave of research in the field of computer vision.
This paper compares three key factors in using (or not using) ViT for semantic segmentation of aerial images.
We show that a novel combined weighted loss function significantly boosts the CNN model's performance compared to transfer learning with ViT.
arXiv Detail & Related papers (2024-11-14T00:18:04Z) - Parameter Efficient Fine-tuning of Self-supervised ViTs without Catastrophic Forgetting [0.5249805590164901]
Post-pre-training and fine-tuning on new tasks can significantly degrade the model's original general abilities.
Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains.
Our experiments reveal that applying either Block Expansion or LoRA to self-supervised pre-trained ViTs surpasses fully fine-tuned ViTs in new domains.
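As a concrete illustration of the kind of parameter-efficient fine-tuning discussed here, the sketch below wraps a frozen pre-trained linear layer with a LoRA-style low-rank update; it is a generic example rather than the paper's implementation, and the rank, scaling, and choice of layer to wrap are assumptions.

```python
# Generic LoRA sketch for a frozen pre-trained linear layer (not the paper's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep pre-trained weights frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a trainable low-rank update: W x + scale * B A x
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

# Example: wrap the qkv projection of one attention block in a pre-trained ViT.
qkv = nn.Linear(384, 3 * 384)        # stand-in for a pre-trained projection
qkv_with_lora = LoRALinear(qkv, rank=8)
x = torch.randn(2, 197, 384)         # (batch, tokens, dim)
out = qkv_with_lora(x)               # only lora_A / lora_B receive gradients
```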
arXiv Detail & Related papers (2024-04-26T08:35:46Z) - Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z) - Better plain ViT baselines for ImageNet-1k [100.80574771242937]
It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel on data at the ImageNet-1k scale.
This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models.
arXiv Detail & Related papers (2022-05-03T15:54:44Z) - Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
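At the level of detail given in this summary, the idea can be sketched roughly as follows: split an image into low- and high-frequency parts with an FFT mask and perturb only the high-frequency part in an adversarial direction. The sketch below illustrates that general idea and is not the authors' exact HAT procedure; the mask radius and step size are assumptions.

```python
# Rough sketch: split an image into low/high-frequency parts with an FFT mask and
# perturb only the high-frequency part in an adversarial (FGSM-style) direction.
# Illustrative only; not the exact HAT algorithm.
import torch
import torch.nn.functional as F

def split_frequencies(img, radius=16):
    """img: (B, C, H, W). Returns (low_freq, high_freq) with low + high == img."""
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.arange(H, device=img.device, dtype=torch.float32),
        torch.arange(W, device=img.device, dtype=torch.float32),
        indexing="ij")
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    low_mask = (dist <= radius).to(img.dtype)          # keep only low frequencies
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    return low, img - low

def augment_high_freq(model, img, label, eps=2 / 255):
    """One adversarial step applied only to the high-frequency component."""
    low, high = split_frequencies(img)
    high = high.clone().requires_grad_(True)
    loss = F.cross_entropy(model(low + high), label)
    grad, = torch.autograd.grad(loss, high)
    return (low + high + eps * grad.sign()).detach()   # adversarially boosted HF content
```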
arXiv Detail & Related papers (2022-04-03T05:16:51Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose UP-ViTs, a unified framework for structural pruning of ViT and its variants.
Our method prunes all components of a ViT while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z) - Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces no new components; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, but it relies on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
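The self-distillation objective can be condensed into a few lines, as in the simplified sketch below; the temperatures, teacher momentum, and centering rate shown are illustrative values rather than the official implementation.

```python
# Simplified DINO-style self-distillation: loss, momentum teacher, and output centering.
import torch

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a centered, sharpened teacher distribution and the student."""
    t = torch.softmax((teacher_out - center) / tau_t, dim=-1).detach()  # stop-gradient
    log_s = torch.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def momentum_update(student, teacher, m=0.996):
    """Teacher weights are an exponential moving average of the student's."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

@torch.no_grad()
def update_center(center, teacher_out, rate=0.9):
    """Running estimate of the teacher output mean, used to avoid collapse."""
    return center * rate + teacher_out.mean(dim=0) * (1 - rate)
```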
arXiv Detail & Related papers (2021-04-29T12:28:51Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
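As described, re-attention mixes the attention maps of the different heads with a small learnable head-to-head matrix before they are applied to the values. The sketch below illustrates that step; the head count, dimensions, and normalization choice are assumptions rather than the paper's exact configuration.

```python
# Sketch of a re-attention step: attention maps from all heads are mixed by a
# learnable head-to-head matrix before being applied to the values.
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.h = num_heads
        self.d = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.theta = nn.Parameter(torch.eye(num_heads))  # learnable head-mixing matrix
        self.norm = nn.InstanceNorm2d(num_heads)         # normalization choice is an assumption
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5  # (B, H, N, N) per-head maps
        attn = attn.softmax(dim=-1)
        attn = torch.einsum("hk,bknm->bhnm", self.theta, attn)  # mix across heads
        attn = self.norm(attn)                           # re-normalize the mixed maps
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

x = torch.randn(2, 197, 384)
print(ReAttention()(x).shape)                            # torch.Size([2, 197, 384])
```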
arXiv Detail & Related papers (2021-03-22T14:32:07Z) - Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network [6.938261599173859]
We show how to improve the accuracy and robustness of basic CNN models.
Our proposed assembled ResNet-50 shows improvements in top-1 accuracy from 76.3% to 82.78%, mCE from 76.0% to 48.9% and mFR from 57.7% to 32.3%.
Our approach achieved 1st place in the iFood Competition Fine-Grained Visual Recognition at CVPR 2019.
arXiv Detail & Related papers (2020-01-17T12:42:08Z)