Learning Imbalanced Data with Vision Transformers
- URL: http://arxiv.org/abs/2212.02015v1
- Date: Mon, 5 Dec 2022 04:05:32 GMT
- Title: Learning Imbalanced Data with Vision Transformers
- Authors: Zhengzhuo Xu, Ruikang Liu, Shuo Yang, Zenghao Chai, and Chun Yuan
- Abstract summary: We propose LiVT to train Vision Transformers (ViTs) from scratch only with Long-Tailed (LT) data.
With ample and solid evidence, we show that Masked Generative Pretraining (MGP) is more robust than supervised pretraining.
Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs.
- Score: 17.14790664854141
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world data tends to be heavily imbalanced and to severely skew
data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a
massively challenging task. Existing LTR methods seldom train Vision
Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf
pretrained weights of ViTs often lead to unfair comparisons. In this paper, we
systematically investigate the performance of ViTs in LTR and propose LiVT to
train ViTs from scratch only with LT data. Observing that ViTs suffer more
severely from LTR problems, we conduct Masked Generative Pretraining (MGP) to
learn generalized features. With ample and solid evidence, we show that MGP is
more robust than supervised pretraining. In addition, the Binary Cross Entropy
(BCE) loss, which performs conspicuously well with ViTs, runs into difficulties
in LTR. We further propose a balanced BCE to remedy this, with strong
theoretical grounding. Specifically, we derive an unbiased extension of the
Sigmoid and compensate with extra logit margins to deploy it. Our Bal-BCE
contributes to the quick convergence of ViTs in just a few epochs. Extensive
experiments demonstrate that, with MGP and Bal-BCE, LiVT trains ViTs well
without any additional data and significantly outperforms comparable
state-of-the-art methods, e.g., our ViT-B achieves 81.0% Top-1 accuracy on
iNaturalist 2018 without bells and whistles. Code is available at
https://github.com/XuZhengzhuo/LiVT.
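A minimal sketch of the logit-margin idea, in the spirit of Bal-BCE: each class logit is shifted by a margin derived from the class prior before the sigmoid, so rare classes must earn larger raw logits during training. This sketch assumes the margin is simply the log class prior (as in the related logit-adjustment / balanced-softmax literature); the exact margin derived in the paper may differ, and `balanced_bce_loss` and its arguments are hypothetical names, not the repository's API.

```python
import torch
import torch.nn.functional as F


def balanced_bce_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      class_counts: torch.Tensor) -> torch.Tensor:
    """Logit-adjusted BCE sketch (assumed margin: log class prior).

    logits:       (batch, num_classes) raw scores from the classifier head
    targets:      (batch, num_classes) one-hot / multi-hot labels
    class_counts: (num_classes,) training sample count per class
    """
    prior = class_counts.float() / class_counts.sum()   # pi_j
    margin = torch.log(prior + 1e-12)                   # log pi_j, more negative for rare classes
    adjusted = logits + margin.unsqueeze(0)             # broadcast per-class margin over the batch
    return F.binary_cross_entropy_with_logits(adjusted, targets)


# Toy usage on a long-tailed 3-class split (counts are illustrative only).
counts = torch.tensor([1000, 100, 10])
logits = torch.randn(4, 3)
labels = F.one_hot(torch.tensor([0, 1, 2, 0]), num_classes=3).float()
print(balanced_bce_loss(logits, labels, counts).item())
```

At inference, the raw (un-shifted) logits are used, which is what counteracts the head-class bias learned from the skewed label distribution.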
Related papers
- DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets [30.178427266135756]
Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks.
ViT requires a large amount of data for pre-training.
We introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets.
arXiv Detail & Related papers (2024-04-03T17:58:21Z)
- Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders [32.2455570714414]
Vision Transformers (ViTs) have become ubiquitous in computer vision.
ViTs lack inductive biases, which can make it difficult to train them with limited data.
We propose a technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks.
arXiv Detail & Related papers (2023-10-31T17:59:07Z)
- Rethink Long-tailed Recognition with Vision Transformers [18.73285611631722]
Vision Transformers (ViT) are hard to train with long-tailed data.
ViT learns generalized features in an unsupervised manner.
Predictive Distribution Calibration (PDC) is proposed as a novel metric for Long-Tailed Recognition.
arXiv Detail & Related papers (2023-02-28T03:36:48Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z)
- Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training [29.20567759071523]
Vision Transformers (ViTs) are developing rapidly and starting to challenge the domination of convolutional neural networks (CNNs) in computer vision.
This paper introduces CNNs' inductive biases back into ViTs while preserving their network architectures, aiming for a higher performance upper bound.
Experiments on CIFAR-10/100 and ImageNet-1k with limited training data have shown encouraging results.
arXiv Detail & Related papers (2021-12-07T07:56:50Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Chasing Sparsity in Vision Transformers: An End-to-End Exploration [127.10054032751714]
Vision transformers (ViTs) have recently gained explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget.
arXiv Detail & Related papers (2021-06-08T17:18:00Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.