Effective Vision Transformer Training: A Data-Centric Perspective
- URL: http://arxiv.org/abs/2209.15006v1
- Date: Thu, 29 Sep 2022 17:59:46 GMT
- Title: Effective Vision Transformer Training: A Data-Centric Perspective
- Authors: Benjia Zhou and Pichao Wang and Jun Wan and Yanyan Liang and Fan Wang
- Abstract summary: Vision Transformers (ViTs) have shown promising performance compared with Convolutional Neural Networks (CNNs)
In this paper, we define several metrics, including Dynamic Data Proportion (DDP) and Knowledge Assimilation Rate (KAR)
We propose a novel data-centric ViT training framework to dynamically measure the "difficulty" of training samples and generate "effective" samples for models at different training stages.
- Score: 24.02488085447691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have shown promising performance compared with
Convolutional Neural Networks (CNNs), but training ViTs is much harder than
training CNNs. In this paper, we define several metrics, including Dynamic Data
Proportion (DDP) and Knowledge Assimilation Rate (KAR), to investigate the
training process, and divide it into three periods accordingly: formation,
growth and exploration. In particular, at the last stage of training, we
observe that only a tiny portion of training examples is used to optimize the
model. Given the data-hungry nature of ViTs, we thus ask a simple but important
question: is it possible to provide abundant "effective" training examples at
EVERY stage of training? To address this issue, we need to answer two critical
questions, i.e., how to measure the "effectiveness" of individual training
examples, and how to systematically generate a sufficient number of "effective"
examples when they run out. To answer the first question, we find that the
"difficulty" of training samples can serve as an indicator of their
"effectiveness". To address the second question, we propose to dynamically
adjust the "difficulty" distribution of the training data across these
evolution stages. To achieve both goals, we propose a novel data-centric ViT
training framework that dynamically measures the "difficulty" of training
samples and generates "effective" samples for models at different training
stages. Furthermore, to enlarge the number of "effective" samples and alleviate
the overfitting problem in the late training stage of ViTs, we propose a
patch-level erasing strategy dubbed PatchErasing. Extensive experiments
demonstrate the effectiveness of the proposed data-centric ViT training
framework and techniques.
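The abstract names PatchErasing but gives no implementation details. Below is a minimal sketch of a patch-level erasing augmentation, assuming a ViT-style non-overlapping patch grid, a constant fill value, and a fixed erase ratio; all of these choices are hypothetical and not specified in the abstract.

```python
import torch

def patch_erasing(images, patch_size=16, erase_ratio=0.25, fill=0.0):
    """Randomly erase a fraction of non-overlapping patches in each image.

    A minimal sketch of a patch-level erasing augmentation; the exact
    procedure in the paper (patch selection rule, fill values, schedule
    over training) is not described in the abstract.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size          # patch grid
    n_patches = gh * gw
    n_erase = int(erase_ratio * n_patches)

    out = images.clone()
    for i in range(b):
        # choose patches to erase independently per image
        idx = torch.randperm(n_patches)[:n_erase]
        for p in idx.tolist():
            row, col = divmod(p, gw)
            y0, x0 = row * patch_size, col * patch_size
            out[i, :, y0:y0 + patch_size, x0:x0 + patch_size] = fill
    return out

# usage: augment a batch before feeding it to the ViT
batch = torch.randn(8, 3, 224, 224)
augmented = patch_erasing(batch, patch_size=16, erase_ratio=0.25)
```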
Related papers
- BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping [64.8477128397529]
We propose a test-time adaptation framework that draws on both training-required and training-free approaches.
We maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples.
We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets.
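The summary only names the light-weight key-value memory used for feature retrieval. A minimal sketch of such a cache, assuming L2-normalized image features as keys, one-hot pseudo-labels as values, and an exponential cosine-similarity affinity (all assumptions; the exact BoostAdapter formulation is not given here), could look like this.

```python
import torch
import torch.nn.functional as F

class KeyValueMemory:
    """Light-weight key-value cache: keys are normalized features, values are
    one-hot (pseudo-)labels. Retrieval weights stored values by an exponential
    cosine-similarity affinity. A generic sketch of the cache pattern the
    summary describes, not the exact BoostAdapter method.
    """
    def __init__(self, num_classes, beta=5.0):
        self.keys, self.values = [], []
        self.num_classes = num_classes
        self.beta = beta

    def add(self, feature, label):
        self.keys.append(F.normalize(feature, dim=-1))
        self.values.append(F.one_hot(label, self.num_classes).float())

    def retrieve(self, query):
        if not self.keys:
            return torch.zeros(query.shape[0], self.num_classes)
        k = torch.stack(self.keys)                           # (N, D)
        v = torch.stack(self.values)                         # (N, C)
        q = F.normalize(query, dim=-1)                       # (B, D)
        affinity = torch.exp(-self.beta * (1.0 - q @ k.T))   # (B, N)
        return affinity @ v                                  # cache logits (B, C)

# usage: populate from historical / boosting samples, then query at test time
mem = KeyValueMemory(num_classes=10)
mem.add(torch.randn(512), torch.tensor(3))
logits = mem.retrieve(torch.randn(2, 512))
```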
arXiv Detail & Related papers (2024-10-20T15:58:43Z) - SwiftLearn: A Data-Efficient Training Method of Deep Learning Models
using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach to accelerate training of deep learning models.
SwiftLearn trains on a subset of the data, selected according to an importance criterion measured over the entire dataset during the warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
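The summary describes selecting a subset via importance scores computed during warm-up. A minimal sketch, assuming per-sample loss under the warmed-up model as the importance criterion and a fixed keep ratio (both assumptions; the paper's actual criterion is not stated here), is shown below.

```python
import torch

def select_important_subset(model, loss_fn, dataset, keep_ratio=0.1, device="cpu"):
    """Rank all samples by an importance score (here: per-sample loss under the
    warmed-up model, an assumed criterion) and keep the top fraction.
    Returns indices into `dataset` for the reduced training set.
    """
    model.eval()
    scores = []
    loader = torch.utils.data.DataLoader(dataset, batch_size=256)
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            per_sample_loss = loss_fn(model(x), y)    # loss_fn uses reduction='none'
            scores.append(per_sample_loss.cpu())
    scores = torch.cat(scores)
    n_keep = max(1, int(keep_ratio * len(scores)))
    return torch.topk(scores, n_keep).indices          # most "important" samples

# usage (hypothetical), after the warm-up epochs:
# loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
# keep_idx = select_important_subset(model, loss_fn, train_set, keep_ratio=0.1)
# reduced_set = torch.utils.data.Subset(train_set, keep_idx.tolist())
```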
arXiv Detail & Related papers (2023-11-25T22:51:01Z) - Experts Weights Averaging: A New General Training Scheme for Vision
Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT architecture for inference.
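The conversion step is stated directly in the summary: average the experts back into a single FFN. A minimal sketch, assuming each expert is a standard two-layer ViT FFN with identical shapes (other details of the MoE design are not given), could be:

```python
import copy
import torch
import torch.nn as nn

def average_experts_to_ffn(experts):
    """Collapse a list of identically shaped expert FFNs into one FFN whose
    parameters are the element-wise mean of the experts' parameters.
    Sketch of the post-training MoE -> FFN conversion described in the summary.
    """
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(e.named_parameters())[name] for e in experts])
            param.copy_(stacked.mean(dim=0))
    return merged

# usage: three hypothetical experts, each a standard ViT FFN (Linear-GELU-Linear)
make_ffn = lambda: nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
experts = [make_ffn() for _ in range(3)]
ffn = average_experts_to_ffn(experts)   # drop-in replacement for inference
```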
arXiv Detail & Related papers (2023-08-11T12:05:12Z) - Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [119.70303730341938]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta-training and invErsion calibRation following inner loop (ICFIL) during meta-testing.
ECI adaptively increases the difficulty of pseudo episodes according to real-time feedback from the meta model.
We formulate the optimization process of meta-training with ECI as an adversarial process in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z) - Large Deviations for Accelerating Neural Networks Training [5.864710987890994]
We propose the LAD Improved Iterative Training (LIIT), a novel training approach for ANNs using the large deviations principle.
The LIIT approach uses a Modified Training Sample (MTS) that is generated and iteratively updated using a sampling strategy based on LAD anomaly scores.
The MTS is designed to be well representative of the training data by including the most anomalous observations in each class.
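The summary says the MTS keeps the most anomalous observations in each class. A minimal sketch, assuming a generic per-sample anomaly score (a simple distance-from-class-mean score stands in for the paper's LAD-based score) and a per-class quota, could look like:

```python
import numpy as np

def build_modified_training_sample(labels, scores, per_class=100):
    """Select, for each class, the samples with the highest anomaly scores.
    `scores` stands in for the paper's LAD-based anomaly score (assumed here).
    Returns indices of the Modified Training Sample (MTS).
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    keep = []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        order = cls_idx[np.argsort(-scores[cls_idx])]   # most anomalous first
        keep.extend(order[:per_class].tolist())
    return np.array(keep)

# usage with toy data and a distance-from-class-mean score as a stand-in
X = np.random.randn(1000, 16)
y = np.random.randint(0, 5, size=1000)
score = np.array([np.linalg.norm(X[i] - X[y == y[i]].mean(axis=0)) for i in range(len(X))])
mts_idx = build_modified_training_sample(y, score, per_class=50)
```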
arXiv Detail & Related papers (2023-03-02T04:14:05Z) - Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z) - Understanding new tasks through the lens of training data via
exponential tilting [43.33775132139584]
We consider the problem of reweighing the training samples to gain insights into the distribution of the target task.
We formulate a distribution shift model based on the exponential tilt assumption and learn train data importance weights.
The learned train data weights can then be used for downstream tasks such as target performance evaluation, fine-tuning, and model selection.
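The summary states that train-data importance weights are learned under an exponential tilt assumption, i.e. the target density is the source density reweighted by exp(theta^T f(x)). A minimal sketch, assuming the tilt is estimated with a source-vs-target logistic classifier (a common density-ratio surrogate, not necessarily the paper's estimator), is given below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def exponential_tilt_weights(source_feats, target_feats):
    """Estimate importance weights w(x) proportional to exp(theta^T f(x)) for the
    source samples, using a source-vs-target logistic classifier as a stand-in
    estimator of the tilt parameter theta (an assumption, not the paper's method).
    """
    X = np.vstack([source_feats, target_feats])
    y = np.concatenate([np.zeros(len(source_feats)), np.ones(len(target_feats))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # the log-odds of "target" plays the role of theta^T f(x) up to a constant
    logits = clf.decision_function(source_feats)
    w = np.exp(logits)
    return w / w.mean()          # normalize so the average weight is 1

# usage: reweight a source training set toward a shifted target distribution
src = np.random.randn(500, 8)
tgt = np.random.randn(300, 8) + 0.5
weights = exponential_tilt_weights(src, tgt)   # pass as sample_weight downstream
```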
arXiv Detail & Related papers (2022-05-26T18:38:43Z) - STraTA: Self-Training with Task Augmentation for Better Few-shot
Learning [77.04780470527432]
We propose STraTA, which stands for Self-Training with Task Augmentation.
Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks.
Our analyses reveal that task augmentation and self-training are both complementary and independently effective.
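The summary combines self-training with task augmentation. A minimal sketch of the self-training half alone, assuming a generic classifier with fit/predict_proba and a confidence threshold for pseudo-labels (the task-augmentation step and STraTA's actual models are omitted), might look like this.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def self_train(base_model, X_lab, y_lab, X_unlab, rounds=3, threshold=0.9):
    """Iteratively pseudo-label confident unlabeled examples and retrain.
    A generic self-training loop; STraTA additionally uses task augmentation
    to create auxiliary training data, which is not reproduced here.
    """
    X_train, y_train = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        model = clone(base_model).fit(X_train, y_train)
        proba = model.predict_proba(X_unlab)
        conf, pseudo = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= threshold
        if not keep.any():
            break
        X_train = np.vstack([X_lab, X_unlab[keep]])
        y_train = np.concatenate([y_lab, pseudo[keep]])
    return model

# usage with toy few-shot data
X_lab = np.random.randn(20, 5)
y_lab = np.array([0] * 10 + [1] * 10)
X_unlab = np.random.randn(200, 5)
final_model = self_train(LogisticRegression(max_iter=1000), X_lab, y_lab, X_unlab)
```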
arXiv Detail & Related papers (2021-09-13T19:14:01Z) - Dynamic Curriculum Learning for Low-Resource Neural Machine Translation [27.993407441922507]
We investigate the effective use of training data for low-resource NMT.
In particular, we propose a dynamic curriculum learning (DCL) method to reorder training samples in training.
This eases training by highlighting easy samples that the current model is competent enough to learn.
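The summary describes restricting training to easy samples within the model's current competence. A minimal sketch, assuming a scalar difficulty score per sample in [0, 1] and a competence value that grows with training progress (a square-root schedule is assumed here, as in common competence-based curricula; the paper's exact definitions are not given), is below.

```python
import numpy as np

def curriculum_batch(difficulty, step, total_steps, batch_size=32, c0=0.1):
    """Sample a batch only from examples whose difficulty is within the model's
    current competence. Competence grows from c0 to 1 over training following an
    assumed square-root schedule.
    """
    competence = min(1.0, np.sqrt(c0**2 + (1 - c0**2) * step / total_steps))
    eligible = np.where(difficulty <= competence)[0]
    return np.random.choice(eligible, size=min(batch_size, len(eligible)), replace=False)

# usage: difficulty could come from, e.g., normalized sentence length or loss rank
difficulty = np.random.rand(10000)
early = curriculum_batch(difficulty, step=100, total_steps=50000)    # mostly easy samples
late = curriculum_batch(difficulty, step=49000, total_steps=50000)   # nearly all eligible
```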
arXiv Detail & Related papers (2020-11-30T08:13:41Z) - Efficient Deep Representation Learning by Adaptive Latent Space Sampling [16.320898678521843]
Supervised deep learning requires a large amount of training samples with annotations, which are expensive and time-consuming to obtain.
We propose a novel training framework which adaptively selects informative samples that are fed to the training process.
arXiv Detail & Related papers (2020-03-19T22:17:02Z) - Fine-Tuning Pretrained Language Models: Weight Initializations, Data
Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
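The study varies only the random seed across many fine-tuning runs and tracks how the best validation score grows with the number of trials. A minimal sketch of that protocol, assuming a hypothetical user-supplied `finetune_and_eval` routine (not part of the paper), could be:

```python
import random
import numpy as np
import torch

def seed_sweep(finetune_and_eval, seeds):
    """Run the same fine-tuning recipe once per seed and report the running best
    validation score as a function of the number of trials.
    `finetune_and_eval(seed) -> float` is a hypothetical routine that sets up
    weight initialization and data order from the seed and returns val accuracy.
    """
    scores = []
    for seed in seeds:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        scores.append(finetune_and_eval(seed))
    running_best = np.maximum.accumulate(scores)
    return scores, running_best

# usage with a stand-in evaluation function
demo = lambda seed: 0.85 + 0.05 * np.random.rand()
scores, best_so_far = seed_sweep(demo, seeds=range(20))
```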
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.