A Data Cartography based MixUp for Pre-trained Language Models
- URL: http://arxiv.org/abs/2205.03403v1
- Date: Fri, 6 May 2022 17:59:19 GMT
- Title: A Data Cartography based MixUp for Pre-trained Language Models
- Authors: Seo Yeon Park and Cornelia Caragea
- Abstract summary: MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels.
We propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples.
We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model, BERT, on both in-domain and out-of-domain settings in a wide range of NLP tasks.
- Score: 47.90235939359225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: MixUp is a data augmentation strategy where additional samples are generated
during training by combining random pairs of training samples and their labels.
However, selecting random pairs is not necessarily an optimal choice. In this
work, we propose TDMixUp, a novel MixUp strategy that leverages Training
Dynamics and allows more informative samples to be combined for generating new
data samples. Our proposed TDMixUp first measures confidence and variability
(Swayamdipta et al., 2020), and Area Under the Margin (AUM) (Pleiss et al.,
2020) to identify the characteristics of training samples (e.g., as
easy-to-learn or ambiguous samples), and then interpolates these characterized
samples. We empirically validate that our method not only achieves competitive
performance using a smaller subset of the training data compared with strong
baselines, but also yields lower expected calibration error on the pre-trained
language model, BERT, on both in-domain and out-of-domain settings in a wide
range of NLP tasks. We publicly release our code.
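For concreteness, the pipeline the abstract describes can be sketched as: (1) compute per-example confidence and variability from the model's gold-label probabilities across training epochs (data cartography), (2) compute AUM as the average margin between the gold logit and the largest non-gold logit, (3) keep samples whose statistics fall in the regions of interest, and (4) apply standard MixUp interpolation to pairs drawn from that subset. The sketch below is not the authors' released code; the thresholds, the selection rule, and the Beta parameter `alpha` are illustrative assumptions.
```python
# Minimal sketch of the ingredients named in the abstract: data-cartography
# statistics (confidence, variability; Swayamdipta et al., 2020), Area Under
# the Margin (AUM; Pleiss et al., 2020), and MixUp over the selected samples.
# Thresholds, the selection rule, and alpha are illustrative assumptions.
import numpy as np


def training_dynamics(gold_probs):
    """gold_probs: [epochs, n] probabilities of the gold label per epoch.
    Returns per-example confidence (mean) and variability (std)."""
    return gold_probs.mean(axis=0), gold_probs.std(axis=0)


def area_under_margin(logits, labels):
    """logits: [epochs, n, classes] float array; labels: [n] gold class ids.
    AUM = mean over epochs of (gold logit - largest non-gold logit)."""
    epochs, n, _ = logits.shape
    gold = logits[:, np.arange(n), labels]            # [epochs, n]
    masked = logits.copy()
    masked[:, np.arange(n), labels] = -np.inf
    runner_up = masked.max(axis=2)                    # [epochs, n]
    return (gold - runner_up).mean(axis=0)            # [n]


def select_informative(confidence, variability, aum,
                       conf_thresh=0.75, var_thresh=0.25):
    """Illustrative rule: keep easy-to-learn samples (high confidence, low
    variability) and ambiguous samples (high variability), dropping samples
    with negative AUM (likely mislabeled)."""
    easy = (confidence >= conf_thresh) & (variability < var_thresh)
    ambiguous = variability >= var_thresh
    return np.where((easy | ambiguous) & (aum >= 0.0))[0]


def mixup(features, labels_onehot, idx, alpha=0.4, seed=0):
    """Standard MixUp over pairs drawn from the selected indices.
    features could be, e.g., BERT hidden representations of the inputs."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(idx)
    x_mix = lam * features[idx] + (1.0 - lam) * features[perm]
    y_mix = lam * labels_onehot[idx] + (1.0 - lam) * labels_onehot[perm]
    return x_mix, y_mix
```
In the paper's setting the interpolated features would come from BERT; the precise pairing of the characterized regions (e.g., combining easy-to-learn with ambiguous samples) follows the paper and may differ from the illustrative rule above.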
Related papers
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study whether model performance can be predicted from the mixture proportions of training data via functional forms.
We propose the nested use of scaling laws for training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- Balanced Data Sampling for Language Model Training with Clustering [96.46042695333655]
We propose ClusterClip Sampling to balance the text distribution of training data for better model training.
Extensive experiments validate the effectiveness of ClusterClip Sampling.
arXiv Detail & Related papers (2024-02-22T13:20:53Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Debiased Sample Selection for Combating Noisy Labels [24.296451733127956]
We propose a noIse-Tolerant Expert Model (ITEM) for debiased learning in sample selection.
Specifically, to mitigate the training bias, we design a robust network architecture that integrates multiple experts.
By training on the mixture of two class-discriminative mini-batches, the model mitigates the effect of the imbalanced training set.
arXiv Detail & Related papers (2024-01-24T10:37:28Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up (a minimal sketch of this interpolation appears after this list).
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- DE-CROP: Data-efficient Certified Robustness for Pretrained Classifiers [21.741026088202126]
We propose a novel way to certify the robustness of pretrained models using only a few training samples.
Our proposed approach generates class-boundary and interpolated samples corresponding to each training sample.
We obtain significant improvements over the baseline on multiple benchmark datasets and also report similar performance under the challenging black box setup.
arXiv Detail & Related papers (2022-10-17T10:41:18Z)
- SMILE: Self-Distilled MIxup for Efficient Transfer LEarning [42.59451803498095]
In this work, we propose SMILE - Self-Distilled Mixup for EffIcient Transfer LEarning.
With mixed images as inputs, SMILE regularizes the outputs of CNN feature extractors to learn from the mixed feature vectors of inputs.
The triple regularizer balances the mixup effects in both feature and label spaces while bounding the linearity in-between samples for pre-training tasks.
arXiv Detail & Related papers (2021-03-25T16:02:21Z)
- DST: Data Selection and joint Training for Learning with Noisy Labels [11.0375827306207]
We propose a Data Selection and joint Training (DST) method to automatically select training samples with accurate annotations.
In each iteration, the correctly labeled and the predicted labels are reweighted, respectively, by the probabilities from the mixture model.
Experiments on CIFAR-10, CIFAR-100 and Clothing1M demonstrate that DST is comparable or superior to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-01T07:23:58Z)
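The instance-specific label smoothing described in the Self-Evolution Learning entry above reduces to a per-sample linear interpolation between the model's predicted distribution and the one-hot gold label. A minimal sketch under that reading follows; the smoothing weight `epsilon` and the function names are illustrative assumptions, not details taken from that paper.
```python
import numpy as np


def instance_specific_smooth(probs, labels, num_classes, epsilon=0.1):
    """Blend each sample's predicted distribution with its one-hot label to
    produce soft labels for mixup. probs: [n, num_classes] model outputs;
    labels: [n] gold class ids; epsilon is an illustrative smoothing weight."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - epsilon) * one_hot + epsilon * probs


def mixup_labels(soft_a, soft_b, lam):
    """Standard MixUp applied to the smoothed (soft) labels."""
    return lam * soft_a + (1.0 - lam) * soft_b
```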
This list is automatically generated from the titles and abstracts of the papers in this site.