A Data Cartography based MixUp for Pre-trained Language Models
- URL: http://arxiv.org/abs/2205.03403v1
- Date: Fri, 6 May 2022 17:59:19 GMT
- Title: A Data Cartography based MixUp for Pre-trained Language Models
- Authors: Seo Yeon Park and Cornelia Caragea
- Abstract summary: MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels.
We propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples.
We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model, BERT, on both in-domain and out-of-domain settings in a wide range of NLP tasks.
- Score: 47.90235939359225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: MixUp is a data augmentation strategy where additional samples are generated
during training by combining random pairs of training samples and their labels.
However, selecting random pairs is not necessarily an optimal choice. In this
work, we propose TDMixUp, a novel MixUp strategy that leverages Training
Dynamics and allows more informative samples to be combined for generating new
data samples. Our proposed TDMixUp first measures confidence and variability
(Swayamdipta et al., 2020), and Area Under the Margin (AUM) (Pleiss et al.,
2020) to identify the characteristics of training samples (e.g., as
easy-to-learn or ambiguous samples), and then interpolates these characterized
samples. We empirically validate that our method not only achieves competitive
performance using a smaller subset of the training data compared with strong
baselines, but also yields lower expected calibration error on the pre-trained
language model, BERT, on both in-domain and out-of-domain settings in a wide
range of NLP tasks. We publicly release our code.
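For concreteness, the pipeline the abstract describes can be sketched as: (1) compute per-example confidence and variability from the model's gold-label probabilities across training epochs (data cartography), (2) compute AUM as the average margin between the gold logit and the largest non-gold logit, (3) keep samples whose statistics fall in the regions of interest, and (4) apply standard MixUp interpolation to pairs drawn from that subset. The sketch below is not the authors' released code; the thresholds, the selection rule, and the Beta parameter `alpha` are illustrative assumptions.
```python
# Minimal sketch of the ingredients named in the abstract: data-cartography
# statistics (confidence, variability; Swayamdipta et al., 2020), Area Under
# the Margin (AUM; Pleiss et al., 2020), and MixUp over the selected samples.
# Thresholds, the selection rule, and alpha are illustrative assumptions.
import numpy as np


def training_dynamics(gold_probs):
    """gold_probs: [epochs, n] probabilities of the gold label per epoch.
    Returns per-example confidence (mean) and variability (std)."""
    return gold_probs.mean(axis=0), gold_probs.std(axis=0)


def area_under_margin(logits, labels):
    """logits: [epochs, n, classes] float array; labels: [n] gold class ids.
    AUM = mean over epochs of (gold logit - largest non-gold logit)."""
    epochs, n, _ = logits.shape
    gold = logits[:, np.arange(n), labels]            # [epochs, n]
    masked = logits.copy()
    masked[:, np.arange(n), labels] = -np.inf
    runner_up = masked.max(axis=2)                    # [epochs, n]
    return (gold - runner_up).mean(axis=0)            # [n]


def select_informative(confidence, variability, aum,
                       conf_thresh=0.75, var_thresh=0.25):
    """Illustrative rule: keep easy-to-learn samples (high confidence, low
    variability) and ambiguous samples (high variability), dropping samples
    with negative AUM (likely mislabeled)."""
    easy = (confidence >= conf_thresh) & (variability < var_thresh)
    ambiguous = variability >= var_thresh
    return np.where((easy | ambiguous) & (aum >= 0.0))[0]


def mixup(features, labels_onehot, idx, alpha=0.4, seed=0):
    """Standard MixUp over pairs drawn from the selected indices.
    features could be, e.g., BERT hidden representations of the inputs."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(idx)
    x_mix = lam * features[idx] + (1.0 - lam) * features[perm]
    y_mix = lam * labels_onehot[idx] + (1.0 - lam) * labels_onehot[perm]
    return x_mix, y_mix
```
In the paper's setting the interpolated features would come from BERT; the precise pairing of the characterized regions (e.g., combining easy-to-learn with ambiguous samples) follows the paper and may differ from the illustrative rule above.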
Related papers
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study whether model performance can be predicted from the mixture proportions of training data via functional forms.
We propose the nested use of scaling laws for training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- Balanced Data Sampling for Language Model Training with Clustering [96.46042695333655]
We propose ClusterClip Sampling to balance the text distribution of training data for better model training.
Extensive experiments validate the effectiveness of ClusterClip Sampling.
arXiv Detail & Related papers (2024-02-22T13:20:53Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Debiased Sample Selection for Combating Noisy Labels [24.296451733127956]
We propose a noIse-Tolerant Expert Model (ITEM) for debiased learning in sample selection.
Specifically, to mitigate the training bias, we design a robust network architecture that integrates multiple experts.
By training on the mixture of two class-discriminative mini-batches, the model mitigates the effect of the imbalanced training set.
arXiv Detail & Related papers (2024-01-24T10:37:28Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up (a minimal sketch of this interpolation appears after this list).
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- DE-CROP: Data-efficient Certified Robustness for Pretrained Classifiers [21.741026088202126]
We propose a novel way to certify the robustness of pretrained models using only a few training samples.
Our proposed approach generates class-boundary and interpolated samples corresponding to each training sample.
We obtain significant improvements over the baseline on multiple benchmark datasets and also report similar performance under the challenging black box setup.
arXiv Detail & Related papers (2022-10-17T10:41:18Z)
- SMILE: Self-Distilled MIxup for Efficient Transfer LEarning [42.59451803498095]
In this work, we propose SMILE - Self-Distilled Mixup for EffIcient Transfer LEarning.
With mixed images as inputs, SMILE regularizes the outputs of CNN feature extractors to learn from the mixed feature vectors of inputs.
The triple regularizer balances the mixup effects in both feature and label spaces while bounding the linearity in-between samples for pre-training tasks.
arXiv Detail & Related papers (2021-03-25T16:02:21Z)
- DST: Data Selection and joint Training for Learning with Noisy Labels [11.0375827306207]
We propose a Data Selection and joint Training (DST) method to automatically select training samples with accurate annotations.
In each iteration, the correctly labeled and the predicted labels are reweighted, respectively, by the probabilities from the mixture model.
Experiments on CIFAR-10, CIFAR-100 and Clothing1M demonstrate that DST is comparable or superior to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-01T07:23:58Z)
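The instance-specific label smoothing described in the Self-Evolution Learning entry above reduces to a per-sample linear interpolation between the model's predicted distribution and the one-hot gold label. A minimal sketch under that reading follows; the smoothing weight `epsilon` and the function names are illustrative assumptions, not details taken from that paper.
```python
import numpy as np


def instance_specific_smooth(probs, labels, num_classes, epsilon=0.1):
    """Blend each sample's predicted distribution with its one-hot label to
    produce soft labels for mixup. probs: [n, num_classes] model outputs;
    labels: [n] gold class ids; epsilon is an illustrative smoothing weight."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - epsilon) * one_hot + epsilon * probs


def mixup_labels(soft_a, soft_b, lam):
    """Standard MixUp applied to the smoothed (soft) labels."""
    return lam * soft_a + (1.0 - lam) * soft_b
```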
This list is automatically generated from the titles and abstracts of the papers in this site.