RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks
- URL: http://arxiv.org/abs/2405.17938v2
- Date: Fri, 16 Aug 2024 02:43:59 GMT
- Title: RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks
- Authors: Seong-Hyeon Hwang, Minsu Kim, Steven Euijong Whang
- Abstract summary: We study the problem of robust data augmentation for regression tasks in the presence of noisy data.
C-Mixup is more selective about which samples to mix, pairing them by label distance for better regression performance.
We propose RC-Mixup, which tightly integrates C-Mixup with multi-round robust training methods for a synergistic effect.
- Score: 27.247270530020664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of robust data augmentation for regression tasks in the presence of noisy data. Data augmentation is essential for generalizing deep learning models, but most techniques, like the popular Mixup, are primarily designed for classification tasks on image data. Recently, Mixup techniques specialized for regression tasks, such as C-Mixup, have also been proposed. Whereas Mixup takes linear interpolations of pairs of samples, C-Mixup is more selective about which samples to mix, based on their label distances, for better regression performance. However, C-Mixup does not distinguish noisy from clean samples, which can be problematic when mixing and lead to suboptimal model performance. At the same time, robust training, whose goal is to train accurate models against noisy data through multiple rounds of model training, has been heavily studied. We thus propose our data augmentation strategy RC-Mixup, which tightly integrates C-Mixup with multi-round robust training methods for a synergistic effect. In particular, C-Mixup improves robust training in identifying clean data, while robust training provides cleaner data to C-Mixup for it to perform better. A key advantage of RC-Mixup is that it is data-centric: the robust model training algorithm itself does not need to be modified and simply benefits from data mixing. We show in our experiments that RC-Mixup significantly outperforms C-Mixup and robust training baselines on noisy data benchmarks and can be integrated with various robust training methods.
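The abstract describes a training loop rather than a new optimizer, so a minimal sketch captures the shape of the method. In the sketch below, `mix_batch` (a C-Mixup-style mixer), `train_round` (one round of any robust trainer), and `estimate_clean_set` (that trainer's clean-sample estimate, e.g., small-loss selection) are hypothetical placeholders, not names from the paper:

```python
import numpy as np

def rc_mixup_train(X, y, mix_batch, train_round, estimate_clean_set, rounds=5):
    """Sketch of the RC-Mixup interplay: the robust trainer flags likely
    clean samples, and the regression mixer only mixes within that set."""
    clean = np.arange(len(X))                         # round 0: trust every sample
    model = None
    for _ in range(rounds):
        X_mix, y_mix = mix_batch(X[clean], y[clean])  # augment clean data only
        model = train_round(model, X_mix, y_mix)      # trainer stays unmodified
        clean = estimate_clean_set(model, X, y)       # refresh the clean set
    return model
```

The data-centric claim is visible in the loop: the robust trainer is called unchanged, and only the batches it receives are altered.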
Related papers
- RegMix: Data Mixture as Regression for Language Model Pre-training [40.45464495981735]
We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task.
RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance.
Our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi.
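Since the summary already states the full recipe (probe runs, then regression, then selection), a short sketch can make it concrete. Here `train_small_proxy` is a hypothetical callable that trains a small model on a given mixture and returns its validation loss, and the gradient-boosted regressor is an illustrative choice rather than the paper's exact model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def regmix_search(n_domains, train_small_proxy, n_probe=64,
                  n_candidates=100_000, rng=np.random.default_rng(0)):
    """Fit mixture weights -> proxy loss, then return the candidate
    mixture the fitted regressor predicts to perform best."""
    # 1) Train small proxy models on random Dirichlet mixtures.
    probes = rng.dirichlet(np.ones(n_domains), size=n_probe)
    losses = np.array([train_small_proxy(w) for w in probes])
    # 2) Regress validation loss on mixture weights.
    reg = GradientBoostingRegressor().fit(probes, losses)
    # 3) Score many unseen candidates; keep the predicted-best mixture.
    candidates = rng.dirichlet(np.ones(n_domains), size=n_candidates)
    return candidates[np.argmin(reg.predict(candidates))]
```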
arXiv Detail & Related papers (2024-07-01T17:31:03Z)
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study how predictable model performance is as a function of the mixture proportions.
We propose nesting the scaling laws of training steps and model sizes with our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama.
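As a rough illustration of predicting performance "in function forms", the sketch below fits one plausible exponential law of loss in the mixture proportions; both the functional form and its parameterization are assumptions for illustration, not necessarily the paper's exact law:

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_mixing_law(r, c, k, *t):
    """Candidate law: loss(r) = c + k * exp(t . r) for mixture proportions r."""
    return c + k * np.exp(np.asarray(t) @ np.asarray(r).T)

def fit_mixing_law(r_obs, loss_obs):
    """r_obs: (n_runs, n_domains) mixture proportions; loss_obs: observed losses."""
    p0 = [1.0, 1.0] + [0.0] * r_obs.shape[1]   # init for c, k, t_1..t_d
    params, _ = curve_fit(exp_mixing_law, r_obs, loss_obs, p0=p0, maxfev=20_000)
    return params
```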
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods tend to be slow and computationally expensive.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
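One generic way to realize such online mixing is a multiplicative-weights (EXP3-style) update over domain proportions, with the observed training loss as the reward signal. The sketch below is that generic bandit update, stated as an assumption; it is not claimed to be the paper's exact estimator:

```python
import numpy as np

def online_mix_update(weights, domain, reward, lr=0.1):
    """EXP3-style step: after sampling a batch from `domain` and observing
    `reward` (e.g., its training loss), upweight domains that currently
    look most useful and renormalize into a sampling distribution."""
    probs = weights / weights.sum()
    est = reward / probs[domain]            # importance-weighted reward estimate
    weights = weights.copy()
    weights[domain] *= np.exp(lr * est / len(weights))
    return weights / weights.sum()
```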
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Over-training with Mixup May Hurt Generalization [32.64382185990981]
We report a previously unobserved phenomenon in Mixup training.
On a number of standard datasets, the performance of Mixup-trained models starts to decay after training for a large number of epochs.
We show theoretically that Mixup training may introduce undesired data-dependent label noise into the synthesized data.
arXiv Detail & Related papers (2023-03-02T18:37:34Z)
- C-Mixup: Improving Generalization in Regression [71.10418219781575]
The Mixup algorithm improves generalization by linearly interpolating pairs of examples and their corresponding labels.
We propose C-Mixup, which adjusts the sampling probability based on the similarity of the labels.
C-Mixup achieves 6.56%, 4.76%, 5.82% improvements in in-distribution generalization, task generalization, and out-of-distribution robustness, respectively.
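Concretely, adjusting the sampling probability by label similarity amounts to drawing the mixing partner from a kernel over label distances. A minimal sketch, with an illustrative Gaussian bandwidth and Beta parameter:

```python
import numpy as np

def c_mixup_pair(X, y, bandwidth=1.0, alpha=2.0, rng=np.random.default_rng(0)):
    """Draw one mixed regression sample: partner j is chosen with
    probability decaying in label distance to i, so nearby labels
    are mixed far more often than distant ones."""
    i = rng.integers(len(X))
    p = np.exp(-((y - y[i]) ** 2) / (2 * bandwidth ** 2))   # label-distance kernel
    p /= p.sum()
    j = rng.choice(len(X), p=p)
    lam = rng.beta(alpha, alpha)                            # usual mixup coefficient
    return lam * X[i] + (1 - lam) * X[j], lam * y[i] + (1 - lam) * y[j]
```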
arXiv Detail & Related papers (2022-10-11T20:39:38Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
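A minimal sketch of that two-step scheme for a single training instance, assuming its perturbed views are stacked as rows of a hidden-state matrix; the Dirichlet and Beta draws are illustrative stand-ins for the paper's mixing weights:

```python
import numpy as np

def doublemix_hidden(h_orig, h_views, rng=np.random.default_rng(0)):
    """Step 1: convex-combine the perturbed views of one sample.
    Step 2: interpolate that mixture back with the original hidden state.
    The label is unchanged: every view comes from the same instance."""
    w = rng.dirichlet(np.ones(h_views.shape[0]))    # weights over perturbed views
    h_view_mix = w @ h_views                        # step 1: mix the views
    lam = rng.beta(1.0, 1.0)
    return lam * h_orig + (1 - lam) * h_view_mix    # step 2: mix with original
```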
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- RandoMix: A mixed sample data augmentation method with multiple mixed modes [12.466162659083697]
RandoMix is a mixed-sample data augmentation method designed to address robustness and diversity challenges.
We evaluate the effectiveness of RandoMix on diverse datasets, including CIFAR-10/100, Tiny-ImageNet, ImageNet, and Google Speech Commands.
arXiv Detail & Related papers (2022-05-18T05:31:36Z)
- Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z)
- Towards Understanding the Data Dependency of Mixup-style Training [14.803285140800542]
In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels.
Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk.
For a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.
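For reference, the convex-combination paradigm these results analyze takes only a few lines; labels are assumed one-hot here so they can be interpolated directly:

```python
import numpy as np

def mixup_batch(X, Y, alpha=0.2, rng=np.random.default_rng(0)):
    """Vanilla mixup: convex-combine a batch with a shuffled copy of
    itself using a single Beta-distributed coefficient."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    return lam * X + (1 - lam) * X[perm], lam * Y + (1 - lam) * Y[perm]
```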
arXiv Detail & Related papers (2021-10-14T18:13:57Z)
- Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is a recent data augmentation technique that linearly interpolates input examples and their corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup into a transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.