When Chosen Wisely, More Data Is What You Need: A Universal
Sample-Efficient Strategy For Data Augmentation
- URL: http://arxiv.org/abs/2203.09391v1
- Date: Thu, 17 Mar 2022 15:33:52 GMT
- Title: When Chosen Wisely, More Data Is What You Need: A Universal
Sample-Efficient Strategy For Data Augmentation
- Authors: Ehsan Kamalloo, Mehdi Rezagholizadeh, Ali Ghodsi
- Abstract summary: We present a universal Data Augmentation (DA) technique, called Glitter, to overcome both issues.
Glitter adaptively selects a subset of worst-case samples with maximal loss, analogous to adversarial DA.
Our experiments on the GLUE benchmark, SQuAD, and HellaSwag in three widely used training setups reveal that Glitter is substantially faster to train and achieves competitive performance.
- Score: 19.569164094496955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data Augmentation (DA) is known to improve the generalizability of deep
neural networks. Most existing DA techniques naively add a certain number of
augmented samples without considering the quality and the added computational
cost of these samples. To tackle this problem, a common strategy, adopted by
several state-of-the-art DA methods, is to adaptively generate or re-weight
augmented samples with respect to the task objective during training. However,
these adaptive DA methods: (1) are computationally expensive and not
sample-efficient, and (2) are designed merely for a specific setting. In this
work, we present a universal DA technique, called Glitter, to overcome both
issues. Glitter can be plugged into any DA method, making training
sample-efficient without sacrificing performance. From a pre-generated pool of
augmented samples, Glitter adaptively selects a subset of worst-case samples
with maximal loss, analogous to adversarial DA. The task objective can then be
optimized on the selected subset without altering the training strategy. Our
thorough experiments on the GLUE benchmark, SQuAD, and HellaSwag in three
widely used training setups including consistency training, self-distillation
and knowledge distillation reveal that Glitter is substantially faster to train
and achieves competitive performance compared to strong baselines.
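Below is a minimal sketch of the selection step the abstract describes, assuming a PyTorch-style classifier and a pre-generated pool of augmented variants for one original example; the function name `glitter_select` and the `num_keep` parameter are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def glitter_select(model, aug_inputs, aug_labels, num_keep):
    """Pick the worst-case (maximal-loss) augmented samples from a pre-generated pool."""
    with torch.no_grad():                              # selection itself needs no gradients
        logits = model(aug_inputs)                     # (pool_size, num_classes)
        losses = F.cross_entropy(logits, aug_labels, reduction="none")
    worst = torch.topk(losses, k=num_keep).indices     # indices of the maximal-loss samples
    return aug_inputs[worst], aug_labels[worst]
```

Training then proceeds as usual, with the task objective (consistency, self-distillation, or knowledge-distillation loss in the paper's setups) computed on the selected subset rather than the full pool.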
Related papers
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
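As a rough illustration of instance-level reweighting in the DRO spirit, the sketch below up-weights high-loss samples via a softmax over per-sample losses; this is a generic formulation, not necessarily the paper's exact objective, and the `temperature` knob is an assumption.

```python
import torch

def instance_reweighted_loss(per_sample_losses, temperature=1.0):
    """Up-weight hard (high-loss) samples with a softmax over the batch's losses."""
    # Detach the weights so they act as constants during backpropagation;
    # a smaller temperature concentrates weight on the hardest instances.
    weights = torch.softmax(per_sample_losses.detach() / temperature, dim=0)
    return (weights * per_sample_losses).sum()
```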
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS).
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z)
- Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
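A small sketch of the stated idea, assuming PPO-style importance ratios; the `max_deviation` threshold and function name are illustrative, not taken from the paper.

```python
import torch

def sample_dropout_mask(new_logp, old_logp, max_deviation=0.25):
    """Keep only samples whose importance ratio stays close to 1."""
    ratio = torch.exp(new_logp - old_logp)          # importance sampling ratio
    keep = (ratio - 1.0).abs() <= max_deviation     # drop high-deviation samples
    # The returned mask multiplies the per-sample surrogate objective so that
    # dropped (high-variance) samples contribute nothing to the gradient.
    return keep.float()
```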
- ScoreMix: A Scalable Augmentation Strategy for Training GANs with Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available.
We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
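For context, the base operation that over-sampling methods in this family build on can be sketched as a simplified SMOTE-style interpolation (two random minority samples rather than true nearest-neighbor pairs; this is not AutoSMOTE's learned policy).

```python
import numpy as np

def smote_style_sample(minority_x, rng=None):
    """Create one synthetic minority sample by interpolating two minority samples."""
    if rng is None:
        rng = np.random.default_rng()
    i, j = rng.choice(len(minority_x), size=2, replace=False)
    lam = rng.uniform()                               # interpolation coefficient in [0, 1)
    return minority_x[i] + lam * (minority_x[j] - minority_x[i])
```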
- Sampling Through the Lens of Sequential Decision Making [9.101505546901999]
We propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR).
Our approach adaptively adjusts the sampling process to achieve optimal performance.
Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets.
arXiv Detail & Related papers (2022-08-17T04:01:29Z)
- ReSmooth: Detecting and Utilizing OOD Samples when Training with Data Augmentation [57.38418881020046]
Recent DA techniques increasingly pursue diversity in augmented training samples.
An augmentation strategy with high diversity usually introduces out-of-distribution (OOD) augmented samples.
We propose ReSmooth, a framework that first detects OOD samples among the augmented samples and then leverages them.
arXiv Detail & Related papers (2022-05-25T09:29:27Z)
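One plausible instantiation of the detect-then-leverage idea, under the assumption that OOD augmented samples can be flagged by unusually high per-sample loss and trained against label-smoothed targets; this is a sketch, not necessarily ReSmooth's actual procedure.

```python
import torch
import torch.nn.functional as F

def resmooth_style_loss(logits, labels, ood_threshold=2.0, smoothing=0.2):
    """Treat unusually high-loss augmented samples as OOD and soften their targets."""
    hard = F.cross_entropy(logits, labels, reduction="none")
    is_ood = hard.detach() > ood_threshold            # crude OOD flag via per-sample loss
    soft = F.cross_entropy(logits, labels, reduction="none", label_smoothing=smoothing)
    return torch.where(is_ood, soft, hard).mean()     # leverage OOD samples, don't discard
```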
- SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of randomly choosing which samples to augment are alleviated and the effectiveness of DA is improved.
arXiv Detail & Related papers (2021-12-06T08:38:38Z)
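A minimal sketch of the per-batch procedure described in the SelectAugment entry, with hypothetical `augment_fn` and `score_fn` helpers standing in for the decisions the paper learns online.

```python
def select_and_augment(batch, augment_fn, score_fn, ratio=0.5):
    """Deterministically augment the top `ratio` fraction of a batch by score."""
    k = int(len(batch) * ratio)                       # per-batch augmentation budget
    ranked = sorted(range(len(batch)), key=lambda i: score_fn(batch[i]), reverse=True)
    chosen = set(ranked[:k])                          # fixed scores -> deterministic choice
    return [augment_fn(x) if i in chosen else x for i, x in enumerate(batch)]
```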
This list is automatically generated from the titles and abstracts of the papers on this site.