Reprint: a randomized extrapolation based on principal components for data augmentation
- URL: http://arxiv.org/abs/2204.12024v1
- Date: Tue, 26 Apr 2022 01:38:47 GMT
- Title: Reprint: a randomized extrapolation based on principal components for data augmentation
- Authors: Jiale Wei, Qiyuan Chen, Pai Peng, Benjamin Guedj, Le Li
- Abstract summary: This paper presents a simple and effective hidden-space data augmentation method for imbalanced data classification.
Given hidden-space representations of the samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for a target class.
The method includes a label refinement component that synthesizes new soft labels for the augmented examples.
- Score: 11.449992652644577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data scarcity and data imbalance have attracted a lot of attention in many
fields. Data augmentation, explored as an effective approach to tackle them,
can improve the robustness and efficiency of classification models by
generating new samples. This paper presents REPRINT, a simple and effective
hidden-space data augmentation method for imbalanced data classification. Given
hidden-space representations of the samples in each class, REPRINT
extrapolates, in a randomized fashion, augmented examples for a target class,
using subspaces spanned by principal components to summarize the distribution
structure of both the source and target classes. Consequently, the generated
examples diversify the target class while preserving the original geometry of
its distribution. In addition, the method includes a label refinement component
that synthesizes new soft labels for augmented examples. Compared with a range
of NLP data augmentation approaches under various data-imbalance scenarios on
four text classification benchmarks, REPRINT shows prominent improvements.
Moreover, comprehensive ablation studies show that label refinement is better
than label preservation for augmented examples, and that our method yields
stable and consistent improvements across suitable choices of principal
components. Finally, REPRINT is easy to use: it has a single hyperparameter,
the dimension of the subspace, and requires few computational resources.
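The core idea, randomized extrapolation along principal-component subspaces of the source and target classes, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation; the toy data, subspace dimension `k`, and extrapolation scale are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_subspace(X, k):
    # Top-k principal directions of a class, via SVD of centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]  # (k, d)

def extrapolate(target_X, source_X, k=2, scale=0.5, n_new=5):
    # Randomized extrapolation: perturb randomly chosen target samples
    # along directions drawn from the principal subspaces of both the
    # source and target classes.
    V = np.vstack([pca_subspace(target_X, k), pca_subspace(source_X, k)])
    idx = rng.integers(0, len(target_X), size=n_new)
    coeffs = rng.normal(scale=scale, size=(n_new, V.shape[0]))
    return target_X[idx] + coeffs @ V

# Toy hidden-space representations (assumed): two classes in 4-D.
minority = rng.normal(size=(10, 4))
majority = rng.normal(loc=2.0, size=(50, 4))
aug = extrapolate(minority, majority, k=2, n_new=8)
print(aug.shape)  # (8, 4)
```

Because every perturbation lies in the span of the classes' leading principal directions, the synthetic points stay close to the geometry of the target distribution rather than scattering isotropically.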
Related papers
- AEMLO: AutoEncoder-Guided Multi-Label Oversampling [6.255095509216069]
AEMLO is an AutoEncoder-guided Oversampling technique for imbalanced multi-label data.
We show that AEMLO outperforms the existing state-of-the-art methods with extensive empirical studies.
arXiv Detail & Related papers (2024-08-23T14:01:33Z)
- TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixup.
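That interpolation step can be illustrated in isolation. This is a hypothetical sketch: the smoothing weight `alpha` and the example values are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def instance_label_smoothing(model_probs, one_hot, alpha=0.7):
    # Linearly interpolate the model's predicted distribution with the
    # one-hot label to obtain an instance-specific soft label.
    return alpha * one_hot + (1.0 - alpha) * model_probs

probs = np.array([0.6, 0.3, 0.1])   # model output for one sample (assumed)
label = np.array([1.0, 0.0, 0.0])   # original hard label
soft = instance_label_smoothing(probs, label)
print(soft)  # [0.88 0.09 0.03]
```

Unlike uniform label smoothing, the smoothed mass here follows the model's own confidence pattern, so each instance gets a different soft label.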
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Intra-class Adaptive Augmentation with Neighbor Correction for Deep Metric Learning [99.14132861655223]
We propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning.
We reasonably estimate intra-class variations for every class and generate adaptive synthetic samples to support hard samples mining.
Our method outperforms the state-of-the-art methods, improving retrieval performance by 3%-6%.
arXiv Detail & Related papers (2022-11-29T14:52:38Z)
- Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision [75.1860418333995]
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently.
The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources as labeling functions.
Existing statistical label models typically rely only on the outputs of the LFs, ignoring the instance features when modeling the underlying generative process.
arXiv Detail & Related papers (2022-10-06T07:28:53Z)
- Evolving Multi-Label Fuzzy Classifier [5.53329677986653]
Multi-label classification has attracted much attention in the machine learning community to address the problem of assigning single samples to more than one class at the same time.
We propose an evolving multi-label fuzzy classifier (EFC-ML) which is able to self-adapt and self-evolve its structure with new incoming multi-label samples in an incremental, single-pass manner.
arXiv Detail & Related papers (2022-03-29T08:01:03Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
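The per-sample weighting idea can be sketched as below. Note this is a simplified, hypothetical illustration: ABSGD derives its weights from a distributionally robust objective, while here we only show the generic pattern of softmax-of-loss importance weights inside a mini-batch; the temperature `tau` and loss values are assumptions.

```python
import numpy as np

def weighted_batch_loss(losses, tau=1.0):
    # Assign each sample an importance weight via a softmax over its
    # loss value (higher-loss samples get larger weights), then take
    # the weighted sum as the mini-batch objective.
    w = np.exp(losses / tau)
    w /= w.sum()
    return float(np.sum(w * losses)), w

losses = np.array([0.1, 0.5, 2.0])  # hypothetical per-sample losses
total, weights = weighted_batch_loss(losses)
print(weights.argmax())  # index of the hardest sample
```

Hard (high-loss) samples, e.g. minority-class or noisy examples, thereby contribute more to the gradient than easy ones, which is the mechanism the abstract describes for handling imbalance.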
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning [10.051309746913512]
We propose an oversampling method based on a conditional Wasserstein GAN.
We benchmark our method against standard oversampling methods and the imbalanced baseline on seven real-world datasets.
arXiv Detail & Related papers (2020-08-20T20:33:56Z)
- Heavy-tailed Representations, Text Polarity Classification & Data Augmentation [11.624944730002298]
We develop a novel method to learn a heavy-tailed embedding with desirable regularity properties.
We obtain a classifier dedicated to the tails of the proposed embedding, whose performance outperforms the baseline.
Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework.
arXiv Detail & Related papers (2020-03-25T19:24:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.