Data Generation in Low Sample Size Setting Using Manifold Sampling and a
Geometry-Aware VAE
- URL: http://arxiv.org/abs/2103.13751v1
- Date: Thu, 25 Mar 2021 11:07:10 GMT
- Title: Data Generation in Low Sample Size Setting Using Manifold Sampling and a
Geometry-Aware VAE
- Authors: Cl\'ement Chadebec and St\'ephanie Allassonni\`ere
- Abstract summary: We develop two non \emph{prior-dependent} generation procedures based on the geometry of the latent space.
The latter method is used to perform data augmentation in a small sample size setting and is validated across various standard and \emph{real-life} data sets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While much effort has been focused on improving Variational Autoencoders
through richer posterior and prior distributions, little interest has been shown in
amending the way we generate the data. In this paper, we develop two non
\emph{prior-dependent} generation procedures based on the geometry of the
latent space seen as a Riemannian manifold. The first consists in sampling
along geodesic paths, which is a natural way to explore the latent space, while
the second consists in sampling from the inverse of the metric volume element,
which is easier to use in practice. Both methods are then compared to
\emph{prior-based} methods on various data sets and appear well suited to a
limited data regime. Finally, the latter method is used to perform data
augmentation in a small sample size setting and is validated across various
standard and \emph{real-life} data sets. In particular, this scheme greatly
improves classification results on the OASIS database, where balanced
accuracy jumps from 80.7% for a classifier trained with the raw data to 89.1%
when trained only with the synthetic data generated by our method. Such results
were also observed on four standard data sets.
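A minimal sketch of the second procedure, assuming the learned Riemannian metric is available as a callable G(z) returning a d x d matrix: latent codes are rejection-sampled with density proportional to det(G(z))^(-1/2), the inverse of the metric volume element, over a bounded latent box and then decoded. The names metric_fn, toy_metric and decoder, the sampling box, and the rejection envelope are illustrative placeholders, not the authors' implementation.

import numpy as np

def sample_inverse_volume(metric_fn, n_samples, latent_dim, bound=3.0, rng=None):
    # Rejection-sample z with density proportional to 1/sqrt(det G(z)) on [-bound, bound]^d.
    rng = np.random.default_rng() if rng is None else rng
    # Crude envelope for the unnormalised target density, estimated from random probes.
    probes = rng.uniform(-bound, bound, size=(1000, latent_dim))
    max_density = max(1.0 / np.sqrt(np.linalg.det(metric_fn(z))) for z in probes)
    samples = []
    while len(samples) < n_samples:
        z = rng.uniform(-bound, bound, size=latent_dim)
        density = 1.0 / np.sqrt(np.linalg.det(metric_fn(z)))
        if rng.uniform() < density / max_density:  # accept with probability proportional to the target
            samples.append(z)
    return np.stack(samples)

# Toy usage with a made-up metric whose volume element grows away from the origin,
# so accepted samples concentrate where 1/sqrt(det G(z)) is largest.
toy_metric = lambda z: np.eye(len(z)) * (1.0 + z @ z)
z_new = sample_inverse_volume(toy_metric, n_samples=16, latent_dim=2)
# x_new = decoder(z_new)   # hypothetical: decode with the trained VAE decoder

In practice the accepted latent codes are simply passed through the trained decoder to obtain synthetic observations, which is what makes this sampler attractive for data augmentation in the low-sample-size regime.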
Related papers
- SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity [36.9096162214815]
Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology.
We propose a novel sample-wise data mixture approach based on a bottom-up paradigm.
arXiv Detail & Related papers (2025-03-03T13:22:11Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z) - Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z) - RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z) - Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS).
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z) - DoubleMix: Simple Interpolation-Based Data Augmentation for Text
Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training sample.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z) - Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z) - Distance in Latent Space as Novelty Measure [0.0]
We propose to intelligently select samples when constructing data sets.
The selection methodology is based on the presumption that two dissimilar samples are worth more than two similar samples in a data set.
By using a self-supervised method to construct the latent space, it is ensured that the space fits the data well and that any upfront labeling effort can be avoided.
arXiv Detail & Related papers (2020-03-31T09:14:56Z)
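As a loose illustration of the distance-as-novelty idea in the last entry above, the sketch below greedily selects the sample farthest, in latent space, from everything already chosen. The embeddings are assumed to come from some self-supervised encoder and to be precomputed; this is an assumption-laden sketch, not the authors' exact selection procedure.

import numpy as np

def select_by_latent_novelty(embeddings, n_select, rng=None):
    # Greedy farthest-point selection: each new pick maximises the distance to
    # the closest already-selected sample in the latent space.
    rng = np.random.default_rng() if rng is None else rng
    selected = [int(rng.integers(len(embeddings)))]   # seed with a random sample
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(n_select - 1):
        candidate = int(np.argmax(min_dist))          # most novel remaining sample
        selected.append(candidate)
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[candidate], axis=1))
    return selected

# Toy usage: keep the 10 most mutually dissimilar of 100 random 32-d latent vectors.
latent = np.random.default_rng(0).normal(size=(100, 32))
subset = select_by_latent_novelty(latent, n_select=10)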
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.