Data Generation in Low Sample Size Setting Using Manifold Sampling and a
Geometry-Aware VAE
- URL: http://arxiv.org/abs/2103.13751v1
- Date: Thu, 25 Mar 2021 11:07:10 GMT
- Title: Data Generation in Low Sample Size Setting Using Manifold Sampling and a
Geometry-Aware VAE
- Authors: Clément Chadebec and Stéphanie Allassonnière
- Abstract summary: We develop two non prior-dependent generation procedures based on the geometry of the latent space.
The latter method is used to perform data augmentation in a small sample size setting and is validated across various standard and real-life data sets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While much effort has been focused on improving Variational
Autoencoders through richer posterior and prior distributions, little interest
has been shown in amending the way we generate the data. In this paper, we
develop two non \emph{prior-dependent} generation procedures based on the
geometry of the latent space seen as a Riemannian manifold. The first consists
of sampling along geodesic paths, which is a natural way to explore the latent
space, while the second consists of sampling from the inverse of the metric
volume element, which is easier to use in practice. Both methods are then compared to
\emph{prior-based} methods on various data sets and appear well suited for a
limited data regime. Finally, the latter method is used to perform data
augmentation in a small sample size setting and is validated across various
standard and \emph{real-life} data sets. In particular, this scheme greatly
improves classification results on the OASIS database, where balanced
accuracy jumps from 80.7% for a classifier trained with the raw data to 89.1%
when trained only with the synthetic data generated by our method. Such results
were also observed on 4 standard data sets.
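To make the second scheme concrete, here is a minimal sketch, assuming a trained geometry-aware VAE whose metric tensor G(z) and decoder are exposed through hypothetical `metric` and `decode` functions. It targets the density proportional to det G(z)^(-1/2) with a plain random-walk Metropolis-Hastings sampler (a generic stand-in, not necessarily the sampler used in the paper) and then decodes the retained latent codes into new data.

```python
# Minimal sketch: sample latents from rho(z) ∝ det G(z)^(-1/2), the inverse
# of the Riemannian volume element, via random-walk Metropolis-Hastings.
# `metric` (returning G(z)) and `decode` are hypothetical stand-ins for a
# trained geometry-aware VAE; this is an illustration, not the authors' code.
import numpy as np

def log_target(z, metric):
    """Unnormalised log-density: -1/2 * log det G(z)."""
    _, logdet = np.linalg.slogdet(metric(z))
    return -0.5 * logdet

def sample_inverse_volume(metric, dim, n_samples, step=0.1, burn_in=500, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    z = np.zeros(dim)                                # start at the latent origin
    logp = log_target(z, metric)
    draws = []
    for it in range(burn_in + n_samples):
        prop = z + step * rng.standard_normal(dim)   # Gaussian proposal
        logp_prop = log_target(prop, metric)
        if np.log(rng.uniform()) < logp_prop - logp: # MH accept/reject
            z, logp = prop, logp_prop
        if it >= burn_in:
            draws.append(z.copy())
    return np.stack(draws)

# Toy metric whose volume element grows away from the origin: the target
# det G(z)^(-1/2) then reduces to a Gaussian centred on the latent origin.
toy_metric = lambda z: np.exp(z @ z) * np.eye(len(z))
latents = sample_inverse_volume(toy_metric, dim=2, n_samples=1000)
# new_data = decode(latents)   # pass the latent draws through the decoder
```

Only an unnormalised density is needed, so any MCMC scheme would do; the point is that sampling is driven by the learned geometry of the latent space rather than by the prior.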
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
- Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z)
- On Pretraining Data Diversity for Self-Supervised Learning [57.91495006862553]
We explore the impact of training with more diverse datasets on the performance of self-supervised learning (SSL) under a fixed computational budget.
Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal.
arXiv Detail & Related papers (2024-03-20T17:59:58Z)
- RECOST: External Knowledge Guided Data-efficient Instruction Tuning [25.985023475991625]
We argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset.
We propose a framework dubbed RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline.
arXiv Detail & Related papers (2024-02-27T09:47:36Z)
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS).
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z)
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models (a sketch follows this entry).
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
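A rough sketch of that two-step interpolation, reconstructed from the summary above rather than from the authors' code; the Dirichlet and uniform mixing weights, the shapes, and the `double_mix` helper are illustrative assumptions:

```python
# DoubleMix-style augmentation sketch (our reconstruction, not the original):
# step 1 mixes the perturbed variants of an example into a single vector;
# step 2 interpolates that mixture back toward the original hidden state,
# keeping the original dominant so the label stays valid.
import numpy as np

rng = np.random.default_rng(0)

def double_mix(h_orig, h_perturbed, alpha=1.0, low=0.5):
    """h_orig: (d,) hidden state; h_perturbed: (k, d) perturbed variants."""
    k = h_perturbed.shape[0]
    w = rng.dirichlet(alpha * np.ones(k))   # step 1: mix the perturbations
    h_mix = w @ h_perturbed
    lam = rng.uniform(low, 1.0)             # step 2: original stays dominant
    return lam * h_orig + (1.0 - lam) * h_mix

# Dummy usage; in practice h comes from an encoder and the perturbed copies
# from e.g. token dropout or synonym replacement on the input text.
h = rng.standard_normal(16)
h_pert = h + 0.05 * rng.standard_normal((4, 16))
h_aug = double_mix(h, h_pert)
```

- Bi-level Alignment for Cross-Domain Crowd Counting [113.78303285148041]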
Current methods rely on external data for training an auxiliary task or apply an expensive coarse-to-fine estimation.
We develop a new adversarial learning based method, which is simple and efficient to apply.
We evaluate our approach on five real-world crowd counting benchmarks, where we outperform existing approaches by a large margin.
arXiv Detail & Related papers (2022-05-12T02:23:25Z)
- Implicit Data Augmentation Using Feature Interpolation for Diversified Low-Shot Image Generation [11.4559888429977]
Training of generative models can easily diverge in a low-data setting.
We propose a novel implicit data augmentation approach which facilitates stable training and synthesizes diverse samples.
arXiv Detail & Related papers (2021-12-04T23:55:46Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch (a sketch follows this entry).
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
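A hedged sketch of that weighted update, reconstructed from the summary rather than the authors' release; the softmax-over-losses weighting with temperature `lam` and the toy linear model are our assumptions:

```python
# ABSGD-style step sketch (illustrative): momentum SGD where each sample's
# gradient is scaled by an importance weight derived from its own loss.
import numpy as np

def absgd_step(w, v, X, y, grad_fn, loss_fn, lr=0.1, momentum=0.9, lam=5.0):
    losses = np.array([loss_fn(w, x, t) for x, t in zip(X, y)])
    p = np.exp(lam * (losses - losses.max()))    # higher loss -> higher weight
    p /= p.sum()                                 # normalise within mini-batch
    g = sum(pi * grad_fn(w, x, t) for pi, x, t in zip(p, X, y))
    v = momentum * v + g                         # momentum buffer
    return w - lr * v, v

# Toy usage: one weighted step on a least-squares linear model.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)
loss_fn = lambda w, x, t: 0.5 * (x @ w - t) ** 2
grad_fn = lambda w, x, t: (x @ w - t) * x
w, v = absgd_step(np.zeros(3), np.zeros(3), X, y, grad_fn, loss_fn)
```

A negative `lam` would instead down-weight high-loss samples, the natural variant for label noise rather than class imbalance.

- Distance in Latent Space as Novelty Measure [0.0]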
We propose to intelligently select samples when constructing data sets.
The selection methodology is based on the presumption that two dissimilar samples are worth more than two similar samples in a data set.
Using a self-supervised method to construct the latent space ensures that the space fits the data well and avoids any upfront labeling effort (a selection sketch follows this entry).
arXiv Detail & Related papers (2020-03-31T09:14:56Z)
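A small sketch of the selection idea, again our reconstruction rather than the authors' code: greedy farthest-point selection over latent embeddings, so each newly chosen sample is maximally dissimilar from everything already selected. The `embed` encoder mentioned in the comment is a hypothetical stand-in for the self-supervised model.

```python
# Novelty-driven subset selection sketch: repeatedly pick the candidate whose
# latent vector is farthest (in Euclidean distance) from the selected set.
import numpy as np

def select_novel(latents, n_select):
    """Greedy farthest-point selection over latent vectors of shape (n, d)."""
    chosen = [0]                                 # seed with the first sample
    d = np.linalg.norm(latents - latents[0], axis=1)
    while len(chosen) < n_select:
        i = int(d.argmax())                      # most novel remaining sample
        chosen.append(i)
        d = np.minimum(d, np.linalg.norm(latents - latents[i], axis=1))
    return chosen

rng = np.random.default_rng(0)
z = rng.standard_normal((100, 8))                # e.g. z = embed(raw_samples)
subset = select_novel(z, n_select=10)
```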