Deep Generative Modeling-based Data Augmentation with Demonstration
using the BFBT Benchmark Void Fraction Datasets
- URL: http://arxiv.org/abs/2308.10120v1
- Date: Sat, 19 Aug 2023 22:19:41 GMT
- Title: Deep Generative Modeling-based Data Augmentation with Demonstration
using the BFBT Benchmark Void Fraction Datasets
- Authors: Farah Alsafadi, Xu Wu
- Abstract summary: This paper explores the applications of deep generative models (DGMs) that have been widely used for image data generation to scientific data augmentation.
Once trained, DGMs can be used to generate synthetic data that are similar to the training data and significantly expand the dataset size.
- Score: 3.341975883864341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) has achieved remarkable successes in many disciplines such
as computer vision and natural language processing due to the availability of
"big data". However, such success cannot be easily replicated in many nuclear
engineering problems because of the limited amount of training data, especially
when the data comes from high-cost experiments. To overcome such a data
scarcity issue, this paper explores the applications of deep generative models
(DGMs) that have been widely used for image data generation to scientific data
augmentation. DGMs, such as generative adversarial networks (GANs), normalizing
flows (NFs), variational autoencoders (VAEs), and conditional VAEs (CVAEs), can
be trained to learn the underlying probabilistic distribution of the training
dataset. Once trained, they can be used to generate synthetic data that are
similar to the training data and significantly expand the dataset size. By
employing DGMs to augment TRACE simulated data of the steady-state void
fractions based on the NUPEC Boiling Water Reactor Full-size Fine-mesh Bundle
Test (BFBT) benchmark, this study demonstrates that VAEs, CVAEs, and GANs have
comparable generative performance with similar errors in the synthetic data,
with CVAEs achieving the smallest errors. The findings show that DGMs have a
great potential to augment scientific data in nuclear engineering, which proves
effective for expanding the training dataset and enabling other DL models to be
trained more accurately.
Related papers
- Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples [13.053285552524052]
This paper introduces an innovative Expansive Synthesis model that generates high-fidelity datasets from minimal samples.
We validate our Expansive Synthesis by training classifiers on the generated datasets and comparing their performance to classifiers trained on larger, original datasets.
arXiv Detail & Related papers (2024-06-25T02:59:02Z)
- Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z)
- Synthetic Face Datasets Generation via Latent Space Exploration from Brownian Identity Diffusion [20.352548473293993]
Face Recognition (FR) models are trained on large-scale datasets, which have privacy and ethical concerns.
Lately, the use of synthetic data to complement or replace genuine data for the training of FR models has been proposed.
We introduce a new method, inspired by the physical motion of soft particles subjected to Brownian forces, allowing us to sample identities in a latent space under various constraints.
With this in hand, we generate several face datasets and benchmark them by training FR models, showing that data generated with our method exceeds the performance of previous GAN-based datasets and achieves competitive performance with state-of-the-
arXiv Detail & Related papers (2024-04-30T22:32:02Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Generative adversarial networks for data-scarce spectral applications [0.0]
We report on an application of GANs in the domain of synthetic spectral data generation.
We show that CWGANs can act as a surrogate model with improved performance in the low-data regime.
arXiv Detail & Related papers (2023-07-14T16:27:24Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify bias present in the training data, but that by removing these bias-inducing samples, GANs essentially focus more on real informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z)
- The Bearable Lightness of Big Data: Towards Massive Public Datasets in Scientific Machine Learning [0.0]
We show that lossy compression algorithms offer a realistic pathway for exposing high-fidelity scientific data to open-source data repositories.
In this paper, we outline, construct, and evaluate the requirements for establishing a big data framework.
arXiv Detail & Related papers (2022-07-25T21:44:53Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)