GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data
- URL: http://arxiv.org/abs/2404.07356v2
- Date: Tue, 30 Apr 2024 18:29:23 GMT
- Title: GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data
- Authors: Daniel Platnick, Sourena Khanzadeh, Alireza Sadeghian, Richard Anthony Valenzano
- Abstract summary: This paper proposes GANsemble: a framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data.
To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.
- Score: 2.307414552248669
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Frechet Inception Distance (FID) and Inception Scores (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.
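The abstract describes two ideas that a short sketch can make concrete: oversampling minority classes with augmentation, and searching over candidate augmentation strategies for the best one. The Python below is a minimal illustration under stated assumptions, not the paper's data chooser module. The names `jitter`, `flip`, `oversample`, and `choose_strategy` are hypothetical, samples are assumed to be flat lists of floats, and the `score` callable stands in for whatever evaluation the user supplies (for example, the validation accuracy of a classifier trained on the balanced data).

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = List[float]  # assumption: each sample is a flat list of floats
Augment = Callable[[Sample, random.Random], Sample]
Score = Callable[[List[Sample], List[str]], float]


def jitter(sample: Sample, rng: random.Random) -> Sample:
    """Hypothetical augmentation: add small Gaussian noise to every value."""
    return [v + rng.gauss(0.0, 0.01) for v in sample]


def flip(sample: Sample, rng: random.Random) -> Sample:
    """Hypothetical augmentation: reverse the sample (rng is unused)."""
    return sample[::-1]


def oversample(
    data: List[Sample], labels: List[str], augment: Augment, seed: int = 0
) -> Tuple[List[Sample], List[str]]:
    """Add augmented copies until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_data, out_labels = list(data), list(labels)
    for cls, n in counts.items():
        # Draw only from the original samples of this class.
        pool = [x for x, y in zip(data, labels) if y == cls]
        for _ in range(target - n):
            out_data.append(augment(rng.choice(pool), rng))
            out_labels.append(cls)
    return out_data, out_labels


def choose_strategy(
    data: List[Sample],
    labels: List[str],
    strategies: Dict[str, Augment],
    score: Score,
) -> str:
    """Return the name of the strategy whose balanced data set scores highest."""
    results = {
        name: score(*oversample(data, labels, aug))
        for name, aug in strategies.items()
    }
    return max(results, key=results.get)
```

Calling `choose_strategy(data, labels, {"jitter": jitter, "flip": flip}, score)` balances the data once per strategy and returns the key of the best-scoring one. An exhaustive loop like this is only practical when the candidate list is short and `score` is cheap to evaluate.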
Related papers
- Generation of synthetic gait data: application to multiple sclerosis patients' gait patterns [0.0]
Multiple sclerosis (MS) is the leading cause of severe non-traumatic disability in young adults and its incidence is increasing worldwide.
The variability of gait impairment in MS necessitates the development of a non-invasive, sensitive, and cost-effective tool for quantitative gait evaluation.
The eGait movement sensor, designed to characterize human gait through unit quaternion time series (QTS) representing hip rotations, is a promising approach.
However, the small sample sizes typical of clinical studies pose challenges for the stability of gait data analysis tools.
arXiv Detail & Related papers (2024-11-15T17:32:01Z)
- Cancer-Net SCa-Synth: An Open Access Synthetically Generated 2D Skin Lesion Dataset for Skin Cancer Classification [65.83291923029985]
In the United States, skin cancer ranks as the most commonly diagnosed cancer, presenting a significant public health issue.
Recent advancements in dataset curation and deep learning have shown promise in quick and accurate detection of skin cancer.
Cancer-Net SCa-Synth is an open access synthetically generated 2D skin lesion dataset for skin cancer classification.
arXiv Detail & Related papers (2024-11-08T02:04:21Z)
- An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification [2.2940141855172036]
In molecular biology, there has been an explosion of data generated from multi-omics sequencing.
Traditional statistical methods face challenging tasks when dealing with such high dimensional data.
This study focuses on tackling these challenges with a neural network that incorporates an autoencoder to extract a latent space of the features.
arXiv Detail & Related papers (2024-05-16T01:45:55Z)
- SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection [53.19618419772467]
Single-frame infrared small target (SIRST) detection aims to recognize small targets from clutter backgrounds.
With the development of Transformer, the scale of SIRST models is constantly increasing.
With a rich diversity of infrared small target data, our algorithm significantly improves the model performance and convergence speed.
arXiv Detail & Related papers (2024-03-08T16:14:54Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Regularizing Neural Networks with Meta-Learning Generative Models [40.45689466486025]
We present a novel strategy for generative data augmentation called meta generative regularization (MGR).
MGR utilizes synthetic samples in the regularization term for feature extractors instead of in the loss function, e.g., cross-entropy.
Experiments on six datasets showed that MGR is effective particularly when datasets are smaller and stably outperforms baselines.
arXiv Detail & Related papers (2023-07-26T01:47:49Z)
- PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning [54.409634256153154]
In Reinforcement Learning (RL), enhancing sample efficiency is crucial.
In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction.
Our study investigates the underlying causes of this phenomenon by dividing plasticity into two aspects.
arXiv Detail & Related papers (2023-06-19T06:14:51Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Hybrid Deep Learning Model using SPCAGAN Augmentation for Insider Threat Analysis [7.576808824987132]
Anomaly detection using deep learning requires comprehensive data, but insider threat data is not readily available due to confidentiality concerns.
We propose a linear manifold learning-based generative adversarial network, SPCAGAN, that takes input from heterogeneous data sources.
We show that our proposed approach has a lower error, is more accurate, and generates substantially superior synthetic insider threat data than previous models.
arXiv Detail & Related papers (2022-03-06T02:08:48Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)