Synthetic Data Augmentation for Enhancing Harmful Algal Bloom Detection with Machine Learning
- URL: http://arxiv.org/abs/2503.03794v1
- Date: Wed, 05 Mar 2025 11:50:04 GMT
- Title: Synthetic Data Augmentation for Enhancing Harmful Algal Bloom Detection with Machine Learning
- Authors: Tianyi Huang
- Abstract summary: Harmful Algal Blooms (HABs) pose severe threats to aquatic ecosystems and public health, resulting in substantial economic losses globally. This study investigates the use of synthetic data augmentation to enhance HAB monitoring systems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Harmful Algal Blooms (HABs) pose severe threats to aquatic ecosystems and public health, resulting in substantial economic losses globally. Early detection is crucial but often hindered by the scarcity of high-quality datasets necessary for training reliable machine learning (ML) models. This study investigates the use of synthetic data augmentation using Gaussian Copulas to enhance ML-based HAB detection systems. Synthetic datasets of varying sizes (100-1,000 samples) were generated using relevant environmental features (water temperature, salinity, and UVB radiation), with corrected Chlorophyll-a concentration as the target variable. Experimental results demonstrate that moderate synthetic augmentation significantly improves model performance (RMSE reduced from 0.4706 to 0.1850; $p < 0.001$). However, excessive synthetic data introduces noise and reduces predictive accuracy, emphasizing the need for a balanced approach to data augmentation. These findings highlight the potential of synthetic data to enhance HAB monitoring systems, offering a scalable and cost-effective method for early detection and mitigation of ecological and public health risks.
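The abstract names Gaussian Copulas as the generator but does not describe an implementation. The following is a minimal standard-library sketch of the general technique, not the author's code: each column is mapped to normal scores through its empirical ranks, the correlation of those scores is estimated, correlated normals are drawn, and each draw is mapped back through the column's empirical quantiles. The function names, column names, and toy values are all hypothetical.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal used for the copula transform


def normal_scores(column):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def cholesky(matrix):
    """Lower-triangular factor; assumes the matrix is positive definite."""
    d = len(matrix)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = (matrix[i][i] - s) ** 0.5
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def quantile(sorted_vals, u):
    """Empirical quantile with linear interpolation, u in [0, 1]."""
    pos = u * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def sample_gaussian_copula(columns, n_samples, seed=0):
    """Draw synthetic rows whose rank dependence mimics `columns`."""
    rng = random.Random(seed)
    names = list(columns)
    d = len(names)
    scores = [normal_scores(columns[name]) for name in names]
    corr = [[1.0 if i == j else correlation(scores[i], scores[j])
             for j in range(d)] for i in range(d)]
    low = cholesky(corr)
    sorted_cols = [sorted(columns[name]) for name in names]
    rows = []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlated standard normals: z = L @ eps
        z = [sum(low[i][k] * eps[k] for k in range(i + 1)) for i in range(d)]
        rows.append({name: quantile(sorted_cols[i], STD_NORMAL.cdf(z[i]))
                     for i, name in enumerate(names)})
    return rows


# Toy stand-in for real measurements (made-up values, not the paper's data).
toy_rng = random.Random(42)
temperature = [toy_rng.uniform(15.0, 30.0) for _ in range(60)]
real = {
    "water_temperature": temperature,
    "salinity": [toy_rng.uniform(30.0, 36.0) for _ in range(60)],
    "uvb_radiation": [toy_rng.uniform(0.0, 5.0) for _ in range(60)],
    "chlorophyll_a": [0.1 * t + toy_rng.gauss(0.0, 0.3) for t in temperature],
}
synthetic = sample_gaussian_copula(real, n_samples=200, seed=1)
```

Because values are mapped back through empirical quantiles, synthetic samples stay within the observed range of each column. A production pipeline would more likely use a dedicated synthetic-data library and fitted parametric marginals; this sketch only illustrates the copula idea.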
Related papers
- A Statistical Approach for Synthetic EEG Data Generation [2.5648452174203062]
This study proposes a method combining correlation analysis and random sampling to generate realistic synthetic EEG data.
A Random Forest model trained to distinguish synthetic from real EEG performs at chance level, indicating high fidelity.
This method provides a scalable, privacy-preserving approach for augmenting EEG datasets, enabling more efficient model training in mental health research.
arXiv Detail & Related papers (2025-04-22T06:48:42Z)
- Synthetic Poisoning Attacks: The Impact of Poisoned MRI Image on U-Net Brain Tumor Segmentation [8.955776982854985]
We investigate the impact of synthetic MRI data on the robustness and segmentation accuracy of U-Net models for brain tumor segmentation. To quantify the effect of synthetic data contamination, we train U-Net models on progressively "poisoned" datasets.
arXiv Detail & Related papers (2025-02-06T07:21:19Z)
- Enhancing weed detection performance by means of GenAI-based image augmentation [0.0]
This paper investigates a generative AI-based augmentation technique that uses the Stable Diffusion model to produce diverse synthetic images for weed detection models. Results show substantial improvements in mean Average Precision for YOLO models trained with generative AI-augmented datasets.
arXiv Detail & Related papers (2024-11-27T17:00:34Z)
- Comprehensive Methodology for Sample Augmentation in EEG Biomarker Studies for Alzheimers Risk Classification [0.0]
Alzheimer's disease (AD), the leading type of dementia, accounts for 70% of cases. EEG measures show promise in identifying AD risk, but obtaining large samples for reliable comparisons is challenging. This study integrates signal processing, harmonization, and statistical techniques to enhance sample size and improve AD risk classification reliability.
arXiv Detail & Related papers (2024-11-20T10:31:02Z)
- SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection [53.19618419772467]
Single-frame infrared small target (SIRST) detection aims to recognize small targets from clutter backgrounds.
With the development of Transformer, the scale of SIRST models is constantly increasing.
With a rich diversity of infrared small target data, our algorithm significantly improves the model performance and convergence speed.
arXiv Detail & Related papers (2024-03-08T16:14:54Z)
- Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
- Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research [4.475998415951477]
Generative AI offers a promising approach to generating synthetic images, enhancing dataset diversity.
This study investigates the impact of synthetic data supplementation on the performance and generalizability of medical imaging research.
arXiv Detail & Related papers (2023-11-15T21:58:01Z)
- PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning [54.409634256153154]
In Reinforcement Learning (RL), enhancing sample efficiency is crucial.
In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction.
Our study investigates the underlying causes of this phenomenon by dividing plasticity into two aspects.
arXiv Detail & Related papers (2023-06-19T06:14:51Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)