A Reinforcement Learning Approach to Synthetic Data Generation
- URL: http://arxiv.org/abs/2512.21395v1
- Date: Wed, 24 Dec 2025 19:26:37 GMT
- Title: A Reinforcement Learning Approach to Synthetic Data Generation
- Authors: Natalia Espinosa-Dice, Nicholas J. Jackson, Chao Yan, Aaron Lee, Bradley A. Malin,
- Abstract summary: We introduce RLSyn, a novel framework that models the data generator as a policy over patient records.<n>We benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across privacy, utility, and fidelity evaluations.
- Score: 8.293402602656736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data generation (SDG) is a promising approach for enabling data sharing in biomedical studies while preserving patient privacy. Yet, state-of-the-art generative models often require large datasets and complex training procedures, limiting their applicability in small-sample settings. In this work, we reframe SDG as a reinforcement learning (RL) problem and introduce RLSyn, a novel framework that models the data generator as a stochastic policy over patient records and optimizes it using Proximal Policy Optimization with discriminator-derived rewards, yielding more stable and data-efficient training. We evaluate RLSyn on two biomedical datasets - AI-READI and MIMIC-IV- and benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across extensive privacy, utility, and fidelity evaluations. RL-Syn performs comparably to diffusion models and outperforms GANs on MIMIC-IV, while outperforming both diffusion models and GANs on the smaller AI-READI dataset. These results demonstrate that reinforcement learning provides a principled and effective alternative for synthetic biomedical data generation, particularly in data-scarce regimes.
Related papers
- Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality.<n>The framework aims to learn complex relationships between clinical data and genetic predispositions.<n>This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z) - Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions.<n>Real-world medical datasets are often difficult to access due to regulatory barriers.<n>We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z) - MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning [4.579424650757833]
Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks.<n>We introduce Generative Reward Learning for Medical Reasoning (MedGR$2$), a novel framework that creates a self-improving virtuous cycle.<n>Our experiments demonstrate that SFT with MedGR$2$-produced data already surpasses baselines trained on large-scale, human-curated datasets.
arXiv Detail & Related papers (2025-08-28T08:41:32Z) - Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices.<n>We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD)<n>The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller)
arXiv Detail & Related papers (2025-07-14T16:19:00Z) - TarDiff: Target-Oriented Diffusion Guidance for Synthetic Electronic Health Record Time Series Generation [26.116599951658454]
Time-series generation is crucial for advancing clinical machine learning models.<n>We argue that fidelity to observed data alone does not guarantee better model performance.<n>We propose TarDiff, a novel target-oriented diffusion framework that integrates task-specific influence guidance.
arXiv Detail & Related papers (2025-04-24T14:36:10Z) - Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data [6.318463500874778]
We propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale.<n>Our approach ensures biologically and diagnostically meaningful variations in the generated data.<n>We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using 60x-760x less data than models trained on large real-world datasets.
arXiv Detail & Related papers (2025-04-15T21:17:39Z) - Enhancing EEG Signal Generation through a Hybrid Approach Integrating Reinforcement Learning and Diffusion Models [6.102274021710727]
This study introduces an innovative approach to the synthesis of Electroencephalogram (EEG) signals by integrating diffusion models with reinforcement learning.
Our methodology enhances the generation of EEG signals with detailed temporal and spectral features, enriching the authenticity and diversity of synthetic datasets.
arXiv Detail & Related papers (2024-09-14T07:22:31Z) - Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios [8.062368743143388]
We propose a methodology that integrates artificial inductive biases into the generative process to improve data quality in low-data regimes.<n>We evaluate four approaches (pre-training, model averaging, Model-Agnostic Meta-Learning (MAML), and Domain Search (DRS)) and analyze their impact on the quality of the generated text.<n> Experimental results show that incorporating inductive bias substantially improves performance, with transfer learning methods outperforming meta-learning.
arXiv Detail & Related papers (2024-07-03T12:53:42Z) - Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited
Data [125.7135706352493]
Generative adversarial networks (GANs) typically require ample data for training in order to synthesize high-fidelity images.
Recent studies have shown that training GANs with limited data remains formidable due to discriminator overfitting.
This paper introduces a novel strategy called Adaptive Pseudo Augmentation (APA) to encourage healthy competition between the generator and the discriminator.
arXiv Detail & Related papers (2021-11-12T18:13:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.