Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
- URL: http://arxiv.org/abs/2509.04245v2
- Date: Tue, 16 Sep 2025 03:53:00 GMT
- Title: Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
- Authors: Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Panu Looareesuwan, Cholatid Ratanatharathorn,
- Abstract summary: Heart failure (HF) research is constrained by limited access to large, shareable datasets due to privacy regulations and institutional barriers. We generated synthetic HF datasets from institutional data comprising 12,552 unique patients. Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Heart failure (HF) research is constrained by limited access to large, shareable datasets due to privacy regulations and institutional barriers. Synthetic data generation offers a promising solution to overcome these challenges while preserving patient confidentiality. Methods: We generated synthetic HF datasets from institutional data comprising 12,552 unique patients using five deep learning models: tabular variational autoencoder (TVAE), normalizing flow, ADSGAN, SurvivalGAN, and tabular denoising diffusion probabilistic models (TabDDPM). We comprehensively evaluated synthetic data utility through statistical similarity metrics, survival prediction using machine learning, and privacy assessments. Results: SurvivalGAN and TabDDPM demonstrated high fidelity to the original dataset, exhibiting similar variable distributions and survival curves after applying histogram equalization. SurvivalGAN (C-indices: 0.71-0.76) and TVAE (C-indices: 0.73-0.76) achieved the strongest performance in survival prediction evaluation, closely matching real-data performance (C-indices: 0.73-0.76). Privacy evaluation confirmed protection against re-identification attacks. Conclusions: Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications. This publicly available synthetic dataset addresses critical data sharing barriers and provides a valuable resource for advancing HF research and predictive modeling.
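The abstract reports survival prediction quality as C-indices (concordance indices). As a rough illustration of what that number measures, the sketch below computes Harrell's C-index for right-censored data in plain Python. It is not the authors' evaluation pipeline, which is not described on this page; the function name `concordance_index` and the toy values are hypothetical, and tied event times are simply skipped to keep the example short.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times       -- follow-up time per subject
    events      -- 1 if the event was observed, 0 if the subject was censored
    risk_scores -- model output where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are ignored
        # Let `a` be the subject with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true ordering is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event received the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, invented for illustration only (not taken from the paper).
toy_times = [5, 8, 12, 20]
toy_events = [1, 0, 1, 1]
print(concordance_index(toy_times, toy_events, [0.9, 0.4, 0.6, 0.1]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported range of roughly 0.71 to 0.76 indicates that models trained on synthetic data ordered patients by risk about as well as models trained on real data.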
Related papers
- Quality Degradation Attack in Synthetic Data [5.461072909384133]
This study investigates quality attacks initiated by adversaries who possess access to the real dataset or control over the generation process. We formalize a corresponding threat model and empirically evaluate the effectiveness of targeted manipulations of real data.
arXiv Detail & Related papers (2026-01-06T11:43:31Z)
- Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z)
- A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets [0.9940728137241215]
Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations. We produce artificial datasets that emulate real data statistics while safeguarding patient privacy. This scalable, privacy-preserving approach matches state-of-the-art methods and sets new benchmarks for joint-distribution fidelity in healthcare.
arXiv Detail & Related papers (2025-10-12T09:23:43Z)
- Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI [0.841508985473488]
We propose a framework for synthetic biomedical time-series data generation based on advanced forecasting models. These synthetic datasets preserve essential temporal and spectral properties of real data.
arXiv Detail & Related papers (2025-10-06T09:32:10Z)
- Graph-Convolutional-Beta-VAE for Synthetic Abdominal Aorta Aneurysm Generation [4.363232795241618]
This study presents a beta-Variational Autoencoder Graph Convolutional Neural Network framework for generating synthetic Abdominal Aorta Aneurysms (AAA). Our approach extracts key anatomical features and captures complex statistical relationships within a compact disentangled latent space. The resulting synthetic AAA dataset preserves patient privacy while providing a scalable foundation for medical research, device testing, and computational modeling.
arXiv Detail & Related papers (2025-06-16T15:55:56Z)
- Robust Molecular Property Prediction via Densifying Scarce Labeled Data [53.24886143129006]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.
arXiv Detail & Related papers (2025-06-13T15:27:40Z)
- Zero-shot generation of synthetic neurosurgical data with large language models [0.7373617024876725]
This study aims to evaluate the capability of zero-shot generation of synthetic neurosurgical data with a large language model (LLM), GPT-4o. Data synthesized with GPT-4o can effectively augment clinical data with small sample sizes and train ML models for prediction of neurosurgical outcomes.
arXiv Detail & Related papers (2025-02-13T18:21:15Z)
- Socially Aware Synthetic Data Generation for Suicidal Ideation Detection Using Large Language Models [8.832297887534445]
We introduce an innovative strategy that leverages the capabilities of generative AI models to create synthetic data for suicidal ideation detection.
We benchmarked against state-of-the-art NLP classification models, specifically, those centered around the BERT family structures.
Our synthetic data-driven method, informed by social factors, offers consistent F1-scores of 0.82 for both models.
arXiv Detail & Related papers (2024-01-25T18:25:05Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
- Synthesize High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model [40.473866438962034]
Synthetic electronic health records can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis.
We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR.
arXiv Detail & Related papers (2023-04-04T23:53:34Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms the best of various state-of-the-art baselines by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)