Related papers: RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases

URL: http://arxiv.org/abs/2510.06267v1
Date: Mon, 06 Oct 2025 03:59:09 GMT
Title: RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases
Authors: Khartik Uppalapati, Shakeel Abdulkareem, Bora Yimenicioglu
Abstract summary: We propose a knowledge-guided, continuous-time diffusion framework that generates trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources into a heterogeneous knowledge graph comprising approximately 8 M typed edges. Timestamped sequences of lab-code, medication-code, and adverse-event-flag triples that contain no protected health information are produced.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose RareGraph-Synth, a knowledge-guided, continuous-time diffusion framework that generates realistic yet privacy-preserving synthetic electronic-health-record (EHR) trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources: Orphanet/Orphadata, the Human Phenotype Ontology (HPO), the GARD rare-disease KG, PrimeKG, and the FDA Adverse Event Reporting System (FAERS) into a heterogeneous knowledge graph comprising approximately 8 M typed edges. Meta-path scores extracted from this 8-million-edge KG modulate the per-token noise schedule in the forward stochastic differential equation, steering generation toward biologically plausible lab-medication-adverse-event co-occurrences while retaining score-based diffusion model stability. The reverse denoiser then produces timestamped sequences of lab-code, medication-code, and adverse-event-flag triples that contain no protected health information. On simulated ultra-rare-disease cohorts, RareGraph-Synth lowers categorical Maximum Mean Discrepancy by 40 percent relative to an unguided diffusion baseline and by greater than 60 percent versus GAN counterparts, without sacrificing downstream predictive utility. A black-box membership-inference evaluation using the DOMIAS attacker yields AUROC approximately 0.53, well below the 0.55 safe-release threshold and substantially better than the approximately 0.61 plus or minus 0.03 observed for non-KG baselines, demonstrating strong resistance to re-identification. These results suggest that integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance fidelity and privacy, enabling safer data sharing for rare-disease research.
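The abstract reports fidelity as a categorical Maximum Mean Discrepancy (MMD) but does not state which kernel or estimator was used. As a rough illustration only, the sketch below computes a biased squared-MMD estimate between two sets of integer-coded categorical records. The overlap kernel (fraction of matching positions) and the function names `overlap_kernel` and `categorical_mmd2` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def overlap_kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Fraction of matching categorical positions for every pair of rows.

    a has shape (n, d), b has shape (m, d); both hold integer category codes.
    Returns an (n, m) kernel matrix with entries in [0, 1].
    NOTE: this kernel choice is an assumption for illustration.
    """
    return (a[:, None, :] == b[None, :, :]).mean(axis=2)


def categorical_mmd2(real: np.ndarray, synth: np.ndarray) -> float:
    """Biased (V-statistic) estimate of squared MMD under the overlap kernel."""
    k_rr = overlap_kernel(real, real).mean()    # real vs. real similarity
    k_ss = overlap_kernel(synth, synth).mean()  # synthetic vs. synthetic
    k_rs = overlap_kernel(real, synth).mean()   # cross similarity
    return float(k_rr + k_ss - 2.0 * k_rs)


if __name__ == "__main__":
    # Toy data: 200 records of 3 categorical fields, 5 categories each.
    rng = np.random.default_rng(0)
    real = rng.integers(0, 5, size=(200, 3))
    close = rng.integers(0, 5, size=(200, 3))  # same distribution as real
    far = rng.integers(0, 2, size=(200, 3))    # shifted distribution
    print("MMD^2 (similar):", categorical_mmd2(real, close))
    print("MMD^2 (shifted):", categorical_mmd2(real, far))
```

Under this sketch, a lower value means the synthetic records are closer to the real ones in distribution; a relative reduction such as the 40 percent quoted above would be obtained by comparing two such values, one per generator.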

Related papers

Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z)
PUNCH: Physics-informed Uncertainty-aware Network for Coronary Hemodynamics [8.812266680285369]
We introduce a non-diagnosed, uncertainty-aware framework for estimating coronary flow reserve directly from standard angiography. The system integrates physics-informed neural networks with variational inference to infer coronary blood flow from first-principles models of contrast transport. The pipeline runs in approximately three minutes per patient on a single GPU, with no population-level training.
arXiv Detail & Related papers (2026-01-23T21:47:23Z)
I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging [2.384534878752428]
We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982; on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269.
arXiv Detail & Related papers (2025-11-05T23:28:14Z)
A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets [0.9940728137241215]
Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations. We produce artificial datasets that emulate real data statistics while safeguarding patient privacy. This scalable, privacy-preserving approach matches state-of-the-art methods and sets new benchmarks for joint-distribution fidelity in healthcare.
arXiv Detail & Related papers (2025-10-12T09:23:43Z)
Adapting HFMCA to Graph Data: Self-Supervised Learning for Generalizable fMRI Representations [57.054499278843856]
Functional magnetic resonance imaging (fMRI) analysis faces significant challenges due to limited dataset sizes and domain variability between studies. Traditional self-supervised learning methods inspired by computer vision often rely on positive and negative sample pairs. We propose adapting a recently developed Hierarchical Functional Maximal Correlation Algorithm (HFMCA) to graph-structured fMRI data.
arXiv Detail & Related papers (2025-10-05T12:35:01Z)
Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models [0.0]
Heart failure (HF) research is constrained by limited access to large, shareable datasets due to privacy regulations and institutional barriers. We generated synthetic HF datasets from institutional data comprising 12,552 unique patients. Deep learning-based synthetic data generation can produce high-fidelity, privacy-preserving HF datasets suitable for research applications.
arXiv Detail & Related papers (2025-09-04T14:17:58Z)
Detection of Autonomic Dysreflexia in Individuals With Spinal Cord Injury Using Multimodal Wearable Sensors [2.208475400165877]
Autonomic Dysreflexia (AD) is a potentially life-threatening condition characterized by sudden, severe blood pressure spikes in individuals with spinal cord injury (SCI). This study presents a non-invasive, explainable machine learning framework for detecting AD using multimodal wearable sensors.
arXiv Detail & Related papers (2025-07-23T21:18:23Z)
Regressor-free Molecule Generation to Support Drug Response Prediction [83.25894107956735]
Conditional generation based on the target IC50 score can obtain a more effective sampling space. Regressor-free guidance combines a diffusion model's score estimation with a regression controller model's gradient based on number labels.
arXiv Detail & Related papers (2024-05-23T13:22:17Z)
A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds [49.34500499203579]
We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics. We generate high-quality synthetic fMRI data based on user-supplied demographics.
arXiv Detail & Related papers (2024-05-13T17:49:20Z)
Guided Discrete Diffusion for Electronic Health Record Generation [47.129056768385084]
EHRs are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs.
arXiv Detail & Related papers (2024-04-18T16:50:46Z)
Fairness-Aware Data Augmentation for Cardiac MRI using Text-Conditioned Diffusion Models [1.6581402323174208]
We propose a method to alleviate imbalances inherent in datasets through the generation of synthetic data. We adopt ControlNet based on a denoising diffusion probabilistic model to condition on text assembled from patient metadata and cardiac geometry. Our experiments demonstrate the effectiveness of the proposed approach in mitigating dataset imbalances.
arXiv Detail & Related papers (2024-03-28T15:41:43Z)
SurvLatent ODE : A Neural ODE based time-to-event model with competing risks for longitudinal data improves cancer-associated Deep Vein Thrombosis (DVT) prediction [68.8204255655161]
We propose a generative time-to-event model, SurvLatent ODE, which parameterizes a latent representation under irregularly sampled data. Our model then utilizes the latent representation to flexibly estimate survival times for multiple competing events without specifying shapes of event-specific hazard function. SurvLatent ODE outperforms the current clinical standard Khorana Risk scores for stratifying DVT risk groups.
arXiv Detail & Related papers (2022-04-20T17:28:08Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.