Toward Valid Generative Clinical Trial Data with Survival Endpoints
- URL: http://arxiv.org/abs/2511.16551v1
- Date: Thu, 20 Nov 2025 17:03:38 GMT
- Title: Toward Valid Generative Clinical Trial Data with Survival Endpoints
- Authors: Perrine Chassat, Van Tuan Nguyen, Lucas Ducrot, Emilie Lanoy, Agathe Guilloux
- Abstract summary: Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power.
- Score: 4.7846041866823965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clinical trials face mounting challenges: fragmented patient populations, slow enrollment, and unsustainable costs, particularly for late phase trials in oncology and rare diseases. While external control arms built from real-world data have been explored, a promising alternative is the generation of synthetic control arms using generative AI. A central challenge is the generation of time-to-event outcomes, which constitute primary endpoints in oncology and rare disease trials, but are difficult to model under censoring and small sample sizes. Existing generative approaches, largely GAN-based, are data-hungry, unstable, and rely on strong assumptions such as independent censoring. We introduce a variational autoencoder (VAE) that jointly generates mixed-type covariates and survival outcomes within a unified latent variable framework, without assuming independent censoring. Across synthetic and real trial datasets, we evaluate our model in two realistic scenarios: (i) data sharing under privacy constraints, where synthetic controls substitute for original data, and (ii) control-arm augmentation, where synthetic patients mitigate imbalances between treated and control groups. Our method outperforms GAN baselines on fidelity, utility, and privacy metrics, while revealing systematic miscalibration of type I error and power. We propose a post-generation selection procedure that improves calibration, highlighting both progress and open challenges for generative survival modeling.
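To make the censoring issue mentioned in the abstract concrete, here is a minimal sketch of the standard Kaplan-Meier estimator, a common way to summarize time-to-event data when some patients are censored. It is not the paper's method, and the function name `kaplan_meier` and the toy data are illustrative assumptions only.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times:  observed follow-up time for each patient
    events: 1 if the event was observed at that time, 0 if the patient was censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e == 1)
    exit_counts = Counter(times)  # every patient leaves the risk set at their observed time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d > 0:
            # Patients censored at t still count as at risk at t (usual convention).
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


# Hypothetical toy data: five patients, two of them censored (event flag 0).
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored patients contribute to the risk set up to their last follow-up time but never trigger a drop in the curve, which is why treating censored times as event times biases survival estimates downward.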
Related papers
- A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis. CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z)
- Synthetic Survival Control: Extending Synthetic Controls for "When-If" Decision [14.313335826236722]
Estimating causal effects on time-to-event outcomes from observational data is challenging due to censoring, limited sample sizes, and non-random treatment assignment. We propose Synthetic Survival Control (SSC) to estimate counterfactual hazard trajectories in a panel data setting.
arXiv Detail & Related papers (2025-11-18T04:36:20Z)
- Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z)
- Deconstructing Intraocular Pressure: A Non-invasive Multi-Stage Probabilistic Inverse Framework [0.0]
Glaucoma is a leading cause of irreversible blindness driven by elevated intraocular pressure (IOP). We develop a framework to noninvasively estimate unmeasurable variables from sparse, routine data. Our framework achieves excellent agreement with state-of-the-art tonography with precision comparable to direct physical instruments.
arXiv Detail & Related papers (2025-09-17T16:50:23Z)
- Latent Noise Injection for Private and Statistically Aligned Synthetic Data Generation [7.240170769827935]
Synthetic data generation has become essential for scalable, privacy-preserving statistical analysis. We propose a Latent Noise Injection method using Masked Autoregressive Flows (MAF). Instead of directly sampling from the trained model, our method perturbs each data point in the latent space and maps it back to the data domain.
arXiv Detail & Related papers (2025-06-19T22:22:57Z)
- Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z)
- Challenges and Limitations in the Synthetic Generation of mHealth Sensor Data [3.10770247120758]
We introduce a novel evaluation framework designed to measure both the intrinsic quality of synthetic data and its utility in downstream predictive tasks. Our findings reveal critical limitations in the existing approaches, particularly in maintaining cross-modal consistency. We present our future research directions to enhance synthetic time series generation and improve the applicability of generative models in mHealth.
arXiv Detail & Related papers (2025-05-20T11:05:06Z)
- Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis [55.959002385347645]
Latent Drifting enables diffusion models to be conditioned for medical images fitted for the complex task of counterfactual image generation. We evaluate our method on three public longitudinal benchmark datasets of brain MRI and chest X-rays for counterfactual image generation.
arXiv Detail & Related papers (2024-12-30T01:59:34Z)
- Generation of synthetic gait data: application to multiple sclerosis patients' gait patterns [0.0]
Multiple sclerosis (MS) is the leading cause of severe non-traumatic disability in young adults and its incidence is increasing worldwide.
The variability of gait impairment in MS necessitates the development of a non-invasive, sensitive, and cost-effective tool for quantitative gait evaluation.
The eGait movement sensor, designed to characterize human gait through unit quaternion time series (QTS) representing hip rotations, is a promising approach.
However, the small sample sizes typical of clinical studies pose challenges for the stability of gait data analysis tools.
arXiv Detail & Related papers (2024-11-15T17:32:01Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
- A General Framework for Survival Analysis and Multi-State Modelling [70.31153478610229]
We use neural ordinary differential equations as a flexible and general method for estimating multi-state survival models.
We show that our model exhibits state-of-the-art performance on popular survival data sets and demonstrate its efficacy in a multi-state setting.
arXiv Detail & Related papers (2020-06-08T19:24:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.