SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis
- URL: http://arxiv.org/abs/2509.22352v1
- Date: Fri, 26 Sep 2025 13:50:29 GMT
- Title: SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis
- Authors: Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel
- Abstract summary: Survival analysis is a cornerstone of clinical research by modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. SurvDiff is an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. We show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets.
- Score: 34.89334607334426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Survival analysis is a cornerstone of clinical research by modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. We show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data.
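To make the terms "event time" and "right-censoring" in the abstract concrete, here is a minimal Python sketch. It is not SurvDiff and not taken from the paper. It only illustrates the kind of data a survival generator has to reproduce: each record holds an observed time (the earlier of a latent event time and a latent censoring time) and an event indicator. The function names `simulate_right_censored` and `kaplan_meier`, the exponential distributions, and the rate parameters are all assumptions chosen for illustration.

```python
import random


def simulate_right_censored(n, event_rate=0.1, censor_rate=0.05, seed=0):
    """Draw n (observed_time, event_indicator) pairs under independent right-censoring.

    Illustrative assumption: both latent times are exponentially distributed.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        event_time = rng.expovariate(event_rate)    # latent time of the event
        censor_time = rng.expovariate(censor_rate)  # latent dropout / loss to follow-up
        observed = min(event_time, censor_time)     # only the earlier time is recorded
        indicator = 1 if event_time <= censor_time else 0  # 1 = event, 0 = censored
        records.append((observed, indicator))
    return records


def kaplan_meier(records):
    """Return [(time, survival_probability)] at each distinct observed event time."""
    ordered = sorted(records)
    at_risk = len(ordered)
    survival = 1.0
    curve = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        events = 0
        removed = 0
        # Group all records that share the same observed time.
        while i < len(ordered) and ordered[i][0] == t:
            events += ordered[i][1]
            removed += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed  # censored records leave the risk set without an event
    return curve


if __name__ == "__main__":
    data = simulate_right_censored(500)
    censored_share = sum(1 for _, d in data if d == 0) / len(data)
    print(f"share of censored records: {censored_share:.2f}")
    print("first Kaplan-Meier steps:", kaplan_meier(data)[:3])
```

One simple way to use such a sketch is to compute the Kaplan-Meier curve and the share of censored records on both a real and a synthetic dataset and compare them. This is a rough check of event-time fidelity and censoring preservation, not the evaluation protocol used by the authors.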
Related papers
- EsurvFusion: An evidential multimodal survival fusion model based on Gaussian random fuzzy numbers [13.518282190712348]
EsurvFusion is designed to combine multimodal data at the decision level. It estimates modality-level reliability through a reliability discounting layer. This is the first work that studies multimodal survival analysis with both uncertainty and reliability.
arXiv Detail & Related papers (2024-12-02T07:35:29Z)
- SeqRisk: Transformer-augmented latent variable model for improved survival prediction with longitudinal data [4.1476925904032464]
We propose SeqRisk, a method that combines variational autoencoder (VAE) or longitudinal VAE (LVAE) with a transformer encoder and Cox proportional hazards module for risk prediction.
We demonstrate that SeqRisk performs competitively compared to existing approaches on both simulated and real-world datasets.
arXiv Detail & Related papers (2024-09-19T12:35:25Z)
- Multi-modal Data Binding for Survival Analysis Modeling with Incomplete Data and Annotations [19.560652381770243]
We introduce a novel framework that simultaneously handles incomplete data across modalities and censored survival labels.
Our approach employs advanced foundation models to encode individual modalities and align them into a universal representation space.
The proposed method demonstrates outstanding prediction accuracy in two survival analysis tasks on both employed datasets.
arXiv Detail & Related papers (2024-07-25T02:55:39Z)
- Generating Accurate Synthetic Survival Data by Conditioning on Outcomes [16.401141867387324]
Synthetically generated data can improve privacy, fairness, and data accessibility. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases. Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data.
arXiv Detail & Related papers (2024-05-27T16:34:18Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- TripleSurv: Triplet Time-adaptive Coordinate Loss for Survival Analysis [15.496918127515665]
We propose a time-adaptive coordinate loss function, TripleSurv, to handle the complexities of the learning process and exploit valuable survival time values.
Our TripleSurv is evaluated on three real-world survival datasets and a public synthetic dataset.
arXiv Detail & Related papers (2024-01-05T08:37:57Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- SurvivalGAN: Generating Time-to-Event Data for Survival Analysis [121.84429525403694]
Imbalances in censoring and time horizons cause generative models to experience three new failure modes specific to survival analysis.
We propose SurvivalGAN, a generative model that handles survival data by addressing the imbalance in the censoring and event horizons.
We evaluate this method via extensive experiments on medical datasets.
arXiv Detail & Related papers (2023-02-24T17:03:51Z)
- SurvLatent ODE : A Neural ODE based time-to-event model with competing risks for longitudinal data improves cancer-associated Deep Vein Thrombosis (DVT) prediction [68.8204255655161]
We propose a generative time-to-event model, SurvLatent ODE, which parameterizes a latent representation under irregularly sampled data.
Our model then utilizes the latent representation to flexibly estimate survival times for multiple competing events without specifying shapes of event-specific hazard function.
SurvLatent ODE outperforms the current clinical standard Khorana Risk scores for stratifying DVT risk groups.
arXiv Detail & Related papers (2022-04-20T17:28:08Z)
- DeepRite: Deep Recurrent Inverse TreatmEnt Weighting for Adjusting Time-varying Confounding in Modern Longitudinal Observational Data [68.29870617697532]
We propose Deep Recurrent Inverse TreatmEnt weighting (DeepRite) for time-varying confounding in longitudinal data.
DeepRite is shown to recover the ground truth from synthetic data, and estimate unbiased treatment effects from real data.
arXiv Detail & Related papers (2020-10-28T15:05:08Z)
- A General Framework for Survival Analysis and Multi-State Modelling [70.31153478610229]
We use neural ordinary differential equations as a flexible and general method for estimating multi-state survival models.
We show that our model exhibits state-of-the-art performance on popular survival data sets and demonstrate its efficacy in a multi-state setting.
arXiv Detail & Related papers (2020-06-08T19:24:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.