Generating Accurate Synthetic Survival Data by Conditioning on Outcomes
- URL: http://arxiv.org/abs/2405.17333v2
- Date: Tue, 05 Aug 2025 20:45:08 GMT
- Title: Generating Accurate Synthetic Survival Data by Conditioning on Outcomes
- Authors: Mohd Ashhad, Ricardo Henao
- Abstract summary: Synthetically generated data can improve privacy, fairness, and data accessibility. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases. Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data.
- Score: 16.401141867387324
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Synthetically generated data can improve privacy, fairness, and data accessibility; however, it can be challenging in specialized scenarios such as survival analysis. One key challenge in this setting is censoring, i.e., the timing of an event is unknown in some cases. Existing methods struggle to accurately reproduce the distributions of both observed and censored event times when generating synthetic data. We propose a conceptually simple approach that generates covariates conditioned on event times and censoring indicators by leveraging existing tabular data generation models without making assumptions about the mechanism underlying censoring. Experiments on real-world datasets demonstrate that our method consistently outperforms baselines and improves downstream survival model performance.
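The abstract's recipe, generating covariates conditioned on event times and censoring indicators, can be illustrated with a toy stand-in: stratify training rows by (time bin, censoring indicator) and fit a simple Gaussian per stratum in place of the tabular generative model. This is a minimal sketch under stated assumptions; the Gaussian model, the bin count, and all function names here are illustrative choices, not the paper's implementation:

```python
import numpy as np

def fit_conditional_generator(X, t, delta, n_bins=4):
    """Toy outcome-conditioned covariate generator.

    Stand-in for a tabular generative model: covariates are modeled with a
    Gaussian per (event-time bin, censoring indicator) stratum, so sampling
    conditions on the outcome rather than the other way around.
    """
    # Quantile-based time bins so strata have comparable sizes.
    edges = np.quantile(t, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(t, edges[1:-1]), 0, n_bins - 1)
    params = {}
    for b in range(n_bins):
        for d in (0, 1):
            mask = (bins == b) & (delta == d)
            if mask.sum() > X.shape[1]:  # enough rows for a covariance estimate
                params[(b, d)] = (X[mask].mean(0), np.cov(X[mask], rowvar=False))
    return edges, params

def sample_covariates(t_new, delta_new, edges, params, rng):
    """Draw a covariate vector given a desired event time and censoring flag."""
    n_bins = len(edges) - 1
    b = int(np.clip(np.digitize(t_new, edges[1:-1]), 0, n_bins - 1))
    mu, cov = params[(b, int(delta_new))]
    return rng.multivariate_normal(mu, cov)
```

Because the outcomes (t, delta) are fixed first and covariates are drawn afterwards, the synthetic data reproduces the joint distribution of observed and censored times by construction, which is the core idea the abstract describes.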
Related papers
- SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis [34.89334607334426]
Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. SurvDiff is an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. We show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets.
arXiv Detail & Related papers (2025-09-26T13:50:29Z) - SynDelay: A Synthetic Dataset for Delivery Delay Prediction [50.56729406793283]
We present SynDelay, a synthetic dataset designed for delivery delay prediction. It is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI.
arXiv Detail & Related papers (2025-08-30T21:54:37Z) - Latent Noise Injection for Private and Statistically Aligned Synthetic Data Generation [7.240170769827935]
Synthetic data generation has become essential for scalable, privacy-preserving statistical analysis. We propose a Latent Noise Injection method using Masked Autoregressive Flows (MAF). Instead of directly sampling from the trained model, our method perturbs each data point in the latent space and maps it back to the data domain.
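The perturb-in-latent-space mechanism from this blurb can be sketched with a deliberately simplified invertible map. Assumption: the MAF of the paper is replaced here by a linear whitening transform so the example stays self-contained; only the mechanism (encode, add noise in latent space, decode) is the same, and the function name is hypothetical:

```python
import numpy as np

def latent_noise_release(X, noise_scale=0.3, rng=None):
    """Latent noise injection sketch with a whitening 'flow'.

    z = L^{-1}(x - mu) maps data to an approximately standard-normal latent
    space; noise is added there; the inverse map x = L z + mu returns the
    perturbed points to the data domain.
    """
    if rng is None:
        rng = np.random.default_rng()
    mu = X.mean(0)
    cov = np.cov(X, rowvar=False)
    # Cholesky factor of the covariance acts as the (linear) decoder.
    L = np.linalg.cholesky(cov + 1e-9 * np.eye(X.shape[1]))
    Z = np.linalg.solve(L, (X - mu).T).T          # encode to latent space
    Z_noisy = Z + noise_scale * rng.normal(size=Z.shape)  # perturb
    return (L @ Z_noisy.T).T + mu                  # decode back to data domain
```

Because each released point is a perturbed copy of a real point rather than a fresh model sample, the output stays statistically aligned with the original data while no exact record is released, which is the trade-off the abstract highlights.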
arXiv Detail & Related papers (2025-06-19T22:22:57Z) - Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events [5.619671817895425]
Extreme events, such as market crashes, natural disasters, and pandemics, are rare but catastrophic. While data-driven methods offer powerful capabilities for extreme event modeling, they require abundant training data, yet extreme event data is inherently scarce. This survey provides the first overview of synthetic data generation for extreme events.
arXiv Detail & Related papers (2025-06-04T20:21:23Z) - Tackling Data Heterogeneity in Federated Time Series Forecasting [61.021413959988216]
Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting.
Most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices to a central cloud server.
We propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers.
arXiv Detail & Related papers (2024-11-24T04:56:45Z) - Multi-modal Data Binding for Survival Analysis Modeling with Incomplete Data and Annotations [19.560652381770243]
We introduce a novel framework that simultaneously handles incomplete data across modalities and censored survival labels.
Our approach employs advanced foundation models to encode individual modalities and align them into a universal representation space.
The proposed method demonstrates outstanding prediction accuracy in two survival analysis tasks on both datasets employed.
arXiv Detail & Related papers (2024-07-25T02:55:39Z) - A Temporally Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation [35.46631415365955]
We introduce a conditional diffusion framework called C$2$TSD, which incorporates disentangled temporal (trend and seasonality) representations as conditional information.
Our experiments on three real-world datasets demonstrate the superior performance of our approach compared to a number of state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-18T11:59:04Z) - TripleSurv: Triplet Time-adaptive Coordinate Loss for Survival Analysis [15.496918127515665]
We propose a time-adaptive coordinate loss function, TripleSurv, to handle the complexities of learning process and exploit valuable survival time values.
Our TripleSurv is evaluated on three real-world survival datasets and a public synthetic dataset.
arXiv Detail & Related papers (2024-01-05T08:37:57Z) - Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z) - CenTime: Event-Conditional Modelling of Censoring in Survival Analysis [49.44664144472712]
We introduce CenTime, a novel approach to survival analysis that directly estimates the time to event.
Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce.
Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance.
arXiv Detail & Related papers (2023-09-07T17:07:33Z) - Copula-Based Deep Survival Models for Dependent Censoring [10.962520289040336]
This paper presents a parametric model of survival that extends modern non-linear survival analysis by relaxing the assumption of conditional independence.
On synthetic and semi-synthetic data, our approach significantly improves estimates of survival distributions compared to the standard approach that assumes conditional independence in the data.
arXiv Detail & Related papers (2023-06-20T21:51:13Z) - SurvivalGAN: Generating Time-to-Event Data for Survival Analysis [121.84429525403694]
Imbalances in censoring and time horizons cause generative models to experience three new failure modes specific to survival analysis.
We propose SurvivalGAN, a generative model that handles survival data by addressing the imbalance in the censoring and event horizons.
We evaluate this method via extensive experiments on medical datasets.
arXiv Detail & Related papers (2023-02-24T17:03:51Z) - Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
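The density-based membership score this blurb describes can be sketched as a density ratio: a candidate point lying where the generator's output density exceeds the real-data density is a suspected member of the training set. Assumption: a naive Gaussian KDE stands in for the paper's actual density estimators, and the function names are illustrative:

```python
import numpy as np

def kde_logdensity(queries, data, bandwidth=0.5):
    """Naive Gaussian kernel density estimate (log scale), numpy only."""
    diff = queries[:, None, :] - data[None, :, :]
    sq = (diff ** 2).sum(-1)
    kernels = np.exp(-0.5 * sq / bandwidth ** 2)
    dim = queries.shape[1]
    norm = (2 * np.pi * bandwidth ** 2) ** (dim / 2) * len(data)
    return np.log(kernels.sum(1) / norm + 1e-300)  # floor avoids log(0)

def density_ratio_membership_score(candidates, synthetic, reference, bandwidth=0.5):
    """Membership score in the spirit of DOMIAS (simplified sketch).

    High score: the generator places more mass near the candidate than the
    reference data distribution does, i.e. suspected local overfitting.
    """
    return (kde_logdensity(candidates, synthetic, bandwidth)
            - kde_logdensity(candidates, reference, bandwidth))
```

A generator that memorized a training record concentrates synthetic mass around it, so that record scores higher than points the generator never saw; thresholding the score yields the membership call.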
arXiv Detail & Related papers (2023-02-24T11:27:39Z) - Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling [77.15766509677348]
Conditional generative models often inherit spurious correlations from the training dataset.
This can result in label-conditional distributions that are imbalanced with respect to another latent attribute.
We propose a general two-step strategy to mitigate this issue.
arXiv Detail & Related papers (2022-12-05T08:09:33Z) - Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z) - Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.