Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference
- URL: http://arxiv.org/abs/2510.18768v1
- Date: Tue, 21 Oct 2025 16:16:00 GMT
- Title: Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference
- Authors: Harry Amad, Zhaozhi Qian, Dennis Frauen, Julianna Piskorz, Stefan Feuerriegel, Mihaela van der Schaar
- Abstract summary: Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
- Score: 89.5628648718851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Causal inference is essential for developing and evaluating medical interventions, yet real-world medical datasets are often difficult to access due to regulatory barriers. This makes synthetic data a potentially valuable asset that enables these medical analyses, along with the development of new inference methods themselves. Generative models can produce synthetic data that closely approximate real data distributions, yet existing methods do not consider the unique challenges that downstream causal inference tasks, and specifically those focused on treatments, pose. We establish a set of desiderata that synthetic data containing treatments should satisfy to maximise downstream utility: preservation of (i) the covariate distribution, (ii) the treatment assignment mechanism, and (iii) the outcome generation mechanism. Based on these desiderata, we propose a set of evaluation metrics to assess such synthetic data. Finally, we present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine that mimics the data-generating process of data containing treatments and optimises for our desiderata. We empirically demonstrate that STEAM achieves state-of-the-art performance across our metrics as compared to existing generative models, particularly as the complexity of the true data-generating process increases.
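The three desiderata in the abstract correspond to a factorisation of the joint distribution into P(X), P(W | X), and P(Y | X, W). The sketch below illustrates that idea on toy data using NumPy only. It is not the STEAM implementation: the Gaussian covariate model, the logistic treatment model, the per-arm linear outcome models, the toy data, and the names `fit_sequential_generator` and `sample` are all assumptions chosen for illustration.

```python
import numpy as np


def fit_sequential_generator(X, W, Y, n_steps=500, lr=0.1):
    """Fit three toy components: P(X), P(W | X), and P(Y | X, W)."""
    n, d = X.shape
    # (i) Covariate distribution: a multivariate Gaussian (illustrative stand-in).
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False)) + 1e-6 * np.eye(d)
    # (ii) Treatment assignment: logistic regression fit by gradient descent.
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend an intercept column
    beta = np.zeros(d + 1)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        beta -= lr * Xb.T @ (p - W) / n
    # (iii) Outcome generation: one least-squares model per treatment arm.
    outcome = {}
    for w in (0, 1):
        mask = W == w
        coef, *_ = np.linalg.lstsq(Xb[mask], Y[mask], rcond=None)
        resid = Y[mask] - Xb[mask] @ coef
        outcome[w] = (coef, resid.std())
    return mu, cov, beta, outcome


def sample(model, n, rng):
    """Draw synthetic (X, W, Y) in the same order as the assumed causal process."""
    mu, cov, beta, outcome = model
    X = rng.multivariate_normal(mu, cov, size=n)
    Xb = np.hstack([np.ones((n, 1)), X])
    p = 1.0 / (1.0 + np.exp(-Xb @ beta))
    W = rng.binomial(1, p)
    Y = np.empty(n)
    for w in (0, 1):
        mask = W == w
        coef, sigma = outcome[w]
        Y[mask] = Xb[mask] @ coef + rng.normal(0.0, sigma, size=mask.sum())
    return X, W, Y


# Toy "real" data (hypothetical): confounded treatment, true effect of 2.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
W = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
Y = X @ np.array([1.0, -0.5, 0.2]) + 2.0 * W + rng.normal(scale=0.1, size=2000)

model = fit_sequential_generator(X, W, Y)
Xs, Ws, Ys = sample(model, 2000, rng)
```

Sampling covariates first, then treatment given covariates, then outcome given both, keeps each of the three mechanisms as a separately fitted (and separately checkable) component, rather than relying on one joint model to capture all of them at once.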
Related papers
- A Reinforcement Learning Approach to Synthetic Data Generation [8.293402602656736]
We introduce RLSyn, a novel framework that models the data generator as a policy over patient records. We benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across privacy, utility, and fidelity evaluations.
arXiv Detail & Related papers (2025-12-24T19:26:37Z) - Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI [0.841508985473488]
We propose a framework for synthetic biomedical time-series data generation based on advanced forecasting models. These synthetic datasets preserve essential temporal and spectral properties of real data.
arXiv Detail & Related papers (2025-10-06T09:32:10Z) - Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective [61.284843894545475]
Complex algorithms for treatment effect estimation are ineffective when handling insufficiently labeled training sets. We propose FCCM, which transforms the optimization objective into the Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets.
arXiv Detail & Related papers (2025-05-08T13:42:00Z) - TarDiff: Target-Oriented Diffusion Guidance for Synthetic Electronic Health Record Time Series Generation [26.116599951658454]
Time-series generation is crucial for advancing clinical machine learning models. We argue that fidelity to observed data alone does not guarantee better model performance. We propose TarDiff, a novel target-oriented diffusion framework that integrates task-specific influence guidance.
arXiv Detail & Related papers (2025-04-24T14:36:10Z) - Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments [42.8983261737774]
We investigate the efficacy of data augmentation techniques leveraging SoTA AI-based methods for synthetic data generation.
Inspired by practical and experimental motivations, we explore fusion strategies of real and synthetic data to improve forecasting models.
arXiv Detail & Related papers (2024-06-07T12:36:31Z) - The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Evaluation of the Synthetic Electronic Health Records [3.255030588361125]
This work outlines two metrics called Similarity and Uniqueness for sample-wise assessment of synthetic datasets.
We demonstrate the proposed notions with several state-of-the-art generative models to synthesise Cystic Fibrosis (CF) patients' electronic health records.
arXiv Detail & Related papers (2022-10-16T22:46:08Z)