Combining propensity score methods with variational autoencoders for
generating synthetic data in presence of latent sub-groups
- URL: http://arxiv.org/abs/2312.07781v1
- Date: Tue, 12 Dec 2023 22:49:24 GMT
- Title: Combining propensity score methods with variational autoencoders for
generating synthetic data in presence of latent sub-groups
- Authors: Kiana Farhadyar, Federico Bonofiglio, Maren Hackenberg, Daniela
Zoeller, Harald Binder
- Abstract summary: Heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and reflected only in properties of distributions, such as bimodality or skewness.
We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In settings requiring synthetic data generation based on a clinical cohort,
e.g., due to data protection regulations, heterogeneity across individuals
might be a nuisance that we need to control or faithfully preserve. The sources
of such heterogeneity might be known, e.g., as indicated by sub-groups labels,
or might be unknown and thus reflected only in properties of distributions,
such as bimodality or skewness. We investigate how such heterogeneity can be
preserved and controlled when obtaining synthetic data from variational
autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a
low-dimensional latent representation. To faithfully reproduce unknown
heterogeneity reflected in marginal distributions, we propose to combine VAEs
with pre-transformations. For dealing with known heterogeneity due to
sub-groups, we complement VAEs with models for group membership, specifically
from propensity score regression. The evaluation is performed with a realistic
simulation design that features sub-groups and challenging marginal
distributions. The proposed approach faithfully recovers the latter, compared
to synthetic data approaches that focus purely on marginal distributions.
Propensity scores add complementary information, e.g., when visualized in the
latent space, and enable sampling of synthetic data with or without sub-group
specific characteristics. We also illustrate the proposed approach with real
data from an international stroke trial that exhibits considerable distribution
differences between study sites, in addition to bimodality. These results
indicate that describing heterogeneity by statistical approaches, such as
propensity score regression, might be more generally useful for complementing
generative deep learning for obtaining synthetic data that faithfully reflects
structure from clinical cohorts.
Related papers
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z) - Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models [14.651592234678722]
Current diffusion models tend to inherit bias in the training dataset and generate biased synthetic data.
We introduce a novel model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes.
Our method effectively mitigates bias in training data while maintaining the quality of the generated samples.
arXiv Detail & Related papers (2024-04-12T06:08:43Z) - Predictive Heterogeneity: Measures and Applications [26.85283526483783]
We propose the emphusable predictive heterogeneity, which takes into account the model capacity and computational constraints.
We show that it can be reliably estimated from finite data with probably approximately correct (PAC) bounds.
Empirically, the explored heterogeneity provides insights for sub-population divisions in income prediction, crop yield prediction and image classification tasks.
arXiv Detail & Related papers (2023-04-01T12:20:06Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - Heterogeneous Datasets for Federated Survival Analysis Simulation [6.489759672413373]
This work proposes a novel technique for constructing realistic heterogeneous datasets by starting from existing non-federated datasets in a reproducible way.
Specifically, we provide two novel dataset-splitting algorithms based on the Dirichlet distribution to assign each data sample to a carefully chosen client.
The implementation of the proposed methods is publicly available in favor of and to encourage common practices to simulate federated environments for survival analysis.
arXiv Detail & Related papers (2023-01-28T11:37:07Z) - Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular
data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z) - Scalable Regularised Joint Mixture Models [2.0686407686198263]
In many applications, data can be heterogeneous in the sense of spanning latent groups with different underlying distributions.
We propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels.
The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large.
arXiv Detail & Related papers (2022-05-03T13:38:58Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Modelling Heterogeneity Using Bayesian Structured Sparsity [0.0]
How to estimate the effect of some variable differing across observations is a key question in political science.
This paper allows a common way of simplifying complex phenomenon (placing observations with similar effects into discrete groups) to be integrated into regression analysis.
arXiv Detail & Related papers (2021-03-29T19:54:25Z) - Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.