Population Synthesis using Incomplete Information
- URL: http://arxiv.org/abs/2510.00859v1
- Date: Wed, 01 Oct 2025 13:09:14 GMT
- Title: Population Synthesis using Incomplete Information
- Authors: Tanay Rastogi, Daniel Jonsson, Anders Karlström,
- Abstract summary: The paper presents a population synthesis model that utilizes the Wasserstein Generative-Adversarial Network (WGAN) for training on incomplete microsamples.<n>By using a mask matrix to represent missing values, the study proposes a WGAN training algorithm that lets the model learn from a training dataset that has some missing information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a population synthesis model that utilizes the Wasserstein Generative-Adversarial Network (WGAN) for training on incomplete microsamples. By using a mask matrix to represent missing values, the study proposes a WGAN training algorithm that lets the model learn from a training dataset that has some missing information. The proposed method aims to address the challenge of missing information in microsamples on one or more attributes due to privacy concerns or data collection constraints. The paper contrasts WGAN models trained on incomplete microsamples with those trained on complete microsamples, creating a synthetic population. We conducted a series of evaluations of the proposed method using a Swedish national travel survey. We validate the efficacy of the proposed method by generating synthetic populations from all the models and comparing them to the actual population dataset. The results from the experiments showed that the proposed methodology successfully generates synthetic data that closely resembles a model trained with complete data as well as the actual population. The paper contributes to the field by providing a robust solution for population synthesis with incomplete data, opening avenues for future research, and highlighting the potential of deep generative models in advancing population synthesis capabilities.
Related papers
- Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions.<n>Real-world medical datasets are often difficult to access due to regulatory barriers.<n>We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z) - Using Imperfect Synthetic Data in Downstream Inference Tasks [50.40949503799331]
We introduce a new estimator based on generalized method of moments.<n>We find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - Generating Feasible and Diverse Synthetic Populations Using Diffusion Models [5.689443449061003]
Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations.<n>Deep generative models can potentially synthesize possible attribute combinations that present in the actual population but do not exist in the sample data.<n>In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population.
arXiv Detail & Related papers (2025-08-06T03:11:27Z) - Less is More: Adaptive Coverage for Synthetic Training Data [20.136698279893857]
This study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset.<n>Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset.
arXiv Detail & Related papers (2025-04-20T06:45:16Z) - A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice.<n>Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities.<n>We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z) - Group Distributionally Robust Dataset Distillation with Risk Minimization [17.05513836324578]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD.<n>We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups.
arXiv Detail & Related papers (2024-02-07T09:03:04Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - The Big Data Myth: Using Diffusion Models for Dataset Generation to
Train Deep Detection Models [0.15469452301122172]
This study presents a framework for the generation of synthetic datasets by fine-tuning stable diffusion models.
The results of this study reveal that the object detection models trained on synthetic data perform similarly to the baseline model.
arXiv Detail & Related papers (2023-06-16T10:48:52Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Copula-based transferable models for synthetic population generation [1.370096215615823]
Population synthesis involves generating synthetic yet realistic representations of a target population of micro-agents.
Traditional methods, often reliant on target population samples, face limitations due to high costs and small sample sizes.
We propose a novel framework based on copulas to generate synthetic data for target populations where only empirical marginal distributions are known.
arXiv Detail & Related papers (2023-02-17T23:58:14Z) - Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.