Deep and diverse population synthesis for multi-person households using generative models
- URL: http://arxiv.org/abs/2508.09964v1
- Date: Wed, 13 Aug 2025 17:31:45 GMT
- Title: Deep and diverse population synthesis for multi-person households using generative models
- Authors: Hai Yang, Hongying Wu, Linfei Yuan, Xiyuan Ren, Joseph Y. J. Chow, Jinqin Gao, Kaan Ozbay
- Abstract summary: We apply a novel population synthesis model to generate a synthetic population for the whole New York State. The synthetic population includes nearly 20 million individuals and 7.5 million households. Compared to the PUMS data, the synthetic population provides data that is 17% more diverse.
- Score: 4.321984653683312
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Synthetic populations are increasingly important in numerous areas such as urban and transportation analysis. Traditional methods such as iterative proportional fitting (IPF) are not capable of generating high-quality data when facing high-dimensional datasets. Recent population synthesis methods using deep learning techniques can resolve this curse of dimensionality. However, few controls are placed when using these methods, and few of the methods are used to generate synthetic populations that capture associations among members of one household. In this study, we propose a framework that tackles these issues. The framework uses a novel population synthesis model, called conditional input directed acyclic tabular generative adversarial network (ciDATGAN), as its core, and a basket of methods is employed to enhance the population synthesis performance. We apply the model to generate a synthetic population for the whole New York State as a public resource for researchers and policymakers. The synthetic population includes nearly 20 million individuals and 7.5 million households. The marginals obtained from the synthetic population match the census marginals well while maintaining similar associations among household members to the sample. Compared to the PUMS data, the synthetic population provides data that is 17% more diverse; when compared against a benchmark approach based on Popgen, the proposed method is 13% more diverse. This study provides an approach that encompasses multiple methods to enhance the population synthesis procedure with greater equity- and diversity-awareness.
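For readers unfamiliar with the IPF baseline named in the abstract, the following is a minimal sketch of classic two-dimensional iterative proportional fitting. It is not the paper's method or code: the function name `ipf`, the seed table, and the target marginals are all hypothetical and chosen only for illustration. The sketch also hints at why IPF struggles in high dimensions: the table needs one cell per attribute combination, so its size grows exponentially with the number of attributes.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D seed table until its row and column sums match the targets.

    Hypothetical helper for illustration; assumes the row and column targets
    share the same grand total and that the seed has no all-zero row or column.
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Row step: rescale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [v * factor for v in table[i]]
        # Column step: rescale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns match exactly after the column step, so check the rows.
        row_error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if row_error < tol:
            break
    return table


# Toy example (made-up numbers): a 2x2 sample cross-tabulation fitted to
# marginal totals that both sum to 100.
seed = [[10, 20], [30, 40]]
row_targets = [40, 60]
col_targets = [45, 55]
fitted = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the interaction structure of the seed while matching both sets of marginals. Any attribute combination with a zero count in the seed stays at zero, which is one reason sample-based fitting can miss combinations that exist in the real population.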
Related papers
- Population Synthesis using Incomplete Information [0.0]
The paper presents a population synthesis model that utilizes the Wasserstein Generative-Adversarial Network (WGAN) for training on incomplete microsamples. By using a mask matrix to represent missing values, the study proposes a WGAN training algorithm that lets the model learn from a training dataset that has some missing information.
arXiv Detail & Related papers (2025-10-01T13:09:14Z) - Population-Aligned Persona Generation for LLM-based Social Simulation [58.84363795421489]
We propose a systematic framework for synthesizing high-quality, population-aligned persona sets for social simulation. Our approach begins by leveraging large language models to generate narrative personas from long-term social media data. To address the needs of specific simulation contexts, we introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations.
arXiv Detail & Related papers (2025-09-12T10:43:47Z) - Generating Feasible and Diverse Synthetic Populations Using Diffusion Models [5.689443449061003]
Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations. Deep generative models can potentially synthesize attribute combinations that are present in the actual population but do not exist in the sample data. In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population.
arXiv Detail & Related papers (2025-08-06T03:11:27Z) - A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice. Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z) - Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold [83.18058549195855]
We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depend on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrate along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations.
arXiv Detail & Related papers (2024-08-26T20:05:31Z) - A multi-objective combinatorial optimisation framework for large scale hierarchical population synthesis [1.2233362977312945]
In agent-based simulations, synthetic populations of agents are commonly used to represent the structure, behaviour, and interactions of individuals.
We propose a multi-objective optimisation technique for large-scale population synthesis.
Our approach supports complex hierarchical structures between individuals and households, is scalable to large populations and achieves minimal contingency table reconstruction error.
arXiv Detail & Related papers (2024-07-03T15:01:12Z) - Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z) - UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human
Generation [59.77275587857252]
A holistic human dataset inevitably has insufficient and low-resolution information on local parts.
We propose to use multi-source datasets with various resolution images to jointly learn a high-resolution human generative model.
arXiv Detail & Related papers (2023-09-25T17:58:46Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Synthpop++: A Hybrid Framework for Generating A Country-scale Synthetic Population [0.680303951699936]
Population censuses are costly, time-consuming, and may also raise privacy concerns.
We introduce SynthPop++, which can combine data from multiple real-world surveys to produce a real-scale synthetic population.
Our experimental results show that synthetic population can realistically simulate the population for various administrative units of India.
arXiv Detail & Related papers (2023-04-24T17:27:56Z) - Copula-based transferable models for synthetic population generation [1.370096215615823]
Population synthesis involves generating synthetic yet realistic representations of a target population of micro-agents.
Traditional methods, often reliant on target population samples, face limitations due to high costs and small sample sizes.
We propose a novel framework based on copulas to generate synthetic data for target populations where only empirical marginal distributions are known.
arXiv Detail & Related papers (2023-02-17T23:58:14Z) - Generating Synthetic Population [0.680303951699936]
We provide a method to generate synthetic population at various administrative levels for a country like India.
This synthetic population is created using machine learning and statistical methods applied to survey data such as Census of India 2011, IHDS-II, NSS-68th round, GPW etc.
arXiv Detail & Related papers (2022-09-20T19:31:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.