Enhancing Diversity and Feasibility: Joint Population Synthesis from Multi-source Data Using Generative Models
- URL: http://arxiv.org/abs/2602.15270v1
- Date: Tue, 17 Feb 2026 00:02:30 GMT
- Title: Enhancing Diversity and Feasibility: Joint Population Synthesis from Multi-source Data Using Generative Models
- Authors: Farbod Abbasi, Zachary Patterson, Bilal Farooq
- Abstract summary: This study proposes a novel method to simultaneously integrate and synthesize multi-source datasets using a Wasserstein Generative Adversarial Network (WGAN) with gradient penalty. Results show that the proposed joint approach outperforms the sequential baseline, with recall increasing by 7% and precision by 15%. Since synthetic populations serve as a key input for agent-based models (ABM), this multi-source generative approach has the potential to significantly enhance the accuracy and reliability of ABM.
- Score: 4.73459038844245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating realistic synthetic populations is essential for agent-based models (ABM) in transportation and urban planning. Current methods face two major limitations. First, many rely on a single dataset or follow a sequential data fusion and generation process, which means they fail to capture the complex interplay between features. Second, these approaches struggle with sampling zeros (valid but unobserved attribute combinations) and structural zeros (infeasible combinations due to logical constraints), which reduce the diversity and feasibility of the generated data. This study proposes a novel method to simultaneously integrate and synthesize multi-source datasets using a Wasserstein Generative Adversarial Network (WGAN) with gradient penalty. This joint learning method improves both the diversity and feasibility of synthetic data by defining a regularization term (inverse gradient penalty) for the generator loss function. For the evaluation, we implement a unified evaluation metric for similarity, and place special emphasis on measuring diversity and feasibility through recall, precision, and the F1 score. Results show that the proposed joint approach outperforms the sequential baseline, with recall increasing by 7% and precision by 15%. Additionally, the regularization term further improves diversity and feasibility, reflected in a 10% increase in recall and 1% in precision. We assess similarity distributions using a five-metric score. The joint approach performs better overall, and reaches a score of 88.1 compared to 84.6 for the sequential method. Since synthetic populations serve as a key input for ABM, this multi-source generative approach has the potential to significantly enhance the accuracy and reliability of ABM.
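The abstract measures feasibility through precision and diversity through recall over attribute combinations. The sketch below shows one common way such scores can be computed for categorical records; it is an illustration under stated assumptions, not the authors' evaluation code. The function name `precision_recall_f1`, the tuple encoding of a record, and the use of a reference population as ground truth are all hypothetical choices made here. The paper's exact definitions may differ.

```python
from typing import Hashable, Sequence, Tuple

# A record is one agent's attribute combination, e.g. ("adult", "driver").
Record = Tuple[Hashable, ...]


def precision_recall_f1(
    reference: Sequence[Record], generated: Sequence[Record]
) -> Tuple[float, float, float]:
    """Score a generated population against a reference population.

    Assumed definitions (hypothetical, for illustration only):
      precision = share of generated records whose combination occurs in
                  the reference (a proxy for feasibility; misses suggest
                  structural zeros were generated)
      recall    = share of reference records whose combination occurs in
                  the generated set (a proxy for diversity; misses suggest
                  valid combinations were never produced)
    """
    if not reference or not generated:
        raise ValueError("both populations must be non-empty")

    reference_combos = set(reference)
    generated_combos = set(generated)

    precision = sum(rec in reference_combos for rec in generated) / len(generated)
    recall = sum(rec in generated_combos for rec in reference) / len(reference)

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: ("child", "driver") stands in for an infeasible combination,
# and ("adult", "non-driver") for a valid one the generator never produces.
reference_population = [
    ("adult", "driver"),
    ("adult", "non-driver"),
    ("child", "non-driver"),
    ("adult", "driver"),
]
generated_population = [
    ("adult", "driver"),
    ("child", "driver"),
    ("adult", "driver"),
    ("child", "non-driver"),
]

scores = precision_recall_f1(reference_population, generated_population)
```

In this toy case one of four generated records is infeasible and one of four reference records is never reproduced, so a generator that only copies frequent combinations would score well on precision while losing recall.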
Related papers
- Do Generative Metrics Predict YOLO Performance? An Evaluation Across Models, Augmentation Ratios, and Dataset Complexity [43.338311770275745]
We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes. We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split. For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol.
arXiv Detail & Related papers (2026-02-20T03:02:36Z)
- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by Variance Promotion Score (VPS). We release large-scale, carefully curated resources containing 1.6M long CoT cold-start data and 15k RL QA pairs. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
arXiv Detail & Related papers (2025-09-25T14:58:29Z)
- Generating Feasible and Diverse Synthetic Populations Using Diffusion Models [5.689443449061003]
Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations. Deep generative models can potentially synthesize possible attribute combinations that are present in the actual population but do not exist in the sample data. In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population.
arXiv Detail & Related papers (2025-08-06T03:11:27Z)
- Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller).
arXiv Detail & Related papers (2025-07-14T16:19:00Z)
- Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy [0.9999629695552196]
The present work develops and validates a data-driven and interpretable machine-learning framework designed to predict strokes. Ten routinely gathered demographic, lifestyle, and clinical variables were sourced from a public cohort of 4,981 records. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model.
arXiv Detail & Related papers (2025-05-18T21:46:45Z)
- A Large Language Model for Feasible and Diverse Population Synthesis [0.6581049960856515]
We propose a fine-tuning method for large language models (LLMs) that explicitly controls the autoregressive generation process through topological orderings derived from a Bayesian Network (BN). Our approach achieves approximately 95% feasibility, significantly higher than the 80% observed in deep generative models (DGMs). This makes the approach cost-effective and scalable for large-scale applications, such as synthesizing populations in megacities.
arXiv Detail & Related papers (2025-05-07T07:50:12Z)
- Discrete Flow Matching [74.04153927689313]
We present a novel discrete flow paradigm designed specifically for generating discrete data.
Our approach is capable of generating high-quality discrete data in a non-autoregressive fashion.
arXiv Detail & Related papers (2024-07-22T12:33:27Z)
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
- Personalized Federated Learning under Mixture of Distributions [98.25444470990107]
We propose a novel approach to Personalized Federated Learning (PFL), which utilizes Gaussian mixture models (GMM) to fit the input data distributions across diverse clients.
FedGMM possesses an additional advantage of adapting to new clients with minimal overhead, and it also enables uncertainty quantification.
Empirical evaluations on synthetic and benchmark datasets demonstrate the superior performance of our method in both PFL classification and novel sample detection.
arXiv Detail & Related papers (2023-05-01T20:04:46Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.