Synthesizing Property & Casualty Ratemaking Datasets using Generative
Adversarial Networks
- URL: http://arxiv.org/abs/2008.06110v1
- Date: Thu, 13 Aug 2020 21:02:44 GMT
- Title: Synthesizing Property & Casualty Ratemaking Datasets using Generative
Adversarial Networks
- Authors: Marie-Pier Cote, Brian Hartman, Olivier Mercier, Joshua Meyers, Jared
Cummings, Elijah Harmon
- Abstract summary: We show how to design three types of generative adversarial networks (GANs) that can build a synthetic insurance dataset from a confidential original dataset.
For transparency, the approaches are illustrated using a public dataset, the French motor third party liability data.
We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the easiest to use, and the MNCDP-GAN guarantees differential privacy.
- Score: 2.2649197740853677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to confidentiality issues, it can be difficult to access or share
interesting datasets for methodological development in actuarial science, or
other fields where personal data are important. We show how to design three
different types of generative adversarial networks (GANs) that can build a
synthetic insurance dataset from a confidential original dataset. The goal is
to obtain synthetic data that no longer contains sensitive information but
still has the same structure as the original dataset and retains the
multivariate relationships. In order to adequately model the specific
characteristics of insurance data, we use GAN architectures adapted for
multi-categorical data: a Wassertein GAN with gradient penalty (MC-WGAN-GP), a
conditional tabular GAN (CTGAN) and a Mixed Numerical and Categorical
Differentially Private GAN (MNCDP-GAN). For transparency, the approaches are
illustrated using a public dataset, the French motor third party liability
data. We compare the three different GANs on various aspects: ability to
reproduce the original data structure and predictive models, privacy, and ease
of use. We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the
easiest to use, and the MNCDP-GAN guarantees differential privacy.
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - An improved tabular data generator with VAE-GMM integration [9.4491536689161]
We propose a novel Variational Autoencoder (VAE)-based model that addresses limitations of current approaches.
Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture.
We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones.
arXiv Detail & Related papers (2024-04-12T12:31:06Z) - Fake It Till Make It: Federated Learning with Consensus-Oriented
Generation [52.82176415223988]
We propose federated learning with consensus-oriented generation (FedCOG)
FedCOG consists of two key components at the client side: complementary data generation and knowledge-distillation-based model training.
Experiments on classical and real-world FL datasets show that FedCOG consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T18:49:59Z) - Federated Learning Empowered by Generative Content [55.576885852501775]
Federated learning (FL) enables leveraging distributed private data for model training in a privacy-preserving way.
We propose a novel FL framework termed FedGC, designed to mitigate data heterogeneity issues by diversifying private data with generative content.
We conduct a systematic empirical study on FedGC, covering diverse baselines, datasets, scenarios, and modalities.
arXiv Detail & Related papers (2023-12-10T07:38:56Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Rethinking Data Heterogeneity in Federated Learning: Introducing a New
Notion and Standard Benchmarks [65.34113135080105]
We show that not only the issue of data heterogeneity in current setups is not necessarily a problem but also in fact it can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z) - CTAB-GAN: Effective Table Data Synthesizing [7.336728307626645]
We develop CTAB-GAN, a conditional table GAN architecture that can model diverse data types.
We show that CTAB-GAN remarkably resembles the real data for all three types of variables and results into higher accuracy for five machine learning algorithms, by up 17%.
arXiv Detail & Related papers (2021-02-16T18:53:57Z) - imdpGAN: Generating Private and Specific Data with Generative
Adversarial Networks [19.377726080729293]
imdpGAN is an end-to-end framework that simultaneously achieves privacy protection and learns latent representations.
We show that imdpGAN preserves the privacy of the individual data point, and learns latent codes to control the specificity of the generated samples.
arXiv Detail & Related papers (2020-09-29T08:03:32Z) - GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially
Private Generators [74.16405337436213]
We propose Gradient-sanitized Wasserstein Generative Adrial Networks (GS-WGAN)
GS-WGAN allows releasing a sanitized form of sensitive data with rigorous privacy guarantees.
We find our approach consistently outperforms state-of-the-art approaches across multiple metrics.
arXiv Detail & Related papers (2020-06-15T10:01:01Z) - DP-CGAN: Differentially Private Synthetic Data and Label Generation [18.485995499841]
We introduce a Differentially Private Conditional GAN (DP-CGAN) training framework based on a new clipping and perturbation strategy.
We show that DP-CGAN can generate visually and empirically promising results on the MNIST dataset with a single-digit epsilon parameter in differential privacy.
arXiv Detail & Related papers (2020-01-27T11:26:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.