Related papers: Amputation-imputation based generation of synthetic tabular data for ratemaking

Amputation-imputation based generation of synthetic tabular data for ratemaking

URL: http://arxiv.org/abs/2509.02171v1
Date: Tue, 02 Sep 2025 10:23:04 GMT
Title: Amputation-imputation based generation of synthetic tabular data for ratemaking
Authors: Yevhen Havrylenko, Meelis Käärik, Artur Tuttar,
Abstract summary: Actuarial ratemaking depends on high-quality data, yet access to such data is often limited by the cost of obtaining new data, privacy concerns, etc.<n>In this paper, we explore synthetic-data generation as a potential solution to these issues.<n>We present a comparative study using an open-source dataset and evaluating MICE-based models against other generative models like Variational Autoencoders and Conditional Tabular Generative Adversarial Networks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Actuarial ratemaking depends on high-quality data, yet access to such data is often limited by the cost of obtaining new data, privacy concerns, etc. In this paper, we explore synthetic-data generation as a potential solution to these issues. In addition to discussing generative methods previously studied in the actuarial literature, we introduce to the insurance community another approach based on Multiple Imputation by Chained Equations (MICE). We present a comparative study using an open-source dataset and evaluating MICE-based models against other generative models like Variational Autoencoders and Conditional Tabular Generative Adversarial Networks. We assess how well synthetic data preserves the original marginal distributions of variables as well as the multivariate relationships among covariates. We also investigate the consistency between Generalized Linear Models (GLMs) trained on synthetic data with GLMs trained on the original data. Furthermore, we assess the ease of use of each generative approach and study the impact of augmenting original data with synthetic data on the performance of GLMs for predicting claim counts. Our results highlight the potential of MICE-based methods in creating high-quality tabular data while being more user-friendly than the other methods.

Related papers

Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions.<n>Real-world medical datasets are often difficult to access due to regulatory barriers.<n>We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z)
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks [7.780592134085148]
Gene Likelihood Ratio Attack (Gen-LRA)<n>Gen-LRA formulates its attack by evaluating the influence a test observation has in a surrogate model's estimation of a local likelihood ratio over the synthetic data.<n>Results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data.
arXiv Detail & Related papers (2025-08-28T18:27:40Z)
Using Imperfect Synthetic Data in Downstream Inference Tasks [50.40949503799331]
We introduce a new estimator based on generalized method of moments.<n>We find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z)
Less is More: Adaptive Coverage for Synthetic Training Data [20.136698279893857]
This study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset.<n>Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset.
arXiv Detail & Related papers (2025-04-20T06:45:16Z)
Assessing Generative Models for Structured Data [0.0]
This paper introduces rigorous methods for assessing synthetic data against real data by looking at inter-column dependencies within the data.<n>We find that large language models (GPT-2), both when queried via few-shot prompting, and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data.
arXiv Detail & Related papers (2025-03-26T18:19:05Z)
A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice.<n>Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities.<n>We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z)
Large Language Models for Market Research: A Data-augmentation Approach [3.3199591445531453]
Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks.<n>Recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two.<n>We propose a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis.
arXiv Detail & Related papers (2024-12-26T22:06:29Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs) Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation [0.0]
We provide new algorithms for two tasks relating to heterogeneous datasets: clustering, and synthetic data generation. We demonstrate a novel EM-based clustering algorithm, MMM, that outperforms standard algorithms in determining clusters in synthetic heterogeneous data. We also demonstrate a synthetic data generation algorithm, MMMsynth, that pre-clusters the input data, and generates cluster-wise synthetic data.
arXiv Detail & Related papers (2023-10-30T11:26:01Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.