Synthetic Dataset Generation of Driver Telematics
- URL: http://arxiv.org/abs/2102.00252v1
- Date: Sat, 30 Jan 2021 15:52:56 GMT
- Title: Synthetic Dataset Generation of Driver Telematics
- Authors: Banghee So, Jean-Philippe Boucher, Emiliano A. Valdez
- Abstract summary: This article describes techniques employed in the production of a synthetic dataset of driver telematics emulated from a similar real insurance dataset.
It follows a three-stage process using machine learning algorithms.
The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article describes techniques employed in the production of a synthetic
dataset of driver telematics emulated from a similar real insurance dataset.
The generated synthetic dataset has 100,000 policies that include observations
of drivers' claims experience together with associated classical risk
variables and telematics-related variables. This work aims to produce a
resource that can be used to advance models for assessing risk in usage-based
insurance. It follows a three-stage process using machine learning algorithms.
The first stage is simulating values for the number of claims as multiple
binary classifications applying feedforward neural networks. The second stage
is simulating values for aggregated amount of claims as regression using
feedforward neural networks, with number of claims included in the set of
feature variables. In the final stage, a synthetic portfolio of the space of
feature variables is generated applying an extended $\texttt{SMOTE}$ algorithm.
The resulting dataset is evaluated by comparing the synthetic and real datasets
when Poisson and gamma regression models are fitted to the respective data.
Other visualizations and data summaries produce remarkably similar
statistics between the two datasets. We hope that researchers interested in
obtaining telematics datasets to calibrate models or learning algorithms will
find our work valuable.
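The final stage described in the abstract relies on an extended SMOTE algorithm, whose details are not given on this page. As a rough illustration only, the sketch below shows the basic SMOTE idea that such an extension would build on: each synthetic row is a random interpolation between a real row and one of its nearest neighbours. The function name `smote_like_sample`, its parameters, and the restriction to continuous features are assumptions made for this example, not the authors' method.

```python
import numpy as np


def smote_like_sample(X, n_new, k=5, seed=0):
    """Hypothetical helper: draw n_new synthetic rows by interpolating
    between a randomly chosen real row and one of its k nearest neighbours.

    Assumes all columns are continuous and that k < number of rows.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances; a row is never its own neighbour.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbours = np.argsort(d2, axis=1)[:, :k]  # shape (n, k)

    base = rng.integers(0, n, size=n_new)   # real row each new point starts from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to use
    mate = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # interpolation weight in [0, 1)

    # Each synthetic row lies on the segment between a row and its neighbour.
    return X[base] + gap * (X[mate] - X[base])


# Toy usage: 200 made-up "real" rows with 3 continuous features.
real = np.random.default_rng(1).normal(size=(200, 3))
synthetic = smote_like_sample(real, n_new=1000)
print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real rows, the generated values stay within the observed range of each feature. Handling categorical or integer-valued variables would need additional rules that this sketch does not attempt.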
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, resulting in the best overall score.
Our proposed random projection based framework can generate synthetic data with the highest accuracy score, and scales the fastest.
arXiv Detail & Related papers (2023-12-09T02:04:25Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data-driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z) - CARLA-GeAR: a Dataset Generator for a Systematic Evaluation of Adversarial Robustness of Vision Models [61.68061613161187]
This paper presents CARLA-GeAR, a tool for the automatic generation of synthetic datasets for evaluating the robustness of neural models against physical adversarial patches.
The tool is built on the CARLA simulator, using its Python API, and allows the generation of datasets for several vision tasks in the context of autonomous driving.
The paper presents an experimental study to evaluate the performance of some defense methods against such attacks, showing how the datasets generated with CARLA-GeAR might be used in future work as a benchmark for adversarial defense in the real world.
arXiv Detail & Related papers (2022-06-09T09:17:38Z) - Learning Summary Statistics for Bayesian Inference with Autoencoders [58.720142291102135]
We use the inner dimension of deep neural network based Autoencoders as summary statistics.
To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information that has been used to generate the training data.
arXiv Detail & Related papers (2022-01-28T12:00:31Z) - Bayesian Topic Regression for Causal Inference [3.9082355007261427]
Causal inference using observational text data is becoming increasingly popular in many research areas.
This paper presents the Bayesian Topic Regression model that uses both text and numerical information to model an outcome variable.
arXiv Detail & Related papers (2021-09-11T16:40:43Z) - MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning [1.9852463786440129]
We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the autocorrelated data are from the real domain.
In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z) - Towards Synthetic Multivariate Time Series Generation for Flare Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z) - Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.