Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM
- URL: http://arxiv.org/abs/2307.07771v3
- Date: Tue, 18 Jun 2024 12:09:18 GMT
- Authors: Banghee So,
- Abstract summary: CatBoost is the best library for developing auto claim frequency models based on predictive performance.
We propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between inflation probability $p$ and distribution mean $\mu$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The property and casualty (P&C) insurance industry faces challenges in developing claim predictive models due to the highly right-skewed distribution of positive claims with excess zeros. To address this, actuarial science researchers have employed "zero-inflated" models that combine a traditional count model and a binary model. This paper investigates the use of boosting algorithms to process insurance claim data, including zero-inflated telematics data, to construct claim frequency models. Three popular gradient boosting libraries - XGBoost, LightGBM, and CatBoost - are evaluated and compared to determine the most suitable library for training insurance claim data and fitting actuarial frequency models. Through a comprehensive analysis of two distinct datasets, it is determined that CatBoost is the best for developing auto claim frequency models based on predictive performance. Furthermore, we propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between inflation probability $p$ and distribution mean $\mu$, and find that it outperforms others depending on data characteristics. This model enables us to take advantage of particular CatBoost tools, which makes it easier and more convenient to investigate the effects and interactions of various risk features on the frequency model when using telematics data.
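For readers unfamiliar with the "zero-inflated" construction mentioned in the abstract, the following is a minimal sketch of a zero-inflated Poisson (ZIP) distribution, in which a count is zero with extra probability $p$ and otherwise follows a Poisson distribution with mean $\mu$. The function names (`zip_pmf`, `zip_nll`, `zip_mean`) are hypothetical and chosen for illustration. The sketch treats $p$ and $\mu$ as two free constants; it does not reproduce the paper's specific assumptions linking $p$ to $\mu$, nor its boosted-tree fitting procedure.

```python
import math


def zip_pmf(k: int, p: float, mu: float) -> float:
    """P(Y = k) for a zero-inflated Poisson with inflation probability p and Poisson mean mu > 0."""
    # Poisson probability computed on the log scale for numerical stability.
    poisson = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    if k == 0:
        # A zero arises either from the inflation component or from the Poisson component.
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson


def zip_nll(counts, p: float, mu: float) -> float:
    """Average negative log-likelihood of observed counts under the ZIP model."""
    return -sum(math.log(zip_pmf(k, p, mu)) for k in counts) / len(counts)


def zip_mean(p: float, mu: float) -> float:
    """Expected claim count: only the non-inflated share (1 - p) contributes mu."""
    return (1.0 - p) * mu


if __name__ == "__main__":
    claims = [0, 0, 0, 0, 1, 0, 2, 0]  # illustrative counts, not real data
    print(zip_nll(claims, p=0.3, mu=0.5))
    print(zip_mean(p=0.3, mu=0.5))
```

In a boosting setting, a loss of this form would typically be evaluated per policy, with $\mu$ (and possibly $p$) produced by the tree ensemble rather than fixed as constants.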
Related papers
- Enhancing Crash Frequency Modeling Based on Augmented Multi-Type Data by Hybrid VAE-Diffusion-Based Generative Neural Networks [13.402051372401822]
A key challenge in crash frequency modelling is the prevalence of excessive zero observations.
We propose a hybrid VAE-Diffusion neural network, designed to reduce zero observations.
We assess the synthetic data quality generated by this model through metrics like similarity, accuracy, diversity, and structural consistency.
arXiv Detail & Related papers (2025-01-17T07:53:27Z)
- From Point to probabilistic gradient boosting for claim frequency and severity prediction [1.3812010983144802]
We present a unified notation for, and contrast, all the existing point and probabilistic gradient boosting algorithms for decision trees.
We compare their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high-cardinality) categorical variables.
arXiv Detail & Related papers (2024-12-19T14:50:10Z)
- Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics [0.8287206589886881]
We modify the Tweedie regression model to address its limitations in modeling aggregate claims for various types of insurance.
Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods.
Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions.
arXiv Detail & Related papers (2024-06-23T20:03:55Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- Quantile Extreme Gradient Boosting for Uncertainty Quantification [1.7685947618629572]
Extreme Gradient Boosting (XGBoost) is one of the most popular machine learning (ML) methods.
We propose enhancements to XGBoost whereby a modified quantile regression is used as the objective function to estimate uncertainty (QXGBoost).
Our proposed method had comparable or better performance than the uncertainty estimates generated for regular and quantile light gradient boosting.
arXiv Detail & Related papers (2023-04-23T19:46:19Z)
- Bayesian CART models for insurance claims frequency [0.0]
Classification and regression trees (CARTs) and their ensembles have gained popularity in the actuarial literature.
We introduce Bayesian CART models for insurance pricing, with a particular focus on claims frequency modelling.
Some simulations and real insurance data will be discussed to illustrate the applicability of these models.
arXiv Detail & Related papers (2023-03-03T13:48:35Z)
- Less is More: Mitigate Spurious Correlations for Open-Domain Dialogue Response Generation Models by Causal Discovery [52.95935278819512]
We conduct the first study on spurious correlations for open-domain response generation models based on a corpus CGDIALOG curated in our work.
Inspired by causal discovery algorithms, we propose a novel model-agnostic method for training and inference of response generation model.
arXiv Detail & Related papers (2023-03-02T06:33:48Z)
- Adaptive LASSO estimation for functional hidden dynamic geostatistical model [69.10717733870575]
We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hidden dynamic geostatistical models (f-HDGM).
The algorithm is based on iterative optimisation and uses an adaptive least absolute shrinkage and selection operator (LASSO) penalty function, wherein the weights are obtained by the unpenalised f-HDGM maximum-likelihood estimators.
arXiv Detail & Related papers (2022-08-10T19:17:45Z)
- Learning Summary Statistics for Bayesian Inference with Autoencoders [58.720142291102135]
We use the inner dimension of deep neural network based Autoencoders as summary statistics.
To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information that has been used to generate the training data.
arXiv Detail & Related papers (2022-01-28T12:00:31Z)
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To combine the strengths of both setups, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.