Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM
- URL: http://arxiv.org/abs/2307.07771v3
- Date: Tue, 18 Jun 2024 12:09:18 GMT
- Title: Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM
- Authors: Banghee So
- Abstract summary: Based on predictive performance, CatBoost is found to be the most suitable library for developing auto claim frequency models.
We propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between the inflation probability $p$ and the distribution mean $\mu$.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The property and casualty (P&C) insurance industry faces challenges in developing claim predictive models due to the highly right-skewed distribution of positive claims with excess zeros. To address this, actuarial science researchers have employed "zero-inflated" models that combine a traditional count model and a binary model. This paper investigates the use of boosting algorithms to process insurance claim data, including zero-inflated telematics data, to construct claim frequency models. Three popular gradient boosting libraries - XGBoost, LightGBM, and CatBoost - are evaluated and compared to determine the most suitable library for training insurance claim data and fitting actuarial frequency models. Through a comprehensive analysis of two distinct datasets, it is determined that CatBoost is the best for developing auto claim frequency models based on predictive performance. Furthermore, we propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between inflation probability $p$ and distribution mean $\mu$, and find that it outperforms others depending on data characteristics. This model enables us to take advantage of particular CatBoost tools, which makes it easier and more convenient to investigate the effects and interactions of various risk features on the frequency model when using telematics data.
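To make the two-part structure concrete, the sketch below shows one way a zero-inflated frequency model can be assembled from stock CatBoost estimators: a binary classifier approximates the inflation probability $p$ and a Poisson-loss regressor approximates the count mean $\mu$. It is a minimal illustration of the general decomposition under assumed column names, not the authors' implementation of the proposed ZIP boosted tree.

```python
# Minimal sketch of the two-part idea behind a zero-inflated frequency model,
# assembled from off-the-shelf CatBoost estimators. This is NOT the paper's
# ZIP boosted tree: the zero indicator below is only a crude stand-in for the
# latent inflation component, and all column names are hypothetical.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, CatBoostRegressor

def fit_two_part_frequency(df: pd.DataFrame, features):
    X, y = df[features], df["claim_count"].to_numpy()

    # Binary part: approximate the inflation probability p via P(zero claims).
    zero_clf = CatBoostClassifier(loss_function="Logloss", verbose=False)
    zero_clf.fit(X, (y == 0).astype(int))

    # Count part: a Poisson loss is available natively in CatBoost.
    count_reg = CatBoostRegressor(loss_function="Poisson", verbose=False)
    count_reg.fit(X, y)
    return zero_clf, count_reg

def predict_frequency(zero_clf, count_reg, X_new: pd.DataFrame) -> np.ndarray:
    # Under a zero-inflated Poisson, E[N] = (1 - p) * mu.
    p = zero_clf.predict_proba(X_new)[:, 1]
    mu = count_reg.predict(X_new, prediction_type="Exponent")  # raw score -> mean scale
    return (1.0 - p) * mu
```

In the paper's model the binary and count parts are linked through explicit assumptions on how $p$ relates to $\mu$; the sketch keeps them independent for brevity.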
Related papers
- Supervised Score-Based Modeling by Gradient Boosting [49.556736252628745]
We propose a Supervised Score-based Model (SSM), which can be viewed as a gradient boosting algorithm combined with score matching.
We provide a theoretical analysis of learning and sampling for SSM to balance inference time and prediction accuracy.
Our model outperforms existing models in both accuracy and inference time.
arXiv Detail & Related papers (2024-11-02T07:06:53Z) - Learning Augmentation Policies from A Model Zoo for Time Series Forecasting [58.66211334969299]
We introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning.
By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance.
arXiv Detail & Related papers (2024-09-10T07:34:19Z) - Zero-Inflated Tweedie Boosted Trees with CatBoost for Insurance Loss Analytics [0.8287206589886881]
We modify the Tweedie regression model to address its limitations in modeling aggregate claims for various types of insurance.
Our recommended approach involves a refined modeling of the zero-claim process, together with the integration of boosting methods.
Our modeling results reveal a marked improvement in model performance, showcasing its potential to deliver more accurate predictions.
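For context on this related entry, CatBoost exposes a Tweedie objective directly; the snippet below is a hedged sketch of a plain Tweedie-loss fit on zero-heavy aggregate claims using synthetic data. The refined zero-claim treatment proposed in the cited paper is not reproduced here, and the variance-power value is illustrative.

```python
# Hedged sketch: a baseline Tweedie-loss CatBoost fit on aggregate claims with
# many exact zeros. Data are synthetic and the variance power is illustrative.
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # illustrative risk features
freq = rng.poisson(lam=0.2, size=1000)              # claim counts, mostly zero
sev = rng.gamma(shape=2.0, scale=500.0, size=1000)  # average severity per policy
y = freq * sev                                      # aggregate loss, zero-inflated

model = CatBoostRegressor(
    loss_function="Tweedie:variance_power=1.5",  # 1 < p < 2: compound Poisson-gamma
    iterations=200,
    verbose=False,
)
model.fit(X, y)
pred = model.predict(X, prediction_type="Exponent")  # back to the mean scale
```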
arXiv Detail & Related papers (2024-06-23T20:03:55Z) - Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
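As a rough illustration of the swap-based test described above, the helper below replaces one part of each paired input with the corresponding part from another example and reports how often a model's prediction flips; `model_predict` and the pair structure are hypothetical placeholders, not the paper's protocol.

```python
# Hedged sketch of the counterfactual-swap idea: replace one part of a paired
# input (e.g., the premise in NLI) with the corresponding part from a different
# example and measure how often the prediction changes.
import random

def attentiveness_rate(examples, model_predict, seed=0):
    """examples: list of (part_a, part_b) pairs; model_predict: any callable
    mapping (part_a, part_b) to a label. Returns the fraction of flips."""
    rng = random.Random(seed)
    flips = 0
    for i, (part_a, part_b) in enumerate(examples):
        j = rng.choice([k for k in range(len(examples)) if k != i])
        swapped_a = examples[j][0]                  # counterpart from another example
        original = model_predict(part_a, part_b)
        counterfactual = model_predict(swapped_a, part_b)
        flips += int(original != counterfactual)    # an attentive model should flip
    return flips / len(examples)
```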
arXiv Detail & Related papers (2023-11-16T06:27:35Z) - Quantile Extreme Gradient Boosting for Uncertainty Quantification [1.7685947618629572]
Extreme Gradient Boosting (XGBoost) is one of the most popular machine learning (ML) methods.
We propose enhancements to XGBoost whereby a modified quantile regression is used as the objective function to estimate uncertainty (QXGBoost).
Our proposed method achieves comparable or better performance than the uncertainty estimates generated by regular and quantile light gradient boosting.
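For orientation, the snippet below builds a prediction interval from plain pinball-loss quantile fits using XGBoost's built-in quantile objective (available in recent releases, which is an assumption on versioning); the smoothed/modified objective that defines QXGBoost in the cited paper is not reproduced.

```python
# Hedged sketch: a 90% prediction interval from two plain quantile fits.
# Assumes an XGBoost release that ships objective="reg:quantileerror" (2.0+).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=2000)

def fit_quantile(alpha):
    # Pinball loss at level alpha; one model per quantile for simplicity.
    model = XGBRegressor(objective="reg:quantileerror", quantile_alpha=alpha,
                         n_estimators=200, max_depth=3)
    model.fit(X, y)
    return model

lower, upper = fit_quantile(0.05), fit_quantile(0.95)
interval = np.column_stack([lower.predict(X), upper.predict(X)])  # 90% band
```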
arXiv Detail & Related papers (2023-04-23T19:46:19Z) - Bayesian CART models for insurance claims frequency [0.0]
Classification and regression trees (CARTs) and their ensembles have gained popularity in the actuarial literature.
We introduce Bayesian CART models for insurance pricing, with a particular focus on claims frequency modelling.
Some simulations and real insurance data will be discussed to illustrate the applicability of these models.
arXiv Detail & Related papers (2023-03-03T13:48:35Z) - Adaptive LASSO estimation for functional hidden dynamic geostatistical model [69.10717733870575]
We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hidden dynamic geostatistical models (f-HD).
The algorithm is based on iterative optimisation and uses an adaptive least absolute shrinkage and selection operator (GMSOLAS) penalty function, wherein the weights are obtained from the unpenalised f-HD maximum-likelihood estimators.
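For orientation, the generic adaptive-LASSO penalised likelihood with data-driven weights has the textbook form
$$\hat{\beta} = \arg\min_{\beta}\Big\{-\ell(\beta) + \lambda \sum_{j}\hat{w}_j\,|\beta_j|\Big\}, \qquad \hat{w}_j = |\tilde{\beta}_j|^{-\gamma},\ \gamma > 0,$$
where $\ell$ is the log-likelihood and $\tilde{\beta}$ is an unpenalised (here, f-HD maximum-likelihood) estimate; the GMSOLAS penalty in the cited paper is a variant of this idea, and the display above is the standard form rather than the paper's exact notation.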
arXiv Detail & Related papers (2022-08-10T19:17:45Z) - Learning Summary Statistics for Bayesian Inference with Autoencoders [58.720142291102135]
We use the inner dimension of deep neural network-based autoencoders as summary statistics.
To create an incentive for the encoder to encode all the parameter-related information but not the noise, we give the decoder access to explicit or implicit information that has been used to generate the training data.
arXiv Detail & Related papers (2022-01-28T12:00:31Z) - X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim to improve data efficiency for both classification and regression setups in deep learning.
To combine the strengths of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z) - Synthetic Dataset Generation of Driver Telematics [0.0]
This article describes techniques employed in the production of a synthetic dataset of driver telematics emulated from a similar real insurance dataset.
It follows a three-stage process using machine learning algorithms.
The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data.
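As a hedged sketch of this evaluation idea, the helper below fits a Poisson frequency GLM and a gamma severity GLM with statsmodels; comparing synthetic to real data then amounts to contrasting the two sets of fitted coefficients. The column names and the real_df/synth_df frames are hypothetical placeholders, not taken from the paper.

```python
# Hedged sketch: fit the same GLMs to real and synthetic data and compare fits.
import statsmodels.api as sm

def fit_frequency_severity(df, features):
    X = sm.add_constant(df[features])
    # Claim frequency: Poisson GLM (log link is the statsmodels default).
    freq = sm.GLM(df["claim_count"], X, family=sm.families.Poisson()).fit()
    # Claim severity on positive claims: gamma GLM with a log link.
    pos = df[df["claim_amount"] > 0]
    Xp = sm.add_constant(pos[features])
    sev = sm.GLM(pos["claim_amount"], Xp,
                 family=sm.families.Gamma(link=sm.families.links.Log())).fit()
    return freq.params, sev.params

# Comparison amounts to contrasting fit_frequency_severity(real_df, features)
# with fit_frequency_severity(synth_df, features).
```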
arXiv Detail & Related papers (2021-01-30T15:52:56Z) - When stakes are high: balancing accuracy and transparency with Model-Agnostic Interpretable Data-driven suRRogates [0.0]
Highly regulated industries, like banking and insurance, ask for transparent decision-making algorithms.
We present a procedure to develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr).
Knowledge is extracted from a black box via partial dependence effects.
This results in a segmentation of the feature space with automatic variable selection.
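As a loose illustration of the partial-dependence extraction step (not the maidrr algorithm itself), the sketch below pulls a PD effect from a generic black-box model with scikit-learn and crudely groups grid points with similar effects; the clustering step is a stand-in for the paper's actual segmentation procedure, and the data are synthetic.

```python
# Hedged sketch: extract a partial dependence (PD) effect from a black box and
# coarsely segment one feature by grouping grid points with similar PD values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
black_box = GradientBoostingRegressor(random_state=0).fit(X, y)

pd_result = partial_dependence(black_box, X, features=[0], grid_resolution=20)
effect = pd_result["average"].ravel()            # PD effect of feature 0 on the grid

# Group grid points whose PD effects are similar into a handful of segments.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    effect.reshape(-1, 1)
)
```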
arXiv Detail & Related papers (2020-07-14T08:10:05Z)