When stakes are high: balancing accuracy and transparency with
Model-Agnostic Interpretable Data-driven suRRogates
- URL: http://arxiv.org/abs/2007.06894v2
- Date: Thu, 10 Dec 2020 17:44:03 GMT
- Title: When stakes are high: balancing accuracy and transparency with
Model-Agnostic Interpretable Data-driven suRRogates
- Authors: Roel Henckaerts and Katrien Antonio and Marie-Pier Côté
- Abstract summary: Highly regulated industries, like banking and insurance, ask for transparent decision-making algorithms.
We present a procedure to develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr).
Knowledge is extracted from a black box via partial dependence effects.
This results in a segmentation of the feature space with automatic variable selection.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Highly regulated industries, like banking and insurance, ask for transparent
decision-making algorithms. At the same time, competitive markets are pushing
for the use of complex black box models. We therefore present a procedure to
develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr) suited
for structured tabular data. Knowledge is extracted from a black box via
partial dependence effects. These are used to perform smart feature engineering
by grouping variable values. This results in a segmentation of the feature
space with automatic variable selection. A transparent generalized linear model
(GLM) is fit to the features in categorical format and their relevant
interactions. We demonstrate our R package maidrr with a case study on general
insurance claim frequency modeling for six publicly available datasets. Our
maidrr GLM closely approximates a gradient boosting machine (GBM) black box and
outperforms both a linear and tree surrogate as benchmarks.
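To make the pipeline concrete, here is a minimal R sketch of the general idea, not the API of the maidrr package itself: fit a GBM black box, extract a partial dependence effect, group feature values with similar effects, and fit a Poisson GLM to the grouped feature. The simulated data, variable names and the k-means grouping below are illustrative assumptions; the package's own grouping, tuning and interaction-selection steps differ.

```r
# Minimal sketch (not the maidrr package API): surrogate GLM built from
# GBM partial dependence effects, on simulated claim frequency data.
library(gbm)   # gradient boosting machine
library(pdp)   # partial dependence functions

set.seed(1)
n <- 5000
dat <- data.frame(
  ageph = sample(18:90, n, replace = TRUE),   # policyholder age (assumed name)
  power = sample(30:200, n, replace = TRUE)   # engine power (assumed name)
)
dat$nclaims <- rpois(n, lambda = exp(-2 + 0.01 * (60 - abs(dat$ageph - 45))))

# 1) Black box: Poisson GBM for claim counts
gbm_fit <- gbm(nclaims ~ ageph + power, data = dat,
               distribution = "poisson", n.trees = 200,
               interaction.depth = 2, shrinkage = 0.05)

# 2) Partial dependence effect of ageph, evaluated at every observed value
pd_age <- pdp::partial(gbm_fit, pred.var = "ageph",
                       pred.grid = data.frame(ageph = sort(unique(dat$ageph))),
                       train = dat, n.trees = 200)

# 3) Group ages with similar partial dependence values
#    (plain k-means as a stand-in for maidrr's own grouping of variable values)
grp  <- kmeans(pd_age$yhat, centers = 5, nstart = 10)$cluster
bins <- data.frame(ageph = pd_age$ageph, ageph_grp = factor(grp))
dat  <- merge(dat, bins, by = "ageph")

# 4) Transparent surrogate: Poisson GLM on the grouped (categorical) feature
glm_fit <- glm(nclaims ~ ageph_grp, data = dat, family = poisson())
summary(glm_fit)
```

In the full procedure every feature is grouped in this way, features whose effect collapses into a single group drop out (the automatic variable selection mentioned above), and relevant interactions are added to the GLM.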
Related papers
- Can Moran Eigenvectors Improve Machine Learning of Spatial Data? Insights from Synthetic Data Validation [2.9388890036358104]
This paper examines the effectiveness of using Moran Eigenvectors as additional spatial features in machine learning models.
We generate synthetic datasets with known processes involving spatially varying and nonlinear effects across two different geometries.
Results show that machine learning models using only location coordinates achieve better accuracies than eigenvector-based approaches across various experiments and datasets.
arXiv Detail & Related papers (2025-04-16T19:31:42Z) - Explainable Boosting Machine for Predicting Claim Severity and Frequency in Car Insurance [0.0]
We introduce an Explainable Boosting Machine (EBM) model that combines intrinsically interpretable characteristics and high prediction performance.
We implement this approach on car insurance frequency and severity data and extensively compare the performance of this approach with classical competitors.
arXiv Detail & Related papers (2025-03-27T09:59:45Z) - Debiased Prompt Tuning in Vision-Language Model without Annotations [14.811475313694041]
Vision-Language Models (VLMs) may suffer from the problem of spurious correlations.
By leveraging pseudo-spurious attribute annotations, we propose a method to automatically adjust the training weights of different groups.
Our approach efficiently improves the worst-group accuracy on CelebA, Waterbirds, and MetaShift datasets.
arXiv Detail & Related papers (2025-03-11T12:24:54Z) - MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model [75.750699619993]
We propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation.
Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously.
arXiv Detail & Related papers (2024-11-29T07:00:18Z) - SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation [55.87169702896249]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift.
We propose a framework to evaluate DA methods and present a fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment.
Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
arXiv Detail & Related papers (2024-07-16T12:52:29Z) - Identifying Light-curve Signals with a Deep Learning Based Object
Detection Algorithm. II. A General Light Curve Classification Framework [0.0]
We present a novel deep learning framework for classifying light curves using a weakly supervised object detection model.
Our framework identifies the optimal windows for both light curves and power spectra automatically, and zooms in on their corresponding data.
We train our model on datasets obtained from both space-based and ground-based multi-band observations of variable stars and transients.
arXiv Detail & Related papers (2023-11-14T11:08:34Z) - CELDA: Leveraging Black-box Language Model as Enhanced Classifier
without Labels [14.285609493077965]
Clustering-enhanced Linear Discriminative Analysis (CELDA) is a novel approach that improves text classification accuracy with a very weak supervision signal.
Our framework draws a precise decision boundary without accessing the weights or gradients of the LM or the data labels.
arXiv Detail & Related papers (2023-06-05T08:35:31Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - GMMSeg: Gaussian Mixture based Generative Semantic Segmentation Models [74.0430727476634]
We propose a new family of segmentation models that rely on a dense generative classifier for the joint distribution p(pixel feature, class).
With a variety of segmentation architectures and backbones, GMMSeg outperforms the discriminative counterparts on closed-set datasets.
GMMSeg even performs well on open-world datasets.
arXiv Detail & Related papers (2022-10-05T05:20:49Z) - Interpreting Black-box Machine Learning Models for High Dimensional
Datasets [40.09157165704895]
We train a black-box model on a high-dimensional dataset to learn the embeddings on which the classification is performed.
We then approximate the behavior of the black-box model by means of an interpretable surrogate model on the top-k feature space.
Our approach outperforms state-of-the-art methods like TabNet and XGBoost when tested on different datasets.
arXiv Detail & Related papers (2022-08-29T07:36:17Z) - Self-service Data Classification Using Interactive Visualization and
Interpretable Machine Learning [9.13755431537592]
The Iterative Visual Logical Classifier (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach that combines a new Coordinate Order (COO) algorithm with a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z) - Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z) - Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs; the general identities behind this are recalled after this list.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
arXiv Detail & Related papers (2021-01-06T17:36:26Z) - Interprétabilité des modèles : état des lieux des méthodes et
application à l'assurance (Model interpretability: an overview of methods and an application to insurance) [1.6058099298620423]
Data is the raw material of many models that today make it possible to increase the quality and performance of digital services.
Model users must ensure that models do not discriminate and that their results can also be explained.
The widening range of predictive algorithms leads scientists to be vigilant about how models are used.
arXiv Detail & Related papers (2020-07-25T12:18:07Z) - Semi-Supervised Learning with Normalizing Flows [54.376602201489995]
FlowGMM is an end-to-end approach to generative semi-supervised learning with normalizing flows.
We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data.
arXiv Detail & Related papers (2019-12-30T17:36:33Z)
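As background for the Cauchy-Schwarz Regularized Autoencoder entry above, a short reminder of why the divergence is analytic for Gaussian mixtures. These are standard identities, not the paper's specific constrained objective:

```latex
% Cauchy-Schwarz divergence between two densities p and q
D_{\mathrm{CS}}(p \,\|\, q)
  = -\log \frac{\int p(x)\, q(x)\, \mathrm{d}x}
               {\sqrt{\int p(x)^2\, \mathrm{d}x \, \int q(x)^2\, \mathrm{d}x}} .
% For GMMs p = \sum_i w_i \mathcal{N}(x; \mu_i, \Sigma_i) and
% q = \sum_j v_j \mathcal{N}(x; m_j, S_j), each integral expands into
% pairwise terms of the form
\int \mathcal{N}(x; \mu_i, \Sigma_i)\, \mathcal{N}(x; m_j, S_j)\, \mathrm{d}x
  = \mathcal{N}(\mu_i;\, m_j,\, \Sigma_i + S_j),
% so the whole divergence is available in closed form.
```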
This list is automatically generated from the titles and abstracts of the papers on this site.