Regularized target encoding outperforms traditional methods in
supervised machine learning with high cardinality features
- URL: http://arxiv.org/abs/2104.00629v1
- Date: Thu, 1 Apr 2021 17:21:42 GMT
- Title: Regularized target encoding outperforms traditional methods in
supervised machine learning with high cardinality features
- Authors: Florian Pargent, Florian Pfisterer, Janek Thomas, Bernd Bischl
- Abstract summary: We study techniques that yield numeric representations of categorical variables.
We compare different encoding strategies together with five machine learning algorithms.
Regularized versions of target encoding consistently provided the best results.
- Score: 1.1709030738577393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Because most machine learning (ML) algorithms are designed for numerical
inputs, efficiently encoding categorical variables is a crucial aspect during
data analysis. A frequently encountered problem is that of high-cardinality features, i.e.,
unordered categorical predictor variables with a large number of levels. We
study techniques that yield numeric representations of categorical variables
which can then be used in subsequent ML applications. We focus on the impact of
those techniques on a subsequent algorithm's predictive performance, and -- if
possible -- derive best practices on when to use which technique. We conducted
a large-scale benchmark experiment, where we compared different encoding
strategies together with five ML algorithms (lasso, random forest, gradient
boosting, k-nearest neighbours, support vector machine) using datasets from
regression, binary, and multiclass classification settings. Throughout our
study, regularized versions of target encoding (i.e. using target predictions
based on the feature levels in the training set as a new numerical feature)
consistently provided the best results. Traditional encodings that make
unreasonable assumptions to map levels to integers (e.g. integer encoding) or
to reduce the number of levels (possibly based on target information, e.g. leaf
encoding) before creating binary indicator variables (one-hot or dummy
encoding) were not as effective.
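The abstract describes regularized target encoding only at a high level. Below is a minimal sketch assuming two common regularization devices consistent with that description: additive smoothing of per-level target means toward the global mean, and out-of-fold cross-fitting so that no row is encoded using its own target value. The function names, the smoothing parameter alpha, and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def smoothed_means(y, levels, alpha=10.0):
    """Per-level target means shrunk toward the global mean:
    enc(level) = (n * mean_level + alpha * global_mean) / (n + alpha).
    Rare levels are pulled strongly toward the global mean, one common
    way to regularize encodings of high-cardinality features."""
    global_mean = y.mean()
    stats = y.groupby(levels).agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)


def target_encode(train, col, target, alpha=10.0, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded with statistics
    computed on the other folds only, so no row ever sees its own
    target value (this limits target leakage and overfitting)."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train):
        fold = train.iloc[fit_idx]
        means = smoothed_means(fold[target], fold[col], alpha)
        encoded.iloc[enc_idx] = train[col].iloc[enc_idx].map(means).to_numpy()
    # Levels never seen in the fitting folds fall back to the global mean.
    return encoded.fillna(train[target].mean())


# Toy usage: a 'city' feature with 200 levels and a weak numeric signal.
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.choice([f"c{i}" for i in range(200)], size=1000)})
df["y"] = rng.normal(size=len(df)) + df["city"].str.len()
df["city_te"] = target_encode(df, "city", "y")
print(df.head())
```

The out-of-fold step is what separates a regularized encoding from naive target encoding, which encodes each row with a statistic that includes the row's own target value and therefore tends to overfit rare levels.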
Related papers
- Label Encoding for Regression Networks [9.386028796990399]
We introduce binary-encoded labels (BEL), which generalizes the application of binary classification to regression.
BEL achieves state-of-the-art accuracies for several regression benchmarks.
arXiv Detail & Related papers (2022-12-04T21:23:36Z)
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- Efficient Syndrome Decoder for Heavy Hexagonal QECC via Machine Learning [1.1156329459915602]
Recent advances have shown that topological codes can be efficiently decoded by deploying machine learning (ML) techniques.
We first propose an ML based decoder for heavy hexagonal code and establish its efficiency in terms of the values of threshold and pseudo-threshold.
We also present a novel rank-based technique for determining equivalent error classes, which is empirically faster than one based on linear search.
arXiv Detail & Related papers (2022-10-18T10:16:14Z)
- Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective [67.45111837188685]
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data.
We experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning.
arXiv Detail & Related papers (2022-06-16T11:44:11Z)
- Variational Sparse Coding with Learned Thresholding [6.737133300781134]
We propose a new approach to variational sparse coding that allows us to learn sparse distributions by thresholding samples.
We first evaluate and analyze our method by training a linear generator, showing that it has superior performance, statistical efficiency, and gradient estimation.
arXiv Detail & Related papers (2022-05-07T14:49:50Z)
- Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximately valid population coverage and near-optimal efficiency within the function class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z)
- PCA-based Category Encoder for Categorical to Numerical Variable Conversion [1.1156827035309407]
Increasing the cardinality of categorical variables might decrease the overall performance of machine learning (ML) algorithms.
This paper presents a novel computational preprocessing method to convert categorical to numerical variables.
The proposed technique achieved the best performance in terms of accuracy and area under the curve (AUC) on high-cardinality categorical variables; a sketch of one plausible reading of this approach appears after this list.
arXiv Detail & Related papers (2021-11-29T12:49:20Z)
- Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View [82.80085730891126]
We provide the first modern, precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
- Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
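The PCA-based category encoder above is summarized only at a high level; one plausible reading, suggested by the title, is to one-hot encode the high-cardinality column and then project the indicators onto a small number of principal components. The sketch below assumes that reading and uses standard scikit-learn components; it is not the paper's exact method, and the number of components is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot expand the categorical column, then project the indicators
# onto a few principal components to get dense numeric features.
# handle_unknown="ignore" maps unseen test levels to the all-zero
# indicator row instead of raising an error.
encoder = make_pipeline(
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),  # `sparse=False` in scikit-learn < 1.2
    PCA(n_components=8),  # illustrative; tune on validation data
)

rng = np.random.default_rng(0)
X_train = rng.choice([f"level_{i}" for i in range(500)], size=(2000, 1))
X_test = rng.choice([f"level_{i}" for i in range(600)], size=(100, 1))

Z_train = encoder.fit_transform(X_train)  # shape (2000, 8)
Z_test = encoder.transform(X_test)        # unseen levels become zero rows pre-PCA
print(Z_train.shape, Z_test.shape)
```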
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.