Regularized target encoding outperforms traditional methods in
supervised machine learning with high cardinality features
- URL: http://arxiv.org/abs/2104.00629v1
- Date: Thu, 1 Apr 2021 17:21:42 GMT
- Title: Regularized target encoding outperforms traditional methods in
supervised machine learning with high cardinality features
- Authors: Florian Pargent, Florian Pfisterer, Janek Thomas, Bernd Bischl
- Abstract summary: We study techniques that yield numeric representations of categorical variables.
We compare different encoding strategies together with five machine learning algorithms.
Regularized versions of target encoding consistently provided the best results.
- Score: 1.1709030738577393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Because most machine learning (ML) algorithms are designed for numerical
inputs, efficiently encoding categorical variables is a crucial aspect during
data analysis. A frequently encountered problem is that of high-cardinality features, i.e.,
unordered categorical predictor variables with a large number of levels. We
study techniques that yield numeric representations of categorical variables
which can then be used in subsequent ML applications. We focus on the impact of
those techniques on a subsequent algorithm's predictive performance, and -- if
possible -- derive best practices on when to use which technique. We conducted
a large-scale benchmark experiment, where we compared different encoding
strategies together with five ML algorithms (lasso, random forest, gradient
boosting, k-nearest neighbours, support vector machine) using datasets from
regression, binary, and multiclass classification settings. Throughout our
study, regularized versions of target encoding (i.e. using target predictions
based on the feature levels in the training set as a new numerical feature)
consistently provided the best results. Traditional encodings that make
unreasonable assumptions to map levels to integers (e.g. integer encoding) or
to reduce the number of levels (possibly based on target information, e.g. leaf
encoding) before creating binary indicator variables (one-hot or dummy
encoding) were not as effective.
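The abstract describes regularized target encoding only at a high level. Below is a minimal sketch assuming two common regularization devices consistent with that description: additive smoothing of per-level target means toward the global mean, and out-of-fold cross-fitting so that no row is encoded using its own target value. The function names, the smoothing parameter alpha, and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def smoothed_means(y, levels, alpha=10.0):
    """Per-level target means shrunk toward the global mean:
    enc(level) = (n * mean_level + alpha * global_mean) / (n + alpha).
    Rare levels are pulled strongly toward the global mean, one common
    way to regularize encodings of high-cardinality features."""
    global_mean = y.mean()
    stats = y.groupby(levels).agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)


def target_encode(train, col, target, alpha=10.0, n_splits=5, seed=0):
    """Out-of-fold target encoding: each row is encoded with statistics
    computed on the other folds only, so no row ever sees its own
    target value (this limits target leakage and overfitting)."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train):
        fold = train.iloc[fit_idx]
        means = smoothed_means(fold[target], fold[col], alpha)
        encoded.iloc[enc_idx] = train[col].iloc[enc_idx].map(means).to_numpy()
    # Levels never seen in the fitting folds fall back to the global mean.
    return encoded.fillna(train[target].mean())


# Toy usage: a 'city' feature with 200 levels and a weak numeric signal.
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.choice([f"c{i}" for i in range(200)], size=1000)})
df["y"] = rng.normal(size=len(df)) + df["city"].str.len()
df["city_te"] = target_encode(df, "city", "y")
print(df.head())
```

The out-of-fold step is what separates a regularized encoding from naive target encoding, which encodes each row with a statistic that includes the row's own target value and therefore tends to overfit rare levels.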
Related papers
- Label Encoding for Regression Networks [9.386028796990399]
We introduce binary-encoded labels (BEL), which generalizes the application of binary classification to regression.
BEL achieves state-of-the-art accuracies for several regression benchmarks.
arXiv Detail & Related papers (2022-12-04T21:23:36Z)
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- Efficient Syndrome Decoder for Heavy Hexagonal QECC via Machine Learning [1.1156329459915602]
Recent advances have shown that topological codes can be efficiently decoded by deploying machine learning (ML) techniques.
We first propose an ML based decoder for heavy hexagonal code and establish its efficiency in terms of the values of threshold and pseudo-threshold.
We also present a novel rank-based technique for determining equivalent error classes, which is empirically faster than one based on linear search.
arXiv Detail & Related papers (2022-10-18T10:16:14Z)
- Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective [67.45111837188685]
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data.
We experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning.
arXiv Detail & Related papers (2022-06-16T11:44:11Z)
- Variational Sparse Coding with Learned Thresholding [6.737133300781134]
We propose a new approach to variational sparse coding that allows us to learn sparse distributions by thresholding samples.
We first evaluate and analyze our method by training a linear generator, showing that it has superior performance, statistical efficiency, and gradient estimation.
arXiv Detail & Related papers (2022-05-07T14:49:50Z)
- Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximately valid population coverage and near-optimal efficiency within the function class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z)
- PCA-based Category Encoder for Categorical to Numerical Variable Conversion [1.1156827035309407]
Increasing the cardinality of categorical variables might decrease the overall performance of machine learning (ML) algorithms.
This paper presents a novel computational preprocessing method to convert categorical to numerical variables.
The proposed technique achieved the best performance in terms of accuracy and area under the curve (AUC) on high-cardinality categorical variables; a sketch of one plausible reading of this approach appears after this list.
arXiv Detail & Related papers (2021-11-29T12:49:20Z)
- Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View [82.80085730891126]
We provide the first modern, precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
- Fast Few-Shot Classification by Few-Iteration Meta-Learning [173.32497326674775]
We introduce a fast optimization-based meta-learning method for few-shot classification.
Our strategy enables important aspects of the base learner objective to be learned during meta-training.
We perform a comprehensive experimental analysis, demonstrating the speed and effectiveness of our approach.
arXiv Detail & Related papers (2020-10-01T15:59:31Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
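The PCA-based category encoder above is summarized only at a high level; one plausible reading, suggested by the title, is to one-hot encode the high-cardinality column and then project the indicators onto a small number of principal components. The sketch below assumes that reading and uses standard scikit-learn components; it is not the paper's exact method, and the number of components is an illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot expand the categorical column, then project the indicators
# onto a few principal components to get dense numeric features.
# handle_unknown="ignore" maps unseen test levels to the all-zero
# indicator row instead of raising an error.
encoder = make_pipeline(
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),  # `sparse=False` in scikit-learn < 1.2
    PCA(n_components=8),  # illustrative; tune on validation data
)

rng = np.random.default_rng(0)
X_train = rng.choice([f"level_{i}" for i in range(500)], size=(2000, 1))
X_test = rng.choice([f"level_{i}" for i in range(600)], size=(100, 1))

Z_train = encoder.fit_transform(X_train)  # shape (2000, 8)
Z_test = encoder.transform(X_test)        # unseen levels become zero rows pre-PCA
print(Z_train.shape, Z_test.shape)
```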
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.