Quantile Encoder: Tackling High Cardinality Categorical Features in
Regression Problems
- URL: http://arxiv.org/abs/2105.13783v1
- Date: Thu, 27 May 2021 11:56:13 GMT
- Title: Quantile Encoder: Tackling High Cardinality Categorical Features in
Regression Problems
- Authors: Carlos Mougan, David Masip, Jordi Nin, Oriol Pujol
- Abstract summary: We provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile encoder.
Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder.
We also describe how to expand the encoded values by creating a set of features with different quantiles.
- Score: 2.3322477552758234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regression problems have been widely studied in the machine learning
literature, resulting in a plethora of regression models and performance measures.
However, there are few techniques specifically dedicated to solving the problem of
how to incorporate categorical features into regression problems. Usually,
categorical feature encoders are general enough to cover both classification and
regression problems. This lack of specificity results in underperforming
regression models. In this paper, we provide an in-depth analysis of how to tackle
high cardinality categorical features with the quantile encoder. Our proposal
outperforms state-of-the-art encoders, including the traditional statistical mean
target encoder, when considering the Mean Absolute Error, especially in the presence
of long-tailed or skewed distributions. Besides, to deal with possible
overfitting when there are categories with small support, our encoder benefits
from additive smoothing. Finally, we describe how to expand the encoded values
by creating a set of features with different quantiles. This expanded encoder
provides a more informative output about the categorical feature in question,
further boosting the performance of the regression model.
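The encoding described in the abstract can be sketched in a few lines of pandas. This is a minimal illustration of the idea, not the authors' reference implementation: the function names, the smoothing strength `m`, and the default quantile set are all illustrative assumptions.

```python
import pandas as pd

def quantile_encode(df, cat_col, target_col, q=0.5, m=1.0):
    """Encode a categorical column with the per-category target quantile.

    Additive smoothing (strength m, an assumed parameterization) shrinks
    categories with small support toward the global quantile, mitigating
    the overfitting mentioned in the abstract.
    """
    global_q = df[target_col].quantile(q)
    stats = df.groupby(cat_col)[target_col].agg(
        ["count", lambda s: s.quantile(q)]
    )
    stats.columns = ["n", "q"]
    smoothed = (stats["n"] * stats["q"] + m * global_q) / (stats["n"] + m)
    return df[cat_col].map(smoothed)

def quantile_expand(df, cat_col, target_col, quantiles=(0.25, 0.5, 0.75), m=1.0):
    """Expanded encoder: one feature per quantile, as the abstract suggests."""
    return pd.DataFrame(
        {f"{cat_col}_q{q}": quantile_encode(df, cat_col, target_col, q, m)
         for q in quantiles}
    )
```

With `m=0` the encoder reduces to the raw per-category quantile; larger `m` pulls rare categories toward the global statistic, which is the additive-smoothing behavior the abstract describes.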
Related papers
- Generalization bounds for regression and classification on adaptive covering input domains [1.4141453107129398]
We focus on the generalization bound, which serves as an upper limit for the generalization error.
In the case of classification tasks, we treat the target function as a one-hot, a piece-wise constant function, and employ 0/1 loss for error measurement.
arXiv Detail & Related papers (2024-07-29T05:40:08Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Robust Capped lp-Norm Support Vector Ordinal Regression [85.84718111830752]
Ordinal regression is a specialized supervised problem where the labels show an inherent order.
Support Vector Ordinal Regression, as an outstanding ordinal regression model, is widely used in many ordinal regression tasks.
We introduce a new model, Capped $\ell_p$-Norm Support Vector Ordinal Regression (CSVOR), that is robust to outliers.
arXiv Detail & Related papers (2024-04-25T13:56:05Z) - An Ordinal Regression Framework for a Deep Learning Based Severity
Assessment for Chest Radiographs [50.285682227571996]
We propose a framework that divides the ordinal regression problem into three parts: a model, a target function, and a classification function.
We show that the choice of encoding has a strong impact on performance and that the best encoding depends on the chosen weighting of Cohen's kappa.
arXiv Detail & Related papers (2024-02-08T14:00:45Z) - Deep Imbalanced Regression via Hierarchical Classification Adjustment [50.19438850112964]
Regression tasks in computer vision are often formulated into classification by quantizing the target space into classes.
The majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range.
We propose to construct hierarchical classifiers for solving imbalanced regression tasks.
Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks.
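The basic step this summary rests on, turning a continuous target into class labels by quantizing the target space, can be sketched as follows. This only shows the quantization step under an assumed equal-frequency binning scheme; HCA itself builds a hierarchy of classifiers on top of it.

```python
import numpy as np

def quantize_targets(y, n_bins=4):
    """Turn a continuous regression target into class labels by binning.

    Equal-frequency (quantile) bins are an illustrative choice: even a
    head-heavy target distribution then populates every class, which is
    the imbalance issue the summary describes.
    """
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(y, edges), edges
```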
arXiv Detail & Related papers (2023-10-26T04:54:39Z) - Entropy optimized semi-supervised decomposed vector-quantized
variational autoencoder model based on transfer learning for multiclass text
classification and generation [3.9318191265352196]
We propose a semi-supervised discrete latent variable model for multi-class text classification and text generation.
The proposed model employs the concept of transfer learning for training a quantized transformer model.
Experimental results indicate that the proposed model has surpassed the state-of-the-art models remarkably.
arXiv Detail & Related papers (2021-11-10T07:07:54Z) - Learning Debiased and Disentangled Representations for Semantic
Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z) - Non-Autoregressive Translation by Learning Target Categorical Codes [59.840510037250944]
We propose CNAT, which implicitly learns categorical codes as latent variables for non-autoregressive decoding.
Experiment results show that our model achieves comparable or better performance in machine translation tasks.
arXiv Detail & Related papers (2021-03-21T14:12:34Z) - Scalable Variational Gaussian Process Regression Networks [19.699020509495437]
We propose a scalable variational inference algorithm for GPRN.
We tensorize the output space and introduce tensor/matrix-normal variational posteriors to capture the posterior correlations.
We demonstrate the advantages of our method in several real-world applications.
arXiv Detail & Related papers (2020-03-25T16:39:47Z) - Boosting Ridge Regression for High Dimensional Data Classification [0.8223798883838329]
Ridge regression is a well established regression estimator which can be adapted for classification problems.
The closed-form solution which involves inverting the regularised covariance matrix is rather expensive to compute.
In this paper, we consider learning an ensemble of ridge regressors where each regressor is trained in its own randomly projected subspace.
arXiv Detail & Related papers (2020-03-25T09:07:05Z)
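The random-subspace idea in the last summary, training each ridge regressor on its own random projection so that only a small k x k matrix is inverted instead of the full d x d regularized covariance, can be sketched as below. The projection scheme, ensemble size, and averaging rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; here X is k-dimensional, so we invert
    a k x k matrix instead of the full d x d regularized covariance."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def random_subspace_ridge(X, y, n_models=10, k=5, lam=1.0):
    """Train an ensemble of ridge regressors, each in its own randomly
    projected k-dimensional subspace."""
    models = []
    for _ in range(n_models):
        P = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)  # random projection
        w = ridge_fit(X @ P, y, lam)
        models.append((P, w))
    return models

def predict(models, X):
    # Average the ensemble members' predictions.
    return np.mean([X @ P @ w for P, w in models], axis=0)
```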
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.