Quantile Encoder: Tackling High Cardinality Categorical Features in
Regression Problems
- URL: http://arxiv.org/abs/2105.13783v1
- Date: Thu, 27 May 2021 11:56:13 GMT
- Title: Quantile Encoder: Tackling High Cardinality Categorical Features in
Regression Problems
- Authors: Carlos Mougan, David Masip, Jordi Nin, Oriol Pujol
- Abstract summary: We provide an in-depth analysis of how to tackle high cardinality categorical features with the quantile encoder.
Our proposal outperforms state-of-the-art encoders, including the traditional statistical mean target encoder.
We also describe how to expand the encoded values by creating a set of features with different quantiles.
- Score: 2.3322477552758234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regression problems have been widely studied in the machine learning
literature, resulting in a plethora of regression models and performance measures.
However, there are few techniques specifically dedicated to solving the problem of
how to incorporate categorical features into regression problems. Usually,
categorical feature encoders are general enough to cover both classification and
regression problems. This lack of specificity results in underperforming
regression models. In this paper, we provide an in-depth analysis of how to tackle
high cardinality categorical features with the quantile encoder. Our proposal
outperforms state-of-the-art encoders, including the traditional statistical mean
target encoder, when considering the Mean Absolute Error, especially in the presence
of long-tailed or skewed distributions. Besides, to deal with possible
overfitting when there are categories with small support, our encoder benefits
from additive smoothing. Finally, we describe how to expand the encoded values
by creating a set of features with different quantiles. This expanded encoder
provides a more informative output about the categorical feature in question,
further boosting the performance of the regression model.
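The encoding described in the abstract can be sketched in a few lines of pandas. This is a minimal illustration of the idea, not the authors' reference implementation: the function names, the smoothing strength `m`, and the default quantile set are all illustrative assumptions.

```python
import pandas as pd

def quantile_encode(df, cat_col, target_col, q=0.5, m=1.0):
    """Encode a categorical column with the per-category target quantile.

    Additive smoothing (strength m, an assumed parameterization) shrinks
    categories with small support toward the global quantile, mitigating
    the overfitting mentioned in the abstract.
    """
    global_q = df[target_col].quantile(q)
    stats = df.groupby(cat_col)[target_col].agg(
        ["count", lambda s: s.quantile(q)]
    )
    stats.columns = ["n", "q"]
    smoothed = (stats["n"] * stats["q"] + m * global_q) / (stats["n"] + m)
    return df[cat_col].map(smoothed)

def quantile_expand(df, cat_col, target_col, quantiles=(0.25, 0.5, 0.75), m=1.0):
    """Expanded encoder: one feature per quantile, as the abstract suggests."""
    return pd.DataFrame(
        {f"{cat_col}_q{q}": quantile_encode(df, cat_col, target_col, q, m)
         for q in quantiles}
    )
```

With `m=0` the encoder reduces to the raw per-category quantile; larger `m` pulls rare categories toward the global statistic, which is the additive-smoothing behavior the abstract describes.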
Related papers
- Generalization bounds for regression and classification on adaptive covering input domains [1.4141453107129398]
We focus on the generalization bound, which serves as an upper limit for the generalization error.
In the case of classification tasks, we treat the target function as a one-hot, a piece-wise constant function, and employ 0/1 loss for error measurement.
arXiv Detail & Related papers (2024-07-29T05:40:08Z) - Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z) - Robust Capped lp-Norm Support Vector Ordinal Regression [85.84718111830752]
Ordinal regression is a specialized supervised problem where the labels show an inherent order.
Support Vector Ordinal Regression, as an outstanding ordinal regression model, is widely used in many ordinal regression tasks.
We introduce a new model, Capped $\ell_p$-Norm Support Vector Ordinal Regression (CSVOR), that is robust to outliers.
arXiv Detail & Related papers (2024-04-25T13:56:05Z) - An Ordinal Regression Framework for a Deep Learning Based Severity
Assessment for Chest Radiographs [50.285682227571996]
We propose a framework that divides the ordinal regression problem into three parts: a model, a target function, and a classification function.
We show that the choice of encoding has a strong impact on performance and that the best encoding depends on the chosen weighting of Cohen's kappa.
arXiv Detail & Related papers (2024-02-08T14:00:45Z) - Deep Imbalanced Regression via Hierarchical Classification Adjustment [50.19438850112964]
Regression tasks in computer vision are often formulated into classification by quantizing the target space into classes.
The majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range.
We propose to construct hierarchical classifiers for solving imbalanced regression tasks.
Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks.
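The basic step this summary rests on, turning a continuous target into class labels by quantizing the target space, can be sketched as follows. This only shows the quantization step under an assumed equal-frequency binning scheme; HCA itself builds a hierarchy of classifiers on top of it.

```python
import numpy as np

def quantize_targets(y, n_bins=4):
    """Turn a continuous regression target into class labels by binning.

    Equal-frequency (quantile) bins are an illustrative choice: even a
    head-heavy target distribution then populates every class, which is
    the imbalance issue the summary describes.
    """
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(y, edges), edges
```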
arXiv Detail & Related papers (2023-10-26T04:54:39Z) - Entropy optimized semi-supervised decomposed vector-quantized
variational autoencoder model based on transfer learning for multiclass text
classification and generation [3.9318191265352196]
We propose a semi-supervised discrete latent variable model for multi-class text classification and text generation.
The proposed model employs the concept of transfer learning for training a quantized transformer model.
Experimental results indicate that the proposed model has surpassed the state-of-the-art models remarkably.
arXiv Detail & Related papers (2021-11-10T07:07:54Z) - Learning Debiased and Disentangled Representations for Semantic
Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z) - Non-Autoregressive Translation by Learning Target Categorical Codes [59.840510037250944]
We propose CNAT, which implicitly learns categorical codes as latent variables for non-autoregressive decoding.
Experiment results show that our model achieves comparable or better performance in machine translation tasks.
arXiv Detail & Related papers (2021-03-21T14:12:34Z) - Scalable Variational Gaussian Process Regression Networks [19.699020509495437]
We propose a scalable variational inference algorithm for GPRN.
We tensorize the output space and introduce tensor/matrix-normal variational posteriors to capture the posterior correlations.
We demonstrate the advantages of our method in several real-world applications.
arXiv Detail & Related papers (2020-03-25T16:39:47Z) - Boosting Ridge Regression for High Dimensional Data Classification [0.8223798883838329]
Ridge regression is a well established regression estimator which can be adapted for classification problems.
The closed-form solution which involves inverting the regularised covariance matrix is rather expensive to compute.
In this paper, we consider learning an ensemble of ridge regressors where each regressor is trained in its own randomly projected subspace.
arXiv Detail & Related papers (2020-03-25T09:07:05Z)
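The random-subspace idea in the last summary, training each ridge regressor on its own random projection so that only a small k x k matrix is inverted instead of the full d x d regularized covariance, can be sketched as below. The projection scheme, ensemble size, and averaging rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; here X is k-dimensional, so we invert
    a k x k matrix instead of the full d x d regularized covariance."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def random_subspace_ridge(X, y, n_models=10, k=5, lam=1.0):
    """Train an ensemble of ridge regressors, each in its own randomly
    projected k-dimensional subspace."""
    models = []
    for _ in range(n_models):
        P = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)  # random projection
        w = ridge_fit(X @ P, y, lam)
        models.append((P, w))
    return models

def predict(models, X):
    # Average the ensemble members' predictions.
    return np.mean([X @ P @ w for P, w in models], axis=0)
```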
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.