PCA-based Category Encoder for Categorical to Numerical Variable
Conversion
- URL: http://arxiv.org/abs/2111.14839v1
- Date: Mon, 29 Nov 2021 12:49:20 GMT
- Title: PCA-based Category Encoder for Categorical to Numerical Variable
Conversion
- Authors: Hamed Farkhari, Joseanne Viana, Luis Miguel Campos, Pedro Sebastiao,
Rodolfo Oliveira, Luis Bernardo
- Abstract summary: Increasing the cardinality of categorical variables might decrease the overall performance of machine learning (ML) algorithms.
This paper presents a novel computational preprocessing method to convert categorical to numerical variables.
The proposed technique achieved the highest performance related to accuracy and Area under the curve (AUC) on high cardinality categorical variables.
- Score: 1.1156827035309407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Increasing the cardinality of categorical variables might decrease the
overall performance of ML algorithms. This paper presents a novel computational
preprocessing method to convert categorical to numerical variables for machine
learning (ML) algorithms. In this method, we select and convert three
categorical features to numerical features. First, we choose the threshold
parameter based on the distribution of categories in variables. Then, we use
conditional probabilities to convert each categorical variable into two new
numerical variables, resulting in six new numerical variables in total. After
that, we feed these six numerical variables to the Principal Component Analysis
(PCA) algorithm. Next, we select all or a subset of the resulting Principal
Components (PCs). Finally, by applying binary classification with ten different
classifiers, we measure the performance of the new encoder and compare it
with 17 other well-known category encoders. The proposed technique achieved
the highest performance related to accuracy and Area under the curve (AUC) on
high-cardinality categorical variables on the well-known NSL-KDD cybersecurity
dataset. Also, we define harmonic average metrics to find the best
trade-off between train and test performance and prevent underfitting and
overfitting. Ultimately, the number of newly created numerical variables is
minimal. Consequently, this data reduction improves computational processing
time, which might reduce the cost of processing data in future 5G
telecommunication networks.
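The pipeline described in the abstract can be sketched as follows. The abstract does not give the exact conditional-probability formulas, the threshold rule, or the harmonic average definition, so the choices below (per-class conditional probabilities P(category | class) as the two new features, a rare-category bucket controlled by `threshold`, the toy NSL-KDD-style column names, and the harmonic mean of train/test scores) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def encode_categorical(series: pd.Series, target: pd.Series,
                       threshold: int = 10) -> pd.DataFrame:
    # Convert one categorical variable into two numerical variables using
    # conditional probabilities. Assumption: the two features are
    # P(category | target=1) and P(category | target=0), with categories
    # occurring fewer than `threshold` times grouped into an "other" bucket.
    counts = series.value_counts()
    rare = counts[counts < threshold].index
    s = series.where(~series.isin(rare), "other")

    p_pos = s[target == 1].value_counts(normalize=True)
    p_neg = s[target == 0].value_counts(normalize=True)
    return pd.DataFrame({
        f"{series.name}_p_pos": s.map(p_pos).fillna(0.0),
        f"{series.name}_p_neg": s.map(p_neg).fillna(0.0),
    })

def harmonic_mean(train_score: float, test_score: float) -> float:
    # Harmonic average of train and test performance: it penalizes a large
    # gap between the two, flagging over- or underfitting.
    return 2 * train_score * test_score / (train_score + test_score)

# Toy example: three categorical features -> six numerical features -> PCA.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "proto": rng.choice(["tcp", "udp", "icmp"], 200),
    "service": rng.choice(["http", "ftp", "smtp", "dns"], 200),
    "flag": rng.choice(["SF", "S0", "REJ"], 200),
})
y = pd.Series(rng.integers(0, 2, 200))

numeric = pd.concat(
    [encode_categorical(df[c], y) for c in df.columns], axis=1
)                                                  # shape (200, 6)
pcs = PCA(n_components=4).fit_transform(numeric)   # keep a subset of PCs
print(numeric.shape, pcs.shape)
```

The six encoded columns would then be replaced by the selected PCs before feeding the data to the downstream classifiers, which is where the data reduction claimed in the abstract comes from.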
Related papers
- Fractional Naive Bayes (FNB): non-convex optimization for a parsimonious weighted selective naive Bayes classifier [0.0]
We address supervised classification for datasets with a very large number of input variables.
We propose a regularization of the model log-likelihood.
The various proposed algorithms result in an optimization-based weighted naïve Bayes scheme.
arXiv Detail & Related papers (2024-09-17T11:54:14Z) - Non-parametric Conditional Independence Testing for Mixed
Continuous-Categorical Variables: A Novel Method and Numerical Evaluation [14.993705256147189]
Conditional independence testing (CIT) is a common task in machine learning.
Many real-world applications involve mixed-type datasets that include numerical and categorical variables.
We propose a variation of the former approach that does not treat categorical variables as numeric.
arXiv Detail & Related papers (2023-10-17T10:29:23Z) - Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z) - Prediction Calibration for Generalized Few-shot Semantic Segmentation [101.69940565204816]
Generalized Few-shot Semantic Segmentation (GFSS) aims to segment each image pixel into either base classes with abundant training examples or novel classes with only a handful of (e.g., 1-5) training images per class.
We build a cross-attention module that guides the classifier's final prediction using the fused multi-level features.
Our PCN outperforms the state-of-the-art alternatives by large margins.
arXiv Detail & Related papers (2022-10-15T13:30:12Z) - Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Determination of class-specific variables in nonparametric
multiple-class classification [0.0]
We propose a probability-based nonparametric multiple-class classification method, and integrate it with the ability to identify high-impact variables for individual classes.
We report the properties of the proposed method, and use both synthesized and real data sets to illustrate its properties under different classification situations.
arXiv Detail & Related papers (2022-05-07T10:08:58Z) - Confusion-based rank similarity filters for computationally-efficient
machine learning on high dimensional data [0.0]
We introduce a novel type of computationally efficient artificial neural network (ANN) called the rank similarity filter (RSF).
RSFs can be used to transform and classify nonlinearly separable datasets with many data points and dimensions.
Open-source code for RST, RSC and RSPC was written in Python using the popular scikit-learn framework to make it easily accessible.
arXiv Detail & Related papers (2021-09-28T10:53:38Z) - Regularized target encoding outperforms traditional methods in
supervised machine learning with high cardinality features [1.1709030738577393]
We study techniques that yield numeric representations of categorical variables.
We compare different encoding strategies together with five machine learning algorithms.
Regularized versions of target encoding consistently provided the best results.
arXiv Detail & Related papers (2021-04-01T17:21:42Z) - High-Dimensional Quadratic Discriminant Analysis under Spiked Covariance
Model [101.74172837046382]
We propose a novel quadratic classification technique, the parameters of which are chosen such that the fisher-discriminant ratio is maximized.
Numerical simulations show that the proposed classifier not only outperforms the classical R-QDA for both synthetic and real data but also requires lower computational complexity.
arXiv Detail & Related papers (2020-06-25T12:00:26Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: The random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
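As a concrete point of reference for the "Regularized target encoding" entry above, a standard smoothed target encoder can be sketched as follows. The smoothing formula and the strength parameter `m` are a common convention for regularizing target encoding, not necessarily the exact regularizers evaluated in that paper.

```python
import pandas as pd

def smoothed_target_encode(cat: pd.Series, target: pd.Series,
                           m: float = 10.0) -> pd.Series:
    # Regularized (smoothed) target encoding: shrink each category's mean
    # target toward the global mean, with shrinkage strength m. Rare
    # categories are pulled strongly toward the global mean, which curbs
    # the overfitting that plain target encoding shows on high-cardinality
    # features.
    global_mean = target.mean()
    stats = target.groupby(cat).agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * global_mean) \
             / (stats["count"] + m)
    return cat.map(smooth)

colors = pd.Series(["red", "red", "blue", "green"])
y = pd.Series([1, 0, 1, 1])
print(smoothed_target_encode(colors, y, m=2.0).round(3).tolist())
```

With `m = 2.0` and a global mean of 0.75, "red" (two observations, mean 0.5) encodes to 0.625 rather than 0.5, while the singleton categories are pulled most of the way toward 0.75.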
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.