A Comparison of Machine Learning Methods for Data with High-Cardinality
Categorical Variables
- URL: http://arxiv.org/abs/2307.02071v1
- Date: Wed, 5 Jul 2023 07:26:27 GMT
- Title: A Comparison of Machine Learning Methods for Data with High-Cardinality
Categorical Variables
- Authors: Fabio Sigrist
- Abstract summary: Machine learning methods can have difficulties with high-cardinality variables.
In this article, we empirically compare several versions of two of the most successful machine learning methods.
- Score: 6.85316573653194
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: High-cardinality categorical variables are variables for which the number of
different levels is large relative to the sample size of a data set, or in
other words, there are few data points per level. Machine learning methods can
have difficulties with high-cardinality variables. In this article, we
empirically compare several versions of two of the most successful machine
learning methods, tree-boosting and deep neural networks, and linear mixed
effects models using multiple tabular data sets with high-cardinality
categorical variables. We find that, first, machine learning models with random
effects have higher prediction accuracy than their classical counterparts
without random effects, and, second, tree-boosting with random effects
outperforms deep neural networks with random effects.
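The abstract's definition of high cardinality (many levels relative to the sample size, hence few data points per level) can be made concrete with a small check. The following is a minimal sketch, not taken from the paper: the function name `cardinality_summary`, the toy `customer_ids` data, and the choice of summary statistics are all assumptions made for illustration.

```python
from collections import Counter


def cardinality_summary(levels):
    """Summarise how sparsely a categorical variable is populated.

    `levels` is a sequence with one categorical value per data point.
    (Hypothetical helper for illustration; not from the paper.)
    """
    if not levels:
        raise ValueError("levels must contain at least one observation")
    counts = Counter(levels)          # number of data points per level
    n_samples = len(levels)
    n_levels = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "n_samples": n_samples,
        "n_levels": n_levels,
        # A small value here means few data points per level.
        "mean_points_per_level": n_samples / n_levels,
        # Fraction of levels that are observed exactly once.
        "share_singleton_levels": singletons / n_levels,
    }


# Toy data (assumed): 8 observations spread over 6 customer IDs.
customer_ids = ["c1", "c2", "c2", "c3", "c4", "c5", "c5", "c6"]
summary = cardinality_summary(customer_ids)
print(summary)
```

A mean of only slightly more than one data point per level, as in this toy data, is the regime the abstract describes as difficult for machine learning methods.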
Related papers
- Machine Learning for predicting chaotic systems [0.0]
We show that well-tuned simple methods, as well as untuned baseline methods, often outperform state-of-the-art deep learning models.
These findings underscore the importance of matching prediction methods to data characteristics and available computational resources.
arXiv Detail & Related papers (2024-07-29T16:34:47Z)
- Can neural networks count digit frequency? [16.04455549316468]
We compare the performance of different classical machine learning models and neural networks in identifying the frequency of occurrence of each digit in a given number.
We observe that the neural networks significantly outperform the classical machine learning models in terms of both the regression and classification metrics for both the 6-digit and 10-digit numbers.
arXiv Detail & Related papers (2023-09-25T03:45:36Z)
- The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
- Learning Likelihood Ratios with Neural Network Classifiers [0.12277343096128711]
Approximations of the likelihood ratio may be computed using clever parametrizations of neural network-based classifiers.
We present a series of empirical studies detailing the performance of several common loss functionals and parametrizations of the classifier output.
arXiv Detail & Related papers (2023-05-17T18:11:38Z)
- Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2 [0.0]
Deep artificial neural networks show high predictive performance in many fields.
But they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend.
This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning.
arXiv Detail & Related papers (2023-02-02T20:43:37Z)
- BeCAPTCHA-Type: Biometric Keystroke Data Generation for Improved Bot Detection [63.447493500066045]
This work proposes a data driven learning model for the synthesis of keystroke biometric data.
The proposed method is compared with two statistical approaches based on Universal and User-dependent models.
Our experimental framework considers a dataset with 136 million keystroke events from 168 thousand subjects.
arXiv Detail & Related papers (2022-07-27T09:26:15Z)
- Quantifying Inherent Randomness in Machine Learning Algorithms [7.591218883378448]
This paper uses an empirical study to examine the effects of randomness in model training and randomness in the partitioning of a dataset into training and test subsets.
We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs).
arXiv Detail & Related papers (2022-06-24T15:49:52Z)
- Dynamically-Scaled Deep Canonical Correlation Analysis [77.34726150561087]
Canonical Correlation Analysis (CCA) is a method for feature extraction of two views by finding maximally correlated linear projections of them.
We introduce a novel dynamic scaling method for training an input-dependent canonical correlation model.
arXiv Detail & Related papers (2022-03-23T12:52:49Z)
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
- Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.