Efficient Representations for High-Cardinality Categorical Variables in Machine Learning
- URL: http://arxiv.org/abs/2501.05646v1
- Date: Fri, 10 Jan 2025 01:25:01 GMT
- Title: Efficient Representations for High-Cardinality Categorical Variables in Machine Learning
- Authors: Zixuan Liang,
- Abstract summary: Highcardinality categorical variables pose significant challenges in machine learning.<n>Traditional onehot encoding often results in highdimensional sparse feature spaces.<n>This paper introduces novel encoding techniques, including means encoding, lowrank encoding, and multinomial logistic regression encoding.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High\-cardinality categorical variables pose significant challenges in machine learning, particularly in terms of computational efficiency and model interpretability. Traditional one\-hot encoding often results in high\-dimensional sparse feature spaces, increasing the risk of overfitting and reducing scalability. This paper introduces novel encoding techniques, including means encoding, low\-rank encoding, and multinomial logistic regression encoding, to address these challenges. These methods leverage sufficient representations to generate compact and informative embeddings of categorical data. We conduct rigorous theoretical analyses and empirical validations on diverse datasets, demonstrating significant improvements in model performance and computational efficiency compared to baseline methods. The proposed techniques are particularly effective in domains requiring scalable solutions for large datasets, paving the way for more robust and efficient applications in machine learning.
Related papers
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - Contextual Reinforcement in Multimodal Token Compression for Large Language Models [0.0]
token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets.
A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance.
This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation.
arXiv Detail & Related papers (2025-01-28T02:44:31Z) - Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade.
We introduce a parallel tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z) - Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks [6.596361762662328]
Internal structure and operation mechanism of large-scale language models are analyzed theoretically.
We evaluate the contribution of adaptive optimization algorithms (such as AdamW), massively parallel computing techniques, and mixed precision training strategies.
arXiv Detail & Related papers (2024-05-20T00:10:00Z) - A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
arXiv Detail & Related papers (2024-02-10T11:14:53Z) - Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores some of such algorithms for fast execution and efficient data utilization.
We focus on learning algorithms with various incorporations of data augmentation that improve generalization and distributional provably.
Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z) - On Robust Numerical Solver for ODE via Self-Attention Mechanism [82.95493796476767]
We explore training efficient and robust AI-enhanced numerical solvers with a small data size by mitigating intrinsic noise disturbances.
We first analyze the ability of the self-attention mechanism to regulate noise in supervised learning and then propose a simple-yet-effective numerical solver, Attr, which introduces an additive self-attention mechanism to the numerical solution of differential equations.
arXiv Detail & Related papers (2023-02-05T01:39:21Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z) - Hybridization of Capsule and LSTM Networks for unsupervised anomaly
detection on multivariate data [0.0]
This paper introduces a novel NN architecture which hybridises the Long-Short-Term-Memory (LSTM) and Capsule Networks into a single network.
The proposed method uses an unsupervised learning technique to overcome the issues with finding large volumes of labelled training data.
arXiv Detail & Related papers (2022-02-11T10:33:53Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Active Importance Sampling for Variational Objectives Dominated by Rare
Events: Consequences for Optimization and Generalization [12.617078020344618]
We introduce an approach that combines rare events sampling techniques with neural network optimization to optimize objective functions dominated by rare events.
We show that importance sampling reduces the variance of the solution to a learning problem, suggesting benefits for generalization.
Our numerical experiments demonstrate that we can successfully learn even with the compounding difficulties of high-dimensional and rare data.
arXiv Detail & Related papers (2020-08-11T23:38:09Z) - Learning Generalized Relational Heuristic Networks for Model-Agnostic
Planning [29.714818991696088]
This paper develops a new approach for learning generalizeds in the absence of symbolic action models.
It uses an abstract state representation to facilitate data efficient, generalizable learning.
arXiv Detail & Related papers (2020-07-10T06:08:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.