Efficient Representations for High-Cardinality Categorical Variables in Machine Learning
- URL: http://arxiv.org/abs/2501.05646v1
- Date: Fri, 10 Jan 2025 01:25:01 GMT
- Title: Efficient Representations for High-Cardinality Categorical Variables in Machine Learning
- Authors: Zixuan Liang,
- Abstract summary: Highcardinality categorical variables pose significant challenges in machine learning.
Traditional onehot encoding often results in highdimensional sparse feature spaces.
This paper introduces novel encoding techniques, including means encoding, lowrank encoding, and multinomial logistic regression encoding.
- Score: 0.0
- License:
- Abstract: High\-cardinality categorical variables pose significant challenges in machine learning, particularly in terms of computational efficiency and model interpretability. Traditional one\-hot encoding often results in high\-dimensional sparse feature spaces, increasing the risk of overfitting and reducing scalability. This paper introduces novel encoding techniques, including means encoding, low\-rank encoding, and multinomial logistic regression encoding, to address these challenges. These methods leverage sufficient representations to generate compact and informative embeddings of categorical data. We conduct rigorous theoretical analyses and empirical validations on diverse datasets, demonstrating significant improvements in model performance and computational efficiency compared to baseline methods. The proposed techniques are particularly effective in domains requiring scalable solutions for large datasets, paving the way for more robust and efficient applications in machine learning.
Related papers
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.
We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.
Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - Contextual Reinforcement in Multimodal Token Compression for Large Language Models [0.0]
token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets.
A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance.
This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation.
arXiv Detail & Related papers (2025-01-28T02:44:31Z) - Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks [6.596361762662328]
Internal structure and operation mechanism of large-scale language models are analyzed theoretically.
We evaluate the contribution of adaptive optimization algorithms (such as AdamW), massively parallel computing techniques, and mixed precision training strategies.
arXiv Detail & Related papers (2024-05-20T00:10:00Z) - A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
arXiv Detail & Related papers (2024-02-10T11:14:53Z) - Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores some of such algorithms for fast execution and efficient data utilization.
We focus on learning algorithms with various incorporations of data augmentation that improve generalization and distributional provably.
Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z) - Hybridization of Capsule and LSTM Networks for unsupervised anomaly
detection on multivariate data [0.0]
This paper introduces a novel NN architecture which hybridises the Long-Short-Term-Memory (LSTM) and Capsule Networks into a single network.
The proposed method uses an unsupervised learning technique to overcome the issues with finding large volumes of labelled training data.
arXiv Detail & Related papers (2022-02-11T10:33:53Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical representation of tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z) - Learning Generalized Relational Heuristic Networks for Model-Agnostic
Planning [29.714818991696088]
This paper develops a new approach for learning generalizeds in the absence of symbolic action models.
It uses an abstract state representation to facilitate data efficient, generalizable learning.
arXiv Detail & Related papers (2020-07-10T06:08:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.