Comparative Study on the Performance of Categorical Variable Encoders in
Classification and Regression Tasks
- URL: http://arxiv.org/abs/2401.09682v1
- Date: Thu, 18 Jan 2024 02:21:53 GMT
- Title: Comparative Study on the Performance of Categorical Variable Encoders in
Classification and Regression Tasks
- Authors: Wenbin Zhu, Runwen Qiu and Ying Fu
- Abstract summary: This study broadly classifies machine learning models into three categories: 1) ATI models that implicitly perform affine transformations on inputs; 2) Tree-based models that are based on decision trees; and 3) the rest, such as kNN.
Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoder by learning suitable weights from the data.
We also explain why the target encoder and its variants are the most suitable encoders for tree-based models.
- Score: 11.721062526796976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Categorical variables often appear in datasets for classification and
regression tasks, and they need to be encoded into numerical values before
training. Since many encoders have been developed and can significantly impact
performance, choosing the appropriate encoder for a task becomes a
time-consuming yet important practical issue. This study broadly classifies
machine learning models into three categories: 1) ATI models that implicitly
perform affine transformations on inputs, such as multi-layer perceptron neural
networks; 2) Tree-based models that are based on decision trees, such as random
forests; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot
encoder is the best choice for ATI models in the sense that it can mimic any
other encoder by learning suitable weights from the data. We also explain why
the target encoder and its variants are the most suitable encoders for
tree-based models. We conducted comprehensive computational experiments
to evaluate 14 encoders, including one-hot and target encoders, along with
eight common machine-learning models on 28 datasets. The computational results
agree with our theoretical analysis. The findings shed light on how data
scientists in fields such as fraud detection and disease diagnosis can select
a suitable encoder for their task.
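To make the "mimicking" argument concrete, here is a minimal Python sketch (not from the paper; the synthetic data and names are illustrative only): an affine map applied to one-hot columns reproduces a target-encoded feature exactly once each dummy column's weight is set to that category's encoded value, which is precisely the kind of weight an ATI model could learn.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cats = np.array(["a", "b", "c"])
x = rng.choice(cats, size=200)
effect = {"a": 1.0, "b": 2.5, "c": -0.5}
y = np.array([effect[c] for c in x]) + rng.normal(scale=0.1, size=200)

# Target encoding: replace each category by the mean target value of that category.
means = pd.Series(y).groupby(pd.Series(x)).mean()      # index: "a", "b", "c"
x_target = pd.Series(x).map(means).to_numpy()

# One-hot encoding of the same column.
dummies = pd.get_dummies(pd.Series(x))                 # columns: "a", "b", "c"
one_hot = dummies.to_numpy(dtype=float)

# An affine map whose weights are the per-category means reproduces the
# target-encoded feature exactly -- the "mimicking" argument for ATI models.
weights = means.loc[dummies.columns].to_numpy()
x_mimicked = one_hot @ weights

print(np.allclose(x_mimicked, x_target))               # True
```
And a hedged sketch of the kind of encoder-versus-model comparison the experiments run, using only scikit-learn (the paper evaluates 14 encoders and 8 models on 28 datasets; the synthetic data, encoder choices, and hyperparameters below are placeholders, and `TargetEncoder` assumes scikit-learn >= 1.3):
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, TargetEncoder  # TargetEncoder: scikit-learn >= 1.3

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "city": rng.choice(list("ABCDEFGH"), size=n),
    "plan": rng.choice(["basic", "plus", "pro"], size=n),
})
# Synthetic binary label with per-category main effects.
city_effect = X["city"].map(dict(zip("ABCDEFGH", np.linspace(0.2, 0.8, 8)))).to_numpy()
plan_effect = np.where(X["plan"] == "pro", 0.1, -0.1)
y = (rng.random(n) < np.clip(city_effect + plan_effect, 0.0, 1.0)).astype(int)

encoders = {"one-hot": OneHotEncoder(handle_unknown="ignore"), "target": TargetEncoder()}
models = {
    "MLP (ATI)": MLPClassifier(max_iter=1000, random_state=0),
    "Random forest (tree-based)": RandomForestClassifier(random_state=0),
}

# Cross-validated accuracy for every encoder/model pair.
for enc_name, enc in encoders.items():
    for model_name, model in models.items():
        pipe = make_pipeline(ColumnTransformer([("cat", enc, ["city", "plan"])]), model)
        acc = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"{enc_name:7s} + {model_name:27s} accuracy: {acc:.3f}")
```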
Related papers
- Tissue Concepts: supervised foundation models in computational pathology [2.246872800470769]
Training foundation models themselves is usually very expensive in terms of data, computation, and time.
This paper proposes a supervised training method that drastically reduces these expenses.
The proposed method is based on multi-task learning to train a joint encoder, by combining 16 different classification, segmentation, and detection tasks on a total of 912,000 patches.
arXiv Detail & Related papers (2024-09-05T13:32:40Z) - A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z) - Explainable AI for Comparative Analysis of Intrusion Detection Models [20.683181384051395]
This research applies various machine learning models to the tasks of binary and multi-class classification for intrusion detection from network traffic.
We trained all models to an accuracy of 90% on the UNSW-NB15 dataset.
We also discover that Random Forest provides the best performance in terms of accuracy, time efficiency and robustness.
arXiv Detail & Related papers (2024-06-14T03:11:01Z) - 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z) - Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z) - Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly
Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z) - Knowledge-integrated AutoEncoder Model [0.0]
We introduce a novel approach for developing AE models that can integrate external knowledge sources into the learning process.
The proposed model is evaluated on three large-scale datasets from three different scientific fields.
arXiv Detail & Related papers (2023-03-12T18:00:12Z) - Cats: Complementary CNN and Transformer Encoders for Segmentation [13.288195115791758]
We propose a model with double encoders for 3D biomedical image segmentation.
We fuse the information from the convolutional encoder and the transformer, and pass it to the decoder to obtain the results.
Compared to the state-of-the-art models with and without transformers on each task, our proposed method obtains higher Dice scores across the board.
arXiv Detail & Related papers (2022-08-24T14:25:11Z) - Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
arXiv Detail & Related papers (2022-07-22T17:52:30Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking
Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.