A Comparison of Modeling Preprocessing Techniques
- URL: http://arxiv.org/abs/2302.12042v2
- Date: Fri, 24 Feb 2023 02:18:19 GMT
- Title: A Comparison of Modeling Preprocessing Techniques
- Authors: Tosan Johnson, Alice J. Liu, Syed Raza, Aaron McGuire
- Abstract summary: This paper compares various data preprocessing methods in terms of predictive performance on structured data.
Three data sets of various structures, interactions, and complexity were constructed.
We compare several methods for feature selection, categorical handling, and null imputation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper compares various data preprocessing methods in terms of
predictive performance on structured data. It also seeks to
identify and recommend preprocessing methodologies for tree-based binary
classification models, with a focus on eXtreme Gradient Boosting (XGBoost)
models. Three data sets of various structures, interactions, and complexity
were constructed, which were supplemented by a real-world data set from the
Lending Club. We compare several methods for feature selection, categorical
handling, and null imputation. Performance is assessed using relative
comparisons among the chosen methodologies, including model prediction
variability. The paper is organized around the three groups of preprocessing
methodologies, with each section consisting of generalized observations. Each
observation is accompanied by a recommendation of one or more preferred
methodologies. Among feature selection methods, permutation-based feature
importance, regularization, and XGBoost's feature importance by weight are not
recommended. The correlation coefficient reduction also shows inferior
performance. Instead, XGBoost importance by gain shows the most consistency and
highest caliber of performance. Categorical feature encoding methods show
greater discrimination in performance among data set structures. While there
was no universal "best" method, frequency encoding showed the greatest
performance for the most complex data set (Lending Club), but had the poorest
performance for all synthetic (i.e., simpler) data sets. Finally, missing
indicator imputation dominated in terms of performance among imputation
methods, whereas tree imputation showed extremely poor and highly variable
model performance.
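As a rough illustration of the preferred choices identified above (feature selection by XGBoost gain importance, frequency encoding for categoricals, and missing-indicator imputation), the sketch below applies each step to a small synthetic frame before fitting an XGBoost classifier. The column names, toy data, and the top-k cutoff are illustrative assumptions, not details taken from the paper's data sets.

```python
# Minimal sketch of the three preferred preprocessing choices on toy data.
# Column names and data are hypothetical, not taken from the paper.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_c": rng.choice(["red", "green", "blue"], size=n),
})
y = ((df["num_a"] + (df["cat_c"] == "red")) > 0.5).astype(int)

# Null imputation: missing-indicator approach. Flag the missing rows in a
# separate column, then fill the original column with its median.
df.loc[rng.random(n) < 0.1, "num_b"] = np.nan
df["num_b_missing"] = df["num_b"].isna().astype(int)
df["num_b"] = df["num_b"].fillna(df["num_b"].median())

# Categorical handling: frequency encoding. Replace each category with its
# relative frequency observed in the training data.
freq = df["cat_c"].value_counts(normalize=True)
df["cat_c"] = df["cat_c"].map(freq)

# Feature selection: rank features by XGBoost's gain-based importance and
# keep the top k (k = 3 here is arbitrary).
model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(df, y)
gain = model.get_booster().get_score(importance_type="gain")
top_features = sorted(gain, key=gain.get, reverse=True)[:3]
print("Selected features:", top_features)
```

In a relative comparison like the paper's, each alternative (e.g., one-hot encoding or plain median imputation) would be swapped into the same pipeline and the resulting models compared on held-out predictive performance.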
Related papers
- Utilising Explainable Techniques for Quality Prediction in a Complex Textiles Manufacturing Use Case [0.0]
This paper develops an approach to classify instances of product failure in a complex textiles manufacturing dataset using explainable techniques.
In investigating the trade-off between accuracy and explainability, three different tree-based classification algorithms were evaluated.
arXiv Detail & Related papers (2024-07-26T06:50:17Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- The choice of scaling technique matters for classification performance [6.745479230590518]
We compare the impact of 5 scaling techniques on the performance of 20 classification algorithms, covering both monolithic and ensemble models.
Results show that the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases.
We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model.
arXiv Detail & Related papers (2022-12-23T13:51:45Z)
- Deep Negative Correlation Classification [82.45045814842595]
Existing deep ensemble methods naively train many different models and then aggregate their predictions.
We propose deep negative correlation classification (DNCC)
DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated.
arXiv Detail & Related papers (2022-12-14T07:35:20Z)
- Ensemble Classifier Design Tuned to Dataset Characteristics for Network Intrusion Detection [0.0]
Two new algorithms are proposed to address the class overlap issue in the dataset.
The proposed design is evaluated for both binary and multi-category classification.
arXiv Detail & Related papers (2022-05-08T21:06:42Z)
- Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study [3.7881729884531805]
The paper is organized in a findings-based manner, with each section providing general conclusions.
Overall, XGB and FFNNs were competitive, with FFNNs showing better performance in smooth models.
RF did not perform well in general, confirming the findings in the literature.
arXiv Detail & Related papers (2022-04-27T12:04:33Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
- Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data.
Our method uses the self-expressiveness of samples to capture the global structure and an adaptive neighbor approach to respect the local structure.
Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
- Does imputation matter? Benchmark for predictive models [5.802346990263708]
This paper systematically evaluates the empirical effectiveness of data imputation algorithms for predictive models.
The main contributions include the recommendation of a general method for empirical benchmarking based on real-life classification tasks.
arXiv Detail & Related papers (2020-07-06T15:47:36Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)