A Comparison of Modeling Preprocessing Techniques
- URL: http://arxiv.org/abs/2302.12042v2
- Date: Fri, 24 Feb 2023 02:18:19 GMT
- Title: A Comparison of Modeling Preprocessing Techniques
- Authors: Tosan Johnson, Alice J. Liu, Syed Raza, Aaron McGuire
- Abstract summary: This paper compares various data preprocessing methods in terms of predictive performance on structured data.
Three data sets of various structures, interactions, and complexity were constructed.
We compare several methods for feature selection, categorical handling, and null imputation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper compares various data preprocessing methods in terms of
predictive performance on structured data. It also seeks to
identify and recommend preprocessing methodologies for tree-based binary
classification models, with a focus on eXtreme Gradient Boosting (XGBoost)
models. Three data sets of various structures, interactions, and complexity
were constructed, which were supplemented by a real-world data set from the
Lending Club. We compare several methods for feature selection, categorical
handling, and null imputation. Performance is assessed using relative
comparisons among the chosen methodologies, including model prediction
variability. The paper is organized around the three groups of preprocessing
methodologies, with each section consisting of generalized observations. Each
observation is accompanied by a recommendation of one or more preferred
methodologies. Among feature selection methods, permutation-based feature
importance, regularization, and XGBoost's feature importance by weight are not
recommended. The correlation coefficient reduction also shows inferior
performance. Instead, XGBoost importance by gain shows the most consistency and
highest caliber of performance. Categorical feature encoding methods show
greater discrimination in performance among data set structures. While there
was no universal "best" method, frequency encoding showed the greatest
performance for the most complex data set (Lending Club), but had the poorest
performance for all synthetic (i.e., simpler) data sets. Finally, missing
indicator imputation dominated in terms of performance among imputation
methods, whereas tree imputation showed extremely poor and highly variable
model performance.
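As a rough illustration of the preferred choices identified above (feature selection by XGBoost gain importance, frequency encoding for categoricals, and missing-indicator imputation), the sketch below applies each step to a small synthetic frame before fitting an XGBoost classifier. The column names, toy data, and the top-k cutoff are illustrative assumptions, not details taken from the paper's data sets.

```python
# Minimal sketch of the three preferred preprocessing choices on toy data.
# Column names and data are hypothetical, not taken from the paper.
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_c": rng.choice(["red", "green", "blue"], size=n),
})
y = ((df["num_a"] + (df["cat_c"] == "red")) > 0.5).astype(int)

# Null imputation: missing-indicator approach. Flag the missing rows in a
# separate column, then fill the original column with its median.
df.loc[rng.random(n) < 0.1, "num_b"] = np.nan
df["num_b_missing"] = df["num_b"].isna().astype(int)
df["num_b"] = df["num_b"].fillna(df["num_b"].median())

# Categorical handling: frequency encoding. Replace each category with its
# relative frequency observed in the training data.
freq = df["cat_c"].value_counts(normalize=True)
df["cat_c"] = df["cat_c"].map(freq)

# Feature selection: rank features by XGBoost's gain-based importance and
# keep the top k (k = 3 here is arbitrary).
model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(df, y)
gain = model.get_booster().get_score(importance_type="gain")
top_features = sorted(gain, key=gain.get, reverse=True)[:3]
print("Selected features:", top_features)
```

In a relative comparison like the paper's, each alternative (e.g., one-hot encoding or plain median imputation) would be swapped into the same pipeline and the resulting models compared on held-out predictive performance.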
Related papers
- Utilising Explainable Techniques for Quality Prediction in a Complex Textiles Manufacturing Use Case [0.0]
This paper develops an approach to classify instances of product failure in a complex textiles manufacturing dataset using explainable techniques.
In investigating the trade-off between accuracy and explainability, three different tree-based classification algorithms were evaluated.
arXiv Detail & Related papers (2024-07-26T06:50:17Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- The choice of scaling technique matters for classification performance [6.745479230590518]
We compare the impact of 5 scaling techniques on the performance of 20 classification algorithms, covering both monolithic and ensemble models.
Results show that the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases.
We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model.
arXiv Detail & Related papers (2022-12-23T13:51:45Z)
- Deep Negative Correlation Classification [82.45045814842595]
Existing deep ensemble methods naively train many different models and then aggregate their predictions.
We propose deep negative correlation classification (DNCC)
DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated.
arXiv Detail & Related papers (2022-12-14T07:35:20Z)
- Ensemble Classifier Design Tuned to Dataset Characteristics for Network Intrusion Detection [0.0]
Two new algorithms are proposed to address the class overlap issue in the dataset.
The proposed design is evaluated for both binary and multi-category classification.
arXiv Detail & Related papers (2022-05-08T21:06:42Z)
- Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study [3.7881729884531805]
The paper is organized in a findings-based manner, with each section providing general conclusions.
Overall, XGB and FFNNs were competitive, with FFNNs showing better performance in smooth models.
RF did not perform well in general, confirming the findings in the literature.
arXiv Detail & Related papers (2022-04-27T12:04:33Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
- Structured Graph Learning for Clustering and Semi-supervised Classification [74.35376212789132]
We propose a graph learning framework to preserve both the local and global structure of data.
Our method uses the self-expressiveness of samples to capture the global structure and an adaptive neighbor approach to respect the local structure.
Our model is equivalent to a combination of kernel k-means and k-means methods under certain conditions.
arXiv Detail & Related papers (2020-08-31T08:41:20Z)
- Does imputation matter? Benchmark for predictive models [5.802346990263708]
This paper systematically evaluates the empirical effectiveness of data imputation algorithms for predictive models.
The main contributions include the recommendation of a general method for empirical benchmarking based on real-life classification tasks.
arXiv Detail & Related papers (2020-07-06T15:47:36Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)