The choice of scaling technique matters for classification performance
- URL: http://arxiv.org/abs/2212.12343v1
- Date: Fri, 23 Dec 2022 13:51:45 GMT
- Title: The choice of scaling technique matters for classification performance
- Authors: Lucas B.V. de Amorim, George D.C. Cavalcanti and Rafael M.O. Cruz
- Abstract summary: We compare the impact of 5 scaling techniques on the performance of 20 classification algorithms, spanning monolithic and ensemble models.
Results show that the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases.
We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model.
- Score: 6.745479230590518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset scaling, also known as normalization, is an essential preprocessing
step in a machine learning pipeline. It aims to adjust attribute scales so that
they all vary within the same range. This transformation is known
to improve the performance of classification models, but there are several
scaling techniques to choose from, and this choice is not generally done
carefully. In this paper, we conduct a broad experiment comparing the impact of
5 scaling techniques on the performance of 20 classification algorithms, spanning
monolithic and ensemble models, applying them to 82 publicly available datasets
with varying imbalance ratios. Results show that the choice of scaling
technique matters for classification performance, and the performance
difference between the best and the worst scaling technique is relevant and
statistically significant in most cases. They also indicate that choosing an
inadequate technique can be more detrimental to classification performance than
not scaling the data at all. We also show how the performance variation of an
ensemble model, considering different scaling techniques, tends to be dictated
by that of its base model. Finally, we discuss the relationship between a
model's sensitivity to the choice of scaling technique and its performance and
provide insights into its applicability in different model deployment
scenarios. Full results and source code for the experiments in this paper are
available in a GitHub repository: https://github.com/amorimlb/scaling_matters
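As a rough illustration of the comparison described in the abstract, the sketch below evaluates a scale-sensitive classifier under several common scikit-learn scaling techniques, as well as with no scaling at all. The specific scalers, dataset, classifier, and metric are illustrative placeholders and are not necessarily the ones used in the paper.

# Minimal sketch (not the authors' pipeline): compare common scaling techniques
# on a scale-sensitive classifier. Scalers, dataset, classifier, and metric are
# illustrative assumptions only.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   QuantileTransformer, RobustScaler,
                                   StandardScaler)

X, y = load_wine(return_X_y=True)

scalers = {
    "none": None,                          # baseline: no scaling
    "standard": StandardScaler(),          # z-score standardization
    "min-max": MinMaxScaler(),             # rescale each attribute to [0, 1]
    "max-abs": MaxAbsScaler(),             # divide by the maximum absolute value
    "robust": RobustScaler(),              # center by median, scale by IQR
    "quantile": QuantileTransformer(n_quantiles=100),  # map to a uniform distribution
}

for name, scaler in scalers.items():
    clf = KNeighborsClassifier()           # distance-based, hence scale-sensitive
    pipe = clf if scaler is None else make_pipeline(scaler, clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name:>9}: {scores.mean():.3f} +/- {scores.std():.3f}")

Fitting each scaler inside the cross-validation pipeline keeps test-fold statistics out of the transformation, which is the usual setup for this kind of comparison; the gap between the best and worst rows gives a feel for the performance differences the paper quantifies across 82 datasets.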
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that utilizes text embedding models for semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- DTization: A New Method for Supervised Feature Scaling [0.0]
Feature scaling is one of the data pre-processing techniques that improves the performance of machine learning algorithms.
We present a novel feature scaling technique named DTization that employs a decision tree and a robust scaler for supervised feature scaling.
The results show a noteworthy performance improvement compared to the traditional feature scaling methods.
arXiv Detail & Related papers (2024-04-27T15:25:03Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources [9.359395812292291]
This paper proposes a framework that predicts model performance and supports data selection decisions based on partial samples of prospective data sources.
The framework significantly improves on existing performance scaling approaches in terms of both the accuracy of performance inference and the computation costs associated with constructing the performance predictor.
It also outperforms a range of other off-the-shelf solutions by a wide margin in terms of data selection effectiveness.
arXiv Detail & Related papers (2023-07-05T17:33:41Z)
- A Comparison of Modeling Preprocessing Techniques [0.0]
This paper compares various data preprocessing methods in terms of predictive performance on structured data.
Three data sets of various structures, interactions, and complexity were constructed.
We compare several methods for feature selection, categorical handling, and null imputation.
arXiv Detail & Related papers (2023-02-23T14:11:08Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Fair Comparison: Quantifying Variance in Results for Fine-grained Visual Categorization [0.5735035463793008]
Average categorization accuracy is often used in isolation.
As the number of classes increases, the amount of information conveyed by average accuracy alone dwindles.
While its most glaring weakness is its failure to describe the model's performance on a class-by-class basis, average accuracy also fails to describe how performance may vary from one trained model of the same architecture to another.
arXiv Detail & Related papers (2021-09-07T15:47:27Z)
- Adaptive Threshold for Better Performance of the Recognition and Re-identification Models [0.0]
An online optimization-based statistical feature learning adaptive technique is developed and tested on the LFW dataset and a self-prepared athletes dataset.
Adopting an adaptive threshold resulted in a 12-45% improvement in model accuracy compared to the fixed thresholds (0.3, 0.5, 0.7) that are usually chosen by trial and error in classification and identification tasks.
arXiv Detail & Related papers (2020-12-28T15:40:53Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
- Learning to Select Base Classes for Few-shot Classification [96.92372639495551]
We use the Similarity Ratio as an indicator for the generalization performance of a few-shot model.
We then formulate the base class selection problem as a submodular optimization problem over Similarity Ratio.
arXiv Detail & Related papers (2020-04-01T09:55:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.