Beyond Random Split for Assessing Statistical Model Performance
- URL: http://arxiv.org/abs/2209.03346v1
- Date: Sun, 4 Sep 2022 22:24:35 GMT
- Title: Beyond Random Split for Assessing Statistical Model Performance
- Authors: Carlos Catania and Jorge Guerra and Juan Manuel Romero and Gabriel
Caffaratti and Martin Marchetta
- Abstract summary: We analyze strategies based on the predictors' variability for splitting data into training and testing sets.
Such strategies aim to guarantee the inclusion of rare or unusual examples with minimal loss of the population's representativeness.
Preliminary results showed the importance of applying the three alternative strategies in addition to the Monte Carlo splitting strategy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Even though randomly splitting a dataset into training and test sets is common
practice, it may not always be the best approach for estimating generalization performance
in some scenarios. The fact is that the usual machine learning methodology can sometimes
overestimate the generalization error when a dataset is not representative or when rare and
elusive examples are a fundamental aspect of the detection problem. In the present work, we
analyze strategies based on the predictors' variability for splitting data into training and
testing sets. Such strategies aim to guarantee the inclusion of rare or unusual examples with
minimal loss of the population's representativeness and to provide a more accurate estimate
of the generalization error when the dataset is not representative. Two baseline classifiers
based on decision trees were used to test the four splitting strategies considered. Both
classifiers were applied to CTU19, a dataset with low representativeness for a network
security detection problem. Preliminary results showed the importance of applying the three
alternative strategies in addition to the Monte Carlo splitting strategy in order to obtain a
more accurate error estimate under different but feasible scenarios.
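The abstract does not spell out the three alternative strategies, so the following Python sketch only illustrates the general idea under stated assumptions: it contrasts a Monte Carlo (repeated random) split with a hypothetical variability-aware split that keeps the most unusual rows, measured by their standardized deviation from the feature means, in the test set. The names `variability_split` and `tail_quantile` are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def monte_carlo_splits(X, y, n_repeats=10, test_size=0.3):
    """Repeated random splits: the usual baseline for estimating the error."""
    for seed in range(n_repeats):
        yield train_test_split(X, y, test_size=test_size, random_state=seed)


def variability_split(X, y, test_size=0.3, tail_quantile=0.95):
    """Keep the most unusual rows (largest standardized deviation from the
    feature means) in the test set; split the remaining rows at random."""
    deviation = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)).max(axis=1)
    rare = deviation >= np.quantile(deviation, tail_quantile)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[~rare], y[~rare], test_size=test_size, random_state=0
    )
    # Appending the rare rows to the test set lets the estimated error also
    # reflect the hard-to-observe part of the population.
    return X_tr, np.vstack([X_te, X[rare]]), y_tr, np.concatenate([y_te, y[rare]])
```

Training on the resulting training set and evaluating on the augmented test set gives a more pessimistic, and arguably more realistic, error estimate when rare examples drive the detection problem.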
Related papers
- Semi-supervised Learning For Robust Speech Evaluation [30.593420641501968]
Speech evaluation measures a learner's oral proficiency using automatic models.
This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization.
An anchor model is trained using pseudo labels to predict the correctness of pronunciation.
arXiv Detail & Related papers (2024-09-23T02:11:24Z)
- Prediction-powered Generalization of Causal Inferences [6.43357871718189]
We show how the limited size of trials makes generalization a statistically infeasible task.
We develop generalization algorithms that supplement the trial data with a prediction model learned from an additional observational study.
arXiv Detail & Related papers (2024-06-05T02:44:14Z)
- Restoring balance: principled under/oversampling of data for optimal classification [0.0]
Class imbalance in real-world data poses a common bottleneck for machine learning tasks.
Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically.
We provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered.
arXiv Detail & Related papers (2024-05-15T17:45:34Z)
- A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning [129.63326990812234]
We propose a technique named data-dependent contraction to capture how modified losses handle different classes.
On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps explain why re-weighting and logit adjustment work (a generic sketch of these two loss modifications follows).
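As a rough, generic illustration of the two loss families the bound covers, not the paper's derivation, the sketch below computes class re-weighting and logit adjustment for a single example; `tau` and the count-based priors are assumptions of the sketch.

```python
import numpy as np


def reweighted_ce(logits, label, class_counts):
    """Cross-entropy weighted inversely to the class prior (re-weighting)."""
    priors = class_counts / class_counts.sum()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label] / priors[label]


def logit_adjusted_ce(logits, label, class_counts, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior) (logit adjustment)."""
    priors = class_counts / class_counts.sum()
    adjusted = logits + tau * np.log(priors)  # rare classes get a more negative offset
    log_probs = adjusted - np.log(np.exp(adjusted).sum())
    return -log_probs[label]
```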
arXiv Detail & Related papers (2023-10-07T09:15:08Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper-bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k ≪ n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- A Statistical Model for Predicting Generalization in Few-Shot Classification [6.158812834002346]
We introduce a Gaussian model of the feature distribution to predict the generalization error.
We show that our approach outperforms alternatives such as the leave-one-out cross-validation strategy.
arXiv Detail & Related papers (2022-12-13T10:21:15Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies for dealing with this problem by balancing the number of examples of each class (a minimal resampling sketch follows).
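The sketch below shows the two basic resampling strategies mentioned above in plain NumPy; real comparisons typically include many more variants (e.g., SMOTE), which are not shown, and the function names are illustrative only.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Drop majority-class rows at random until every class has the minority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    idx = np.concatenate(
        [rng.choice(np.where(y == c)[0], size=counts.min(), replace=False) for c in classes]
    )
    return X[idx], y[idx]


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class has the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    idx = np.concatenate(
        [rng.choice(np.where(y == c)[0], size=counts.max(), replace=True) for c in classes]
    )
    return X[idx], y[idx]
```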
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Unsupervised Learning of Debiased Representations with Pseudo-Attributes [85.5691102676175]
We propose a simple but effective debiasing technique in an unsupervised manner.
We perform clustering in the feature embedding space and identify pseudo-attributes by taking advantage of the clustering results.
We then employ a novel cluster-based reweighting scheme for learning debiased representations (a rough reweighting sketch follows).
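The following is only a rough sketch of the idea as summarized above, not the authors' implementation: it assumes the pseudo-attribute is simply a k-means cluster index and that reweighting means upweighting samples from small clusters; `n_clusters` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_reweight(embeddings, n_clusters=8, seed=0):
    """Assign each sample a pseudo-attribute (its k-means cluster) and
    weight samples inversely to their cluster's size."""
    pseudo_attr = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    counts = np.bincount(pseudo_attr, minlength=n_clusters)
    weights = 1.0 / counts[pseudo_attr]
    return weights / weights.mean(), pseudo_attr  # mean weight normalized to 1
```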
arXiv Detail & Related papers (2021-08-06T05:20:46Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)