Evaluating machine learning models in non-standard settings: An overview
and new findings
- URL: http://arxiv.org/abs/2310.15108v1
- Date: Mon, 23 Oct 2023 17:15:11 GMT
- Title: Evaluating machine learning models in non-standard settings: An overview
and new findings
- Authors: Roman Hornung, Malte Nalenz, Lennart Schneider, Andreas Bender, Ludwig
Bothmann, Bernd Bischl, Thomas Augustin, Anne-Laure Boulesteix
- Abstract summary: Estimating the generalization error (GE) of machine learning models is fundamental.
In non-standard settings, particularly those where observations are not independently and identically distributed, resampling may lead to biased GE estimates.
This paper presents well-grounded guidelines for GE estimation in various such non-standard settings.
- Score: 7.834267158484847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the generalization error (GE) of machine learning models is
fundamental, with resampling methods being the most common approach. However,
in non-standard settings, particularly those where observations are not
independently and identically distributed, resampling using simple random data
divisions may lead to biased GE estimates. This paper strives to present
well-grounded guidelines for GE estimation in various such non-standard
settings: clustered data, spatial data, unequal sampling probabilities, concept
drift, and hierarchically structured outcomes. Our overview combines
well-established methodologies with other existing methods that, to our
knowledge, have not been frequently considered in these particular settings. A
unifying principle among these techniques is that the test data used in each
iteration of the resampling procedure should reflect the new observations to
which the model will be applied, while the training data should be
representative of the entire data set used to obtain the final model. Beyond
providing an overview, we address literature gaps by conducting simulation
studies. These studies assess the necessity of using GE-estimation methods
tailored to the respective setting. Our findings corroborate the concern that
standard resampling methods often yield biased GE estimates in non-standard
settings, underscoring the importance of tailored GE estimation.
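To make the unifying principle above concrete, here is a minimal sketch (not taken from the paper; the data-generating setup and all parameter values are hypothetical) of how a tailored resampling scheme differs from naive random splitting for clustered data, using scikit-learn's GroupKFold so that each test fold consists of whole, previously unseen clusters.
```python
# Minimal sketch, assuming hypothetical clustered data: naive K-fold CV mixes
# observations from the same cluster into training and test folds, while
# GroupKFold leaves whole clusters out so each test fold mimics "new" clusters.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical clustered data: 50 clusters with cluster-specific random effects,
# so observations within a cluster are correlated.
n_clusters, per_cluster = 50, 20
groups = np.repeat(np.arange(n_clusters), per_cluster)
cluster_effect = rng.normal(scale=2.0, size=n_clusters)[groups]
X = rng.normal(size=(n_clusters * per_cluster, 5))
y = X[:, 0] + cluster_effect + rng.normal(size=len(groups))

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Naive resampling: random splits ignore the cluster structure and tend to be
# optimistic when the model will later be applied to entirely new clusters.
naive = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)

# Tailored resampling: GroupKFold keeps clusters intact, so the test data in
# each iteration reflects the new observations the model will be applied to.
grouped = cross_val_score(
    model, X, y, groups=groups,
    cv=GroupKFold(n_splits=5),
    scoring="neg_mean_squared_error",
)

print("naive K-fold CV MSE:", -naive.mean())
print("grouped CV MSE:     ", -grouped.mean())

# Analogously, for concept drift one would train on earlier and test on later
# observations, e.g. via sklearn.model_selection.TimeSeriesSplit.
```
In setups like this, the naive estimate is often optimistic relative to the grouped one, which is the kind of bias the paper's simulation studies quantify across the settings listed above.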
Related papers
- Generalization is not a universal guarantee: Estimating similarity to training data with an ensemble out-of-distribution metric [0.09363323206192666]
Failure of machine learning models to generalize to new data is a core problem limiting the reliability of AI systems.
We propose a standardized approach for assessing data similarity by constructing a supervised autoencoder for generalizability estimation (SAGE).
We show that out-of-the-box model performance increases after SAGE score filtering, even when applied to data from the model's own training and test datasets.
arXiv Detail & Related papers (2025-02-22T19:21:50Z)
- Deep evolving semi-supervised anomaly detection [14.027613461156864]
The aim of this paper is to formalise the task of continual semi-supervised anomaly detection (CSAD).
The paper introduces a baseline model of a variational autoencoder (VAE) to work with semi-supervised data along with a continual learning method of deep generative replay with outlier rejection.
arXiv Detail & Related papers (2024-12-01T15:48:37Z)
- Aggregation Weighting of Federated Learning via Generalization Bound Estimation [65.8630966842025]
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions.
We replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model.
arXiv Detail & Related papers (2023-11-10T08:50:28Z)
- A Generic Machine Learning Framework for Fully-Unsupervised Anomaly Detection with Contaminated Data [0.0]
We introduce a framework for a fully unsupervised refinement of contaminated training data for AD tasks.
The framework is generic and can be applied to any residual-based machine learning model.
We show its clear superiority over the naive approach of training with contaminated data without refinement.
arXiv Detail & Related papers (2023-08-25T12:47:59Z)
- Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.
Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
- Unsupervised Anomaly Detection via Nonlinear Manifold Learning [0.0]
Anomalies are samples that significantly deviate from the rest of the data and their detection plays a major role in building machine learning models.
We introduce a robust, efficient, and interpretable methodology based on nonlinear manifold learning to detect anomalies in unsupervised settings.
arXiv Detail & Related papers (2023-06-15T18:48:10Z)
- Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
- MRCLens: an MRC Dataset Bias Detection Toolkit [82.44296974850639]
We introduce MRCLens, a toolkit that detects whether biases exist before users train the full model.
For the convenience of introducing the toolkit, we also provide a categorization of common biases in MRC.
arXiv Detail & Related papers (2022-07-18T21:05:39Z)
- Studying Generalization Through Data Averaging [0.0]
We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples.
We predict some aspects about how the generalization gap and model train and test performance vary as a function of SGD noise.
arXiv Detail & Related papers (2022-06-28T00:03:40Z)
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.