A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough?
- URL: http://arxiv.org/abs/2503.16589v1
- Date: Thu, 20 Mar 2025 17:38:50 GMT
- Title: A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough?
- Authors: Moslem Noori, Elisabetta Valiante, Thomas Van Vaerenbergh, Masoud Mohseni, Ignacio Rozada
- Abstract summary: We present a statistical analysis of the common metrics and develop guidelines for experiment design. We derive a lower bound on the number of repeats needed to guarantee a given accuracy in the metrics. We propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric.
- Score: 0.8575004906002217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key trait of stochastic optimizers is that multiple runs of the same optimizer attempting to solve the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the problem. However, the accuracy of the estimated performance metrics depends on the number of runs and should be studied using statistical tools. We present a statistical analysis of the common metrics and develop guidelines for experiment design to measure the optimizer's performance using these metrics to a high level of confidence and accuracy. To this end, we first discuss the confidence intervals of the metrics and how they relate to the number of runs of an experiment. We then derive a lower bound on the number of repeats needed to guarantee a given accuracy in the metrics. Using this bound, we propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric. Our simulation results demonstrate the utility of our analysis, showing how it allows us to conduct reliable benchmarking as well as hyperparameter tuning and prevents us from drawing premature conclusions regarding the performance of stochastic optimizers.
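As a rough illustration of the adaptive-repeats idea, the Python sketch below keeps launching runs until a normal-approximation confidence interval for the estimated success probability is narrower than a target half-width. The stopping rule, thresholds, and the `run_optimizer_once` stand-in are all assumptions for illustration, not the paper's exact algorithm.

```python
import math
import random

def run_optimizer_once():
    """Hypothetical stand-in for one run of a stochastic optimizer."""
    return random.random() < 0.3  # assumed 30% per-run success probability

def adaptive_repeats(target_half_width=0.05, z=1.96, min_runs=30, max_runs=100_000):
    """Repeat runs until the normal-approximation CI half-width for the
    success probability drops below target_half_width (illustrative rule)."""
    successes, n = 0, 0
    while n < max_runs:
        successes += run_optimizer_once()
        n += 1
        if n >= min_runs:
            p_hat = successes / n
            half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
            if half_width <= target_half_width:
                break
    return successes / n, n

p_hat, n_runs = adaptive_repeats()
print(f"estimated success probability {p_hat:.3f} after {n_runs} runs")
```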
Related papers
- Quantifying Uncertainty and Variability in Machine Learning: Confidence Intervals for Quantiles in Performance Metric Distributions [0.17265013728931003]
Machine learning models are widely used in applications where reliability and robustness are critical. Model evaluation often relies on single-point estimates of performance metrics that fail to capture the inherent variability in model performance. This contribution explores the use of quantiles and confidence intervals to analyze such distributions, providing a more complete understanding of model performance and its uncertainty.
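As one concrete construction in this spirit (a generic textbook recipe, not necessarily the paper's method), the sketch below builds a distribution-free confidence interval for a quantile of a metric sample from order statistics via the binomial distribution; exact index conventions vary slightly across references.

```python
import numpy as np
from scipy.stats import binom

def quantile_ci(samples, q=0.9, alpha=0.05):
    """Distribution-free CI for the q-th quantile from order statistics.
    Textbook construction; index conventions differ by a unit or so."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    lo = int(binom.ppf(alpha / 2, n, q))      # lower order-statistic index (0-based)
    hi = int(binom.ppf(1 - alpha / 2, n, q))  # upper order-statistic index (0-based)
    return x[max(lo, 0)], x[min(hi, n - 1)]

scores = np.random.default_rng(0).normal(0.8, 0.05, size=200)  # fake metric values
print(quantile_ci(scores, q=0.9))
```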
arXiv Detail & Related papers (2025-01-28T13:21:34Z)
- Semiparametric conformal prediction [79.6147286161434]
We construct a conformal prediction set accounting for the joint correlation structure of the vector-valued non-conformity scores. We flexibly estimate the joint cumulative distribution function (CDF) of the scores. Our method yields desired coverage and competitive efficiency on a range of real-world regression problems.
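For orientation only, here is a minimal split-conformal sketch with scalar residual scores; the paper's joint treatment of vector-valued scores is not reproduced, and the model and data below are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def split_conformal_interval(model, X_cal, y_cal, X_new, alpha=0.1):
    """Basic split-conformal interval from absolute residual scores."""
    scores = np.abs(y_cal - model.predict(X_cal))   # non-conformity scores
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))         # conformal quantile rank
    q = np.sort(scores)[min(k, n) - 1]
    pred = model.predict(X_new)
    return pred - q, pred + q                       # marginal coverage >= 1 - alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=400)
model = LinearRegression().fit(X[:200], y[:200])    # fit on a proper training split
lo, hi = split_conformal_interval(model, X[200:], y[200:], X[:5])
```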
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
- Online Estimation with Rolling Validation: Adaptive Nonparametric Estimation with Streaming Data [13.069717985067937]
We propose a weighted rolling validation procedure, an online variant of leave-one-out cross-validation, that adds minimal extra cost for many typical gradient descent estimators. Our analysis is straightforward, relying mainly on some general statistical assumptions.
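A toy sketch of the rolling-validation idea under an assumed exponential-decay weighting (the paper's weighting scheme may differ): each candidate SGD estimator is scored on every incoming point before it trains on that point.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
step_sizes = [0.001, 0.01, 0.1]               # candidate tuning parameters
weights = [np.zeros(2) for _ in step_sizes]   # one linear model per candidate
rolling = np.zeros(len(step_sizes))           # decayed rolling validation errors
decay = 0.99                                  # assumed weighting, for illustration

for t in range(5000):
    x = rng.normal(size=2)
    y = x @ w_true + rng.normal()
    for i, (eta, w) in enumerate(zip(step_sizes, weights)):
        err = (y - x @ w) ** 2                # score before updating (prequential)
        rolling[i] = decay * rolling[i] + (1 - decay) * err
        w += eta * (y - x @ w) * x            # online SGD update

print("selected step size:", step_sizes[int(np.argmin(rolling))])
```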
arXiv Detail & Related papers (2023-10-18T17:52:57Z)
- Multi-Fidelity Covariance Estimation in the Log-Euclidean Geometry [0.0]
We introduce a multi-fidelity estimator of covariance matrices that employs the log-Euclidean geometry of the symmetric positive-definite manifold.
We develop an optimal sample allocation scheme that minimizes the mean-squared error of the estimator given a fixed budget.
Evaluations of our approach using data from physical applications demonstrate more accurate metric learning and speedups of more than one order of magnitude compared to benchmarks.
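A hedged sketch of the log-Euclidean idea (the paper's optimal sample-allocation scheme is not implemented here): combine high- and low-fidelity covariance estimates with a control-variate-style correction in matrix-log space, then map back with the matrix exponential so the result stays on the SPD manifold. All data and names below are synthetic and illustrative.

```python
import numpy as np
from scipy.linalg import logm, expm

def sample_cov(X):
    return np.cov(X, rowvar=False)

def multifidelity_cov(X_hi, X_lo_paired, X_lo_many, alpha=1.0):
    """Illustrative log-Euclidean multi-fidelity covariance estimator."""
    L_hi = logm(sample_cov(X_hi))                # few expensive samples
    L_lo_paired = logm(sample_cov(X_lo_paired))  # cheap model, same inputs
    L_lo_many = logm(sample_cov(X_lo_many))      # cheap model, many samples
    return expm(L_hi + alpha * (L_lo_many - L_lo_paired))

rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 0.5])
X_hi = rng.normal(size=(50, 3)) @ A               # synthetic high-fidelity data
X_lo = X_hi + 0.2 * rng.normal(size=(50, 3))      # correlated cheap surrogate
X_lo_many = np.vstack([X_lo, rng.normal(size=(5000, 3)) @ A
                       + 0.2 * rng.normal(size=(5000, 3))])
print(multifidelity_cov(X_hi, X_lo, X_lo_many).round(2))
```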
arXiv Detail & Related papers (2023-01-31T16:33:46Z)
- Analyzing the Impact of Undersampling on the Benchmarking and Configuration of Evolutionary Algorithms [3.967483941966979]
We show that care should be taken when making decisions based on limited data.
We show examples of performance losses of more than 20%, even when using statistical races to dynamically adjust the number of runs.
arXiv Detail & Related papers (2022-04-20T09:53:59Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
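In the same resampling spirit, a hypothetical paired-bootstrap sketch for the difference in correlation with human judgments between two automatic metrics (all names and data are made up):

```python
import numpy as np

def paired_bootstrap_corr_diff(human, metric_a, metric_b, n_boot=10_000, seed=0):
    """95% bootstrap CI for corr(human, A) - corr(human, B) over resampled examples."""
    rng = np.random.default_rng(seed)
    n = len(human)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample examples with replacement
        h, a, m = human[idx], metric_a[idx], metric_b[idx]
        diffs[b] = np.corrcoef(h, a)[0, 1] - np.corrcoef(h, m)[0, 1]
    return np.percentile(diffs, [2.5, 97.5])

rng = np.random.default_rng(1)
h = rng.normal(size=300)
a = h + rng.normal(size=300)                  # metric A: more correlated with humans
m = h + 2 * rng.normal(size=300)              # metric B: noisier
print(paired_bootstrap_corr_diff(h, a, m))
```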
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- A Framework for Sample Efficient Interval Estimation with Control Variates [94.32811054797148]
We consider the problem of estimating confidence intervals for the mean of a random variable.
Under certain conditions, we show improved efficiency compared to existing estimation algorithms.
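For context, a textbook control-variate example (the paper's framework is more general): estimate E[exp(U)] for U uniform on [0, 1], using U itself, whose mean 1/2 is known, as the control.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
y = np.exp(u)                       # target: E[exp(U)] = e - 1
c = u                               # control variate with known mean 0.5

cov = np.cov(y, c)
beta = cov[0, 1] / cov[1, 1]        # estimated optimal coefficient
plain = y.mean()
cv = (y - beta * (c - 0.5)).mean()  # variance-reduced estimate
print(f"plain MC: {plain:.5f}  control variate: {cv:.5f}  true: {np.e - 1:.5f}")
```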
arXiv Detail & Related papers (2020-06-18T05:42:30Z)
- Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level and its degree of (in)stability when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z)
- Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators of the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
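A simplified sketch of a cross-fit AIPW (doubly-robust) estimator of the ACE, with random-forest nuisance models assumed for illustration; this follows the generic recipe rather than the paper's exact simulation design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def crossfit_aipw(X, A, Y, n_splits=2, seed=0):
    """Cross-fit AIPW: nuisance models fit on one fold, evaluated on the other."""
    psi = np.empty(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        ps = RandomForestClassifier(random_state=seed).fit(X[train], A[train])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)  # propensity
        m1 = RandomForestRegressor(random_state=seed).fit(
            X[train][A[train] == 1], Y[train][A[train] == 1]).predict(X[test])
        m0 = RandomForestRegressor(random_state=seed).fit(
            X[train][A[train] == 0], Y[train][A[train] == 0]).predict(X[test])
        a, y = A[test], Y[test]
        psi[test] = m1 - m0 + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(psi))

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # confounded treatment
Y = 1.5 * A + X[:, 1] + rng.normal(size=n)        # true ACE = 1.5
ace, se = crossfit_aipw(X, A, Y)
print(f"ACE estimate {ace:.2f} +/- {1.96 * se:.2f}")
```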
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
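A simplified numpy sketch of the transductive update: queries are soft-assigned to prototypes and each prototype is recomputed as a confidence-weighted mean of support and query embeddings; a fixed softmax temperature stands in here for the paper's meta-learned confidence.

```python
import numpy as np

def refine_prototypes(prototypes, support, support_labels, queries, temp=10.0):
    """One confidence-weighted prototype update (fixed temperature, illustrative)."""
    d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    logits = -temp * d
    conf = np.exp(logits - logits.max(axis=1, keepdims=True))
    conf /= conf.sum(axis=1, keepdims=True)       # soft assignment weights
    new = []
    for k in range(len(prototypes)):
        sup_k = support[support_labels == k]
        num = sup_k.sum(0) + (conf[:, k:k + 1] * queries).sum(0)
        new.append(num / (len(sup_k) + conf[:, k].sum()))
    return np.stack(new)

rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 64))                 # 5-way episode, 64-d features
support = rng.normal(size=(25, 64))               # 5 shots per class
labels = np.repeat(np.arange(5), 5)
protos = refine_prototypes(protos, support, labels, rng.normal(size=(75, 64)))
```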
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.