Measuring training variability from stochastic optimization using robust nonparametric testing
- URL: http://arxiv.org/abs/2406.08307v2
- Date: Tue, 15 Apr 2025 18:34:06 GMT
- Title: Measuring training variability from stochastic optimization using robust nonparametric testing
- Authors: Sinjini Banerjee, Tim Marrinan, Reilly Cannon, Tony Chiang, Anand D. Sarwate
- Abstract summary: We propose a robust hypothesis testing framework and a novel summary statistic, the $\alpha$-trimming level, to measure model similarity. Applying hypothesis testing directly with the $\alpha$-trimming level is challenging because we cannot accurately describe the distribution under the null hypothesis. We show how to use the $\alpha$-trimming level to measure model variability and demonstrate experimentally that it is more expressive than performance metrics.
- Score: 5.519968037738177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural network training often involves stochastic optimization, meaning each run will produce a different model. This implies that hyperparameters of the training process, such as the random seed itself, can potentially have significant influence on the variability in the trained models. Measuring model quality by summary statistics, such as test accuracy, can obscure this dependence. We propose a robust hypothesis testing framework and a novel summary statistic, the $\alpha$-trimming level, to measure model similarity. Applying hypothesis testing directly with the $\alpha$-trimming level is challenging because we cannot accurately describe the distribution under the null hypothesis. Our framework addresses this issue by determining how closely an approximate distribution resembles the expected distribution of a group of individually trained models and using this approximation as our reference. We then use the $\alpha$-trimming level to suggest how many training runs should be sampled to ensure that an ensemble is a reliable representative of the true model performance. We also show how to use the $\alpha$-trimming level to measure model variability and demonstrate experimentally that it is more expressive than performance metrics like validation accuracy, churn, or expected calibration error when taken alone. An application of fine-tuning over random seed in transfer learning illustrates the advantage of our new metric.
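The $\alpha$-trimming statistic itself is not reproduced here. As a minimal sketch of the workflow the abstract describes (train the same architecture under several random seeds, then compare the pre-threshold outputs of the resulting models with a nonparametric two-sample test), the snippet below uses scikit-learn's MLPClassifier on synthetic data and a Kolmogorov-Smirnov statistic as a stand-in summary statistic; the model, data, and constants are illustrative assumptions, not the paper's implementation.
```python
# Illustrative sketch only: the KS statistic below is a stand-in for the
# paper's alpha-trimming level, not its implementation.
import itertools

from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

# Train the same architecture under several random seeds.
seeds = [1, 2, 3, 4, 5]
scores = []  # per-seed distribution of positive-class probabilities on the test set
for seed in seeds:
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=seed)
    clf.fit(X_train, y_train)
    scores.append(clf.predict_proba(X_test)[:, 1])

# Pairwise nonparametric comparison of the pre-threshold output distributions.
for (i, a), (j, b) in itertools.combinations(enumerate(scores), 2):
    stat, pval = ks_2samp(a, b)
    print(f"seeds {seeds[i]} vs {seeds[j]}: KS={stat:.3f}, p={pval:.3f}")
```
Small, consistent statistics across seed pairs would indicate low training variability; large or erratic ones flag the seed sensitivity that a single summary metric such as test accuracy can hide.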
Related papers
- Quantifying Uncertainty and Variability in Machine Learning: Confidence Intervals for Quantiles in Performance Metric Distributions [0.17265013728931003]
Machine learning models are widely used in applications where reliability and robustness are critical.
Model evaluation often relies on single-point estimates of performance metrics that fail to capture the inherent variability in model performance.
This contribution explores the use of quantiles and confidence intervals to analyze such distributions, providing a more complete understanding of model performance and its uncertainty.
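A hedged sketch of the general idea (not necessarily the authors' procedure): estimate a quantile of the performance-metric distribution over repeated runs and attach a bootstrap confidence interval to it. The accuracy values and the helper name quantile_ci below are made up for illustration.
```python
# Illustrative sketch: a percentile-bootstrap confidence interval for a
# quantile of a performance-metric distribution, here the 10th percentile
# of per-run test accuracies.
import numpy as np

rng = np.random.default_rng(0)
accuracies = rng.normal(loc=0.90, scale=0.02, size=50)  # stand-in for 50 training runs

def quantile_ci(samples, q=0.10, n_boot=10_000, alpha=0.05, rng=rng):
    """Percentile-bootstrap CI for the q-th quantile of `samples`."""
    boots = np.array([
        np.quantile(rng.choice(samples, size=len(samples), replace=True), q)
        for _ in range(n_boot)
    ])
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

point = np.quantile(accuracies, 0.10)
lo, hi = quantile_ci(accuracies)
print(f"10th-percentile accuracy: {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```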
arXiv Detail & Related papers (2025-01-28T13:21:34Z) - Source-Free Unsupervised Domain Adaptation with Hypothesis Consolidation of Prediction Rationale [53.152460508207184]
Source-Free Unsupervised Domain Adaptation (SFUDA) is a challenging task where a model needs to be adapted to a new domain without access to target domain labels or source domain data.
This paper proposes a novel approach that considers multiple prediction hypotheses for each sample and investigates the rationale behind each hypothesis.
To achieve the optimal performance, we propose a three-step adaptation process: model pre-adaptation, hypothesis consolidation, and semi-supervised learning.
arXiv Detail & Related papers (2024-02-02T05:53:22Z) - Robust Nonparametric Hypothesis Testing to Understand Variability in Training Neural Networks [5.8490454659691355]
We propose a new measure of closeness between classification models based on the output of the network before thresholding.
Our measure is based on a robust hypothesis-testing framework and can be adapted to other quantities derived from trained models.
arXiv Detail & Related papers (2023-10-01T01:44:35Z) - Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of gradient exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
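A rough sketch of the shared-randomness idea in general (not the paper's protocol): workers that share a seed can regenerate the same random perturbation locally, so each step only requires communicating a single scalar projection of the gradient. The toy objective and constants below are assumptions.
```python
# Hypothetical sketch: with a shared seed, both workers rebuild the same
# perturbation z, so only the scalar finite-difference estimate g is exchanged.
import numpy as np

dim, steps, lr, eps = 10, 500, 0.01, 1e-3
theta = np.zeros(dim)

def loss(t):  # stand-in objective; a real system would evaluate the model loss
    return np.sum((t - 1.0) ** 2)

for step in range(steps):
    # Both workers derive the same perturbation from the shared seed + step index.
    z = np.random.default_rng(12345 + step).standard_normal(dim)
    # Worker A computes a one-dimensional finite-difference estimate...
    g = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    # ...and sends only the scalar g; worker B reconstructs z and applies the update.
    theta -= lr * g * z

print("final loss:", loss(theta))
```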
arXiv Detail & Related papers (2023-06-16T17:59:51Z) - Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
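One concrete instance of this recipe, sketched below under the assumption that samples have already been embedded into a feature space, is the Fréchet distance between Gaussian fits of the two feature sets; random matrices stand in for real and generated features.
```python
# Sketch of the standard recipe: fit a Gaussian to each feature set and
# compute the Frechet distance between the fits.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # discard numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(5000, 64))        # stand-in for embedded real samples
fake_feats = rng.normal(loc=0.1, size=(5000, 64))  # stand-in for embedded generated samples
print("Frechet distance:", frechet_distance(real_feats, fake_feats))
```
How the embedding is produced, how many samples are used, and how the distance is estimated are exactly the factors the study analyzes.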
arXiv Detail & Related papers (2023-04-04T17:54:32Z) - Estimating Regression Predictive Distributions with Sample Networks [17.935136717050543]
A common approach to model uncertainty is to choose a parametric distribution and fit the data to it using maximum likelihood estimation.
The chosen parametric form can be a poor fit to the data-generating distribution, resulting in unreliable uncertainty estimates.
We propose SampleNet, a flexible and scalable architecture for modeling uncertainty that avoids specifying a parametric form on the output distribution.
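SampleNet is not reproduced here, but the failure mode motivating it can be illustrated: fitting a single Gaussian by maximum likelihood to bimodal data yields a much lower average log-likelihood than a model whose form matches the data. The snippet below is a hypothetical illustration using scikit-learn.
```python
# Illustrative sketch of the failure mode described above, not the paper's model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 0.5, 1000), rng.normal(3, 0.5, 1000)]).reshape(-1, 1)

single = GaussianMixture(n_components=1).fit(data)    # parametric MLE: one Gaussian
flexible = GaussianMixture(n_components=2).fit(data)  # better-matched model

print("avg log-likelihood, single Gaussian:", single.score(data))
print("avg log-likelihood, 2-component mix:", flexible.score(data))
```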
arXiv Detail & Related papers (2022-11-24T17:23:29Z) - We need to talk about random seeds [16.33770822558325]
This opinion piece argues that there are some safe uses for random seeds.
An analysis of 85 recent publications from the ACL Anthology finds that more than 50% contain risky uses of random seeds.
arXiv Detail & Related papers (2022-10-24T16:48:45Z) - Post-Selection Confidence Bounds for Prediction Performance [2.28438857884398]
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks.
We propose an algorithm for computing valid lower confidence bounds for multiple models that have been selected based on their prediction performance on the evaluation set.
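The paper's algorithm is not reproduced here. As a generic sketch of the underlying concern (selection and assessment on the same evaluation set), the snippet computes per-model bootstrap lower confidence bounds with a simple Bonferroni correction, so the bound for the selected best model remains approximately valid; the accuracies are simulated.
```python
# Generic sketch, not the paper's method: simultaneous lower confidence bounds
# via per-model one-sided bootstrap bounds with a Bonferroni adjustment.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_eval = 5, 1000
true_accs = np.array([0.84, 0.85, 0.86, 0.87, 0.88]).reshape(-1, 1)
# Simulated 0/1 correctness of each candidate model on a shared evaluation set.
correct = (rng.random((n_models, n_eval)) < true_accs).astype(int)

alpha = 0.05
alpha_adj = alpha / n_models  # adjustment for selecting among n_models candidates

for m in range(n_models):
    boots = np.array([
        rng.choice(correct[m], size=n_eval, replace=True).mean()
        for _ in range(5000)
    ])
    lower = np.quantile(boots, alpha_adj)
    print(f"model {m}: accuracy={correct[m].mean():.3f}, lower bound={lower:.3f}")
```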
arXiv Detail & Related papers (2022-10-24T13:28:43Z) - Demystifying Randomly Initialized Networks for Evaluating Generative Models [28.8899914083501]
Evaluation of generative models is mostly based on the comparison between the estimated distribution and the ground truth distribution in a certain feature space.
To embed samples into informative features, previous works often use convolutional neural networks optimized for classification.
In this paper, we rigorously investigate the feature space of models with random weights in comparison to that of trained models.
arXiv Detail & Related papers (2022-08-19T08:43:53Z) - Convergence for score-based generative modeling with polynomial complexity [9.953088581242845]
We prove the first convergence guarantees for the core mechanic behind Score-based generative modeling.
Compared to previous works, we do not incur error that grows exponentially in time or that suffers from a curse of dimensionality.
We show that a predictor-corrector gives better convergence than using either portion alone.
arXiv Detail & Related papers (2022-06-13T14:57:35Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
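The PSD-model machinery is not shown here; the sketch below only mirrors the two-step recipe, with a normalized grid of function evaluations standing in for the fitted model in step one and categorical sampling from that grid in step two.
```python
# Crude stand-in for the two-step recipe: (1) approximate the target density
# from pointwise evaluations, (2) sample from the approximation.
import numpy as np

def unnormalized_density(x):  # arbitrary non-negative target function
    return np.exp(-x**2) * (2 + np.sin(5 * x))

rng = np.random.default_rng(0)
grid = np.linspace(-4, 4, 2001)
weights = unnormalized_density(grid)   # step 1: pointwise evaluations
probs = weights / weights.sum()

samples = rng.choice(grid, size=10_000, p=probs)  # step 2: sample the model
print("sample mean/std:", samples.mean(), samples.std())
```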
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Training on Test Data with Bayesian Adaptation for Covariate Shift [96.3250517412545]
Deep neural networks often make inaccurate predictions with unreliable uncertainty estimates.
We derive a Bayesian model that provides for a well-defined relationship between unlabeled inputs under distributional shift and model parameters.
We show that our method improves both accuracy and uncertainty estimation.
arXiv Detail & Related papers (2021-09-27T01:09:08Z) - Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? [0.2836066255205732]
We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models.
We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin.
We also find that deterministic models are on par; in fact, they consistently (though not significantly) outperform their probabilistic counterparts.
arXiv Detail & Related papers (2021-07-24T11:38:25Z) - On Misspecification in Prediction Problems and Robustness via Improper Learning [23.64462813525688]
We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a "proper" predictor has a lower bound scaling at least as $\sqrt{\gamma n}$.
We exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family.
arXiv Detail & Related papers (2021-01-13T17:54:08Z) - Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
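A hedged sketch of the transductive prototype update described above, with plain softmax scores standing in for the meta-learned confidence weights; the shapes and iteration count are arbitrary.
```python
# Sketch of a soft, confidence-weighted prototype refinement (a stand-in for
# the paper's meta-learned confidence weighting).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_way, dim, n_query = 5, 64, 75
prototypes = rng.normal(size=(n_way, dim))  # from the labeled support set
queries = rng.normal(size=(n_query, dim))   # unlabeled query embeddings

for _ in range(3):  # a few refinement rounds
    # Negative squared distance to each prototype -> per-class confidence.
    dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    conf = softmax(-dists, axis=1)  # shape (n_query, n_way)
    # Update each prototype with the confidence-weighted mean of the queries.
    prototypes = conf.T @ queries / conf.sum(axis=0, keepdims=True).T

print("refined prototype matrix shape:", prototypes.shape)
```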
arXiv Detail & Related papers (2020-02-27T10:22:17Z) - On the Discrepancy between Density Estimation and Sequence Generation [92.70116082182076]
Log-likelihood is highly correlated with BLEU when we consider models within the same family.
We observe no correlation between rankings of models across different families.
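This kind of within-family versus cross-family check can be illustrated with a rank correlation; the log-likelihood and BLEU values below are invented for the sketch.
```python
# Illustration only: rank-correlate log-likelihood with BLEU within one model
# family versus pooled across families (numbers are made up).
from scipy.stats import spearmanr

log_likelihood = [-1.9, -1.8, -1.7, -1.6, -1.5, -2.4, -2.2, -2.0]
bleu           = [21.0, 22.5, 23.1, 24.0, 24.8, 25.5, 26.2, 27.0]
# First five models belong to family A, the last three to family B.

rho_within, _ = spearmanr(log_likelihood[:5], bleu[:5])  # within family A
rho_pooled, _ = spearmanr(log_likelihood, bleu)          # pooled across families
print(f"within-family Spearman rho: {rho_within:.2f}, pooled: {rho_pooled:.2f}")
```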
arXiv Detail & Related papers (2020-02-17T20:13:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.