Faithful Model Evaluation for Model-Based Metrics
- URL: http://arxiv.org/abs/2312.17254v1
- Date: Tue, 19 Dec 2023 19:41:33 GMT
- Title: Faithful Model Evaluation for Model-Based Metrics
- Authors: Palash Goyal, Qian Hu, Rahul Gupta
- Abstract summary: We establish the mathematical foundation of significance testing for model-based metrics.
We show that considering metric model errors to calculate sample variances for model-based metrics changes the conclusions in certain experiments.
- Score: 22.753929098534403
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Statistical significance testing is used in natural language processing (NLP)
to determine whether the results of a study or experiment are likely to be due
to chance or if they reflect a genuine relationship. A key step in significance
testing is the estimation of confidence interval which is a function of sample
variance. Sample variance calculation is straightforward when evaluating
against ground truth. However, in many cases, a metric model is often used for
evaluation. For example, to compare toxicity of two large language models, a
toxicity classifier is used for evaluation. Existing works usually do not
consider the variance change due to metric model errors, which can lead to
wrong conclusions. In this work, we establish the mathematical foundation of
significance testing for model-based metrics. With experiments on public
benchmark datasets and a production system, we show that considering metric
model errors to calculate sample variances for model-based metrics changes the
conclusions in certain experiments.
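As a concrete illustration of the abstract's point, the sketch below contrasts a naive sample variance with one that also propagates the metric model's errors, using a generic Rogan-Gladen-style correction and a parametric bootstrap. The counts, sensitivity, and specificity are hypothetical and the correction is a textbook device, not the paper's derivation.

```python
# Generic illustration (not the paper's derivation): the variance of a
# model-based metric widens once the metric model's own errors are propagated.
# Scenario: an LLM's toxicity rate is estimated with an imperfect toxicity
# classifier whose sensitivity/specificity come from a finite validation set.
# All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n = 2000                      # generated responses scored by the classifier
k_flagged = 240               # responses the classifier flags as toxic
n_val = 500                   # labeled examples used to estimate se and sp
se_hat, sp_hat = 0.90, 0.95   # assumed sensitivity / specificity estimates

q_hat = k_flagged / n                    # apparent (classifier-based) toxic rate
naive_var = q_hat * (1 - q_hat) / n      # sample variance ignoring metric errors

def corrected_rate(q, se, sp):
    """Rogan-Gladen-style correction for imperfect sensitivity/specificity."""
    return np.clip((q + sp - 1.0) / (se + sp - 1.0), 0.0, 1.0)

# Parametric bootstrap: jointly propagate sampling noise in q, se and sp.
B = 20_000
q_b = rng.binomial(n, q_hat, B) / n
se_b = rng.binomial(n_val, se_hat, B) / n_val
sp_b = rng.binomial(n_val, sp_hat, B) / n_val
p_b = corrected_rate(q_b, se_b, sp_b)

print(f"naive 95% CI half-width:          {1.96 * naive_var ** 0.5:.4f}")
print(f"error-adjusted 95% CI half-width: {1.96 * p_b.std(ddof=1):.4f}")
```

The parametric bootstrap is used here purely for simplicity; a closed-form delta-method variance would serve the same illustrative purpose.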
Related papers
- Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification [3.1850615666574806]
This study investigates how consistent metrics are at evaluating different models under different data scenarios.
I find that for binary classification tasks, evaluation metrics that are less influenced by prevalence offer a more consistent ranking of a set of different models.
arXiv Detail & Related papers (2024-08-19T17:52:38Z) - Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting [1.8416014644193064]
This study uses Monte Carlo simulations to quantify the interactions between the employed cross-validation method and the discriminative power of features.
The required sample size with a single holdout could be 50% higher than what would be needed if nested cross-validation were used.
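A minimal sketch of the two estimation schemes compared above, a single holdout versus nested cross-validation around hyperparameter tuning; the synthetic dataset, SVM, and grid are illustrative assumptions, not the study's Monte Carlo setup.

```python
# Sketch: single-holdout vs nested cross-validation estimates of tuned-model
# performance. Dataset, model and hyperparameter grid are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}

# Single holdout: tune on the training split, report one score on the test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout = GridSearchCV(SVC(), grid, cv=5).fit(X_tr, y_tr).score(X_te, y_te)

# Nested CV: an outer loop estimates the performance of the whole tuning procedure.
nested = cross_val_score(GridSearchCV(SVC(), grid, cv=5), X, y, cv=5)

print(f"single holdout accuracy: {holdout:.3f}")
print(f"nested CV accuracy:      {nested.mean():.3f} +/- {nested.std():.3f}")
```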
arXiv Detail & Related papers (2023-08-22T05:14:42Z) - Logistic Regression Equivalence: A Framework for Comparing Logistic Regression Models Across Populations [4.518012967046983]
We argue that equivalence testing for a prespecified tolerance level on population differences incentivizes accuracy in the inference.
For diagnosis data, we show examples for equivalent and non-equivalent models.
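The flavor of such equivalence testing can be sketched with two one-sided tests (TOST) on the difference of a logistic regression slope between two populations; the simulated data, tolerance margin, and test statistic are illustrative assumptions rather than the paper's framework.

```python
# TOST sketch: declare a logistic-regression slope "equivalent" across two
# populations only if its difference lies within a prespecified margin.
# Simulated data and margin are illustrative assumptions.
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(0)

def fit_slope(n, beta):
    x = rng.normal(size=n)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 + beta * x))))
    res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    return res.params[1], res.bse[1]           # slope estimate and its std. error

b1, se1 = fit_slope(2000, beta=0.80)           # population 1
b2, se2 = fit_slope(2000, beta=0.85)           # population 2
diff, se_diff = b1 - b2, np.sqrt(se1 ** 2 + se2 ** 2)

delta = 0.3                                    # prespecified equivalence margin
p_lower = 1 - norm.cdf((diff + delta) / se_diff)   # H0: diff <= -delta
p_upper = norm.cdf((diff - delta) / se_diff)       # H0: diff >= +delta
print(f"TOST p-value: {max(p_lower, p_upper):.4f} (equivalence if below alpha)")
```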
arXiv Detail & Related papers (2023-03-23T15:12:52Z) - The Implicit Delta Method [61.36121543728134]
In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss in order to quantify uncertainty in the downstream evaluation.
We show that the change in the evaluation due to regularization is consistent for the variance of the evaluation estimator, even when the infinitesimal change is approximated by a finite difference.
arXiv Detail & Related papers (2022-11-11T19:34:17Z) - On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data.
Invariance measures consistency of model predictions on transformations of the data.
From a dataset-centric view, we find that a given model's accuracy and invariance are linearly correlated across different test sets.
arXiv Detail & Related papers (2022-07-14T17:08:25Z) - A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate with several f-divergences, their ranking of close models can vary considerably.
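For reference, FID compares Gaussians fitted to feature representations of real and generated samples; the minimal sketch below uses random placeholder features in place of Inception activations.

```python
# Minimal FID sketch on pre-extracted feature vectors (random placeholders here
# instead of Inception activations).
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    cov_mean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_mean):              # drop tiny imaginary residue
        cov_mean = cov_mean.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * cov_mean))

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(1000, 64))
fake_feats = rng.normal(0.1, 1.1, size=(1000, 64))
print(f"FID: {fid(real_feats, fake_feats):.3f}")
```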
arXiv Detail & Related papers (2022-06-22T09:27:31Z) - Nonparametric Conditional Local Independence Testing [69.31200003384122]
Conditional local independence is an independence relation among continuous time processes.
No nonparametric test of conditional local independence has been available.
We propose such a nonparametric test based on double machine learning.
arXiv Detail & Related papers (2022-03-25T10:31:02Z) - A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
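A minimal sketch of the resampling idea, bootstrapping a confidence interval on how closely an automatic metric tracks human judgments; the scores are random stand-ins and this is not the paper's exact protocol.

```python
# Bootstrap CI sketch for metric reliability; scores are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n = 200
human = rng.normal(size=n)                              # human judgments
metric = 0.5 * human + rng.normal(scale=1.0, size=n)    # noisy automatic metric

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

B = 10_000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)             # resample examples with replacement
    boot[b] = pearson(human[idx], metric[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"correlation {pearson(human, metric):.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```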
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - Evaluation metrics for behaviour modeling [2.616915680939834]
We propose and investigate metrics for evaluating and comparing generative models of behavior learned using imitation learning.
These criteria look at longer temporal relationships in behavior, are relevant if behavior has some properties that are inherently unpredictable, and highlight biases in the overall distribution of behaviors produced by the model.
We show that the proposed metrics correspond with biologists' intuition about behavior, allow us to evaluate models and understand their biases, and suggest new research directions.
arXiv Detail & Related papers (2020-07-23T23:47:24Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.