A critical analysis of metrics used for measuring progress in artificial intelligence
- URL: http://arxiv.org/abs/2008.02577v2
- Date: Mon, 8 Nov 2021 14:38:58 GMT
- Title: A critical analysis of metrics used for measuring progress in artificial intelligence
- Authors: Kathrin Blagec, Georg Dorffner, Milad Moradi, Matthias Samwald
- Abstract summary: We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
- Score: 9.387811897655016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comparing model performances on benchmark datasets is an integral part of
measuring and driving progress in artificial intelligence. A model's
performance on a benchmark dataset is commonly assessed based on a single or a
small set of performance metrics. While this enables quick comparisons, it may
entail the risk of inadequately reflecting model performance if the metric does
not sufficiently cover all performance characteristics. It is unknown to what
extent this might impact benchmarking efforts.
To address this question, we analysed the current landscape of performance
metrics based on data covering 3867 machine learning model performance results
from the open repository 'Papers with Code'. Our results suggest that the large
majority of metrics currently used have properties that may result in an
inadequate reflection of a model's performance. While alternative metrics that
address problematic properties have been proposed, they are currently rarely
used.
Furthermore, we describe ambiguities in reported metrics, which may lead to
difficulties in interpreting and comparing model performances.
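The issues the abstract raises can be made concrete with standard library metrics. The following minimal sketch uses assumed toy data (not data from the paper) to illustrate two of the discussed problems: accuracy overstating performance on a class-imbalanced dataset compared with balanced accuracy or the Matthews correlation coefficient, and "F1 score" being ambiguous when the averaging scheme (micro vs. macro) is not reported.

```python
# Minimal sketch with assumed toy data (not from the paper): two common metric pitfalls.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

# 1) Accuracy on imbalanced data: always predicting the majority class
#    yields 95% accuracy even though the classifier has learned nothing.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)                  # constant majority-class prediction
print(accuracy_score(y_true, y_pred))           # 0.95
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 (chance level)
print(matthews_corrcoef(y_true, y_pred))        # 0.0 (no correlation)

# 2) Ambiguity: for a multi-class problem, "F1 score" depends on the
#    averaging scheme, which is often left unreported.
y_true_mc = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 0, 0])
print(f1_score(y_true_mc, y_pred_mc, average="micro"))  # one number ...
print(f1_score(y_true_mc, y_pred_mc, average="macro"))  # ... a different number
```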
Related papers
- Test-time Assessment of a Model's Performance on Unseen Domains via Optimal Transport [8.425690424016986]
Gauging the performance of ML models on data from unseen domains at test-time is essential.
It is essential to develop metrics that can provide insights into the model's performance at test time.
We propose a metric based on Optimal Transport that is highly correlated with the model's performance on unseen domains.
arXiv Detail & Related papers (2024-05-02T16:35:07Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate with several f-divergences, their ranking of close models can vary considerably.
arXiv Detail & Related papers (2022-06-22T09:27:31Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- A global analysis of metrics used for measuring performance in natural language processing [9.433496814327086]
We provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing.
Results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance.
arXiv Detail & Related papers (2022-04-25T11:41:50Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
- Interpretable Meta-Measure for Model Performance [4.91155110560629]
We introduce a new meta-score assessment named Elo-based Predictive Power (EPP).
EPP is built on top of other performance measures and allows for interpretable comparisons of models.
We prove the mathematical properties of EPP and support them with empirical results of a large-scale benchmark on 30 classification data sets and a real-world benchmark for visual data.
arXiv Detail & Related papers (2020-06-02T14:10:13Z)
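The EPP entry above names a concrete mechanism, an Elo-based meta-score built on top of existing performance measures. The sketch below is only a rough illustration of that general idea under stated assumptions: it applies a standard Elo update to hypothetical pairwise "model A beats model B on a dataset" outcomes derived from some base metric; the function names, K-factor, and data are illustrative and not the paper's EPP definition.

```python
# Rough illustration (assumptions, not the EPP formulation from the paper):
# rate models with a standard Elo update over pairwise wins on benchmark datasets.
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_ratings(results, k: float = 32.0, start: float = 1000.0) -> dict:
    """results: iterable of (model_a, model_b, score_a) with score_a in {0, 0.5, 1},
    e.g. 1.0 if model_a outperforms model_b on some dataset under a base metric."""
    ratings = defaultdict(lambda: start)
    for a, b, s_a in results:
        e_a = expected_score(ratings[a], ratings[b])
        ratings[a] += k * (s_a - e_a)   # winner gains what ...
        ratings[b] += k * (e_a - s_a)   # ... the loser gives up (symmetric update)
    return dict(ratings)

# Hypothetical pairwise outcomes from comparing, say, accuracy on three datasets.
games = [("model_x", "model_y", 1.0),
         ("model_x", "model_z", 0.5),
         ("model_y", "model_z", 0.0)]
print(elo_ratings(games))
```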
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.