Interpretable Meta-Measure for Model Performance
- URL: http://arxiv.org/abs/2006.02293v2
- Date: Thu, 22 Sep 2022 15:17:16 GMT
- Title: Interpretable Meta-Measure for Model Performance
- Authors: Alicja Gosiewska, Katarzyna Woźnica, Przemysław Biecek
- Abstract summary: We introduce a new meta-score assessment named Elo-based Predictive Power (EPP).
EPP is built on top of other performance measures and allows for interpretable comparisons of models.
We prove the mathematical properties of EPP and support them with empirical results from a large-scale benchmark on 30 classification data sets and a real-world benchmark for visual data.
- Score: 4.91155110560629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benchmarks for the evaluation of model performance play an important role in
machine learning. However, there is no established way to describe and create
new benchmarks. What is more, the most common benchmarks use performance
measures that share several limitations. For example, the difference in
performance for two models has no probabilistic interpretation, there is no
reference point to indicate whether they represent a significant improvement,
and it makes no sense to compare such differences between data sets. We
introduce a new meta-score assessment named Elo-based Predictive Power (EPP)
that is built on top of other performance measures and allows for interpretable
comparisons of models. The differences in EPP scores have a probabilistic
interpretation and can be directly compared between data sets. Furthermore, the
logistic regression-based design allows for an assessment of ranking fitness
based on a deviance statistic. We prove the mathematical properties of EPP and
support them with empirical results from a large-scale benchmark on 30
classification data sets and a real-world benchmark for visual data.
Additionally, we propose a Unified Benchmark Ontology that is used to give a
uniform description of benchmarks.
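
The probabilistic interpretation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual Elo / logistic-regression reading, in which the log-odds that one model outperforms another equal the difference of their scores. The function names below are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def win_probability(epp_a: float, epp_b: float) -> float:
    """Probability that model A beats model B.

    Assumption: logit(p) = epp_a - epp_b, i.e. an Elo-style logistic link.
    """
    return 1.0 / (1.0 + math.exp(-(epp_a - epp_b)))


def score_difference_from_wins(wins_a: int, wins_b: int) -> float:
    """Score difference implied by head-to-head results of two models.

    For exactly two models under the logistic link above, the
    maximum-likelihood difference is the log-odds of the empirical win rate.
    Both counts must be positive, otherwise the estimate is unbounded.
    """
    if wins_a <= 0 or wins_b <= 0:
        raise ValueError("both models need at least one win")
    return math.log(wins_a / wins_b)


if __name__ == "__main__":
    # Hypothetical example: model A won 30 of 40 comparisons against model B.
    diff = score_difference_from_wins(30, 10)
    print(f"score difference: {diff:.3f}")
    print(f"implied win probability: {win_probability(diff, 0.0):.3f}")
```

Under this assumption only score differences matter: equal scores give a win probability of 0.5, and the same difference implies the same win probability on any data set.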
Related papers
- Estimating Model Performance Under Covariate Shift Without Labels [9.804680621164168]
We introduce Probabilistic Adaptive Performance Estimation (PAPE) for evaluating classification models on unlabeled data.
PAPE provides more accurate performance estimates than other evaluated methodologies.
arXiv Detail & Related papers (2024-01-16T13:29:30Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Automatic Pharma News Categorization [0.0]
We use a text dataset consisting of 23 news categories relevant to pharma information science.
We compare the fine-tuning performance of multiple transformer models in a classification task.
We propose an ensemble model consisting of the top performing individual predictors.
arXiv Detail & Related papers (2021-12-28T08:42:16Z)
- How not to Lie with a Benchmark: Rearranging NLP Leaderboards [0.0]
We examine popular NLP benchmarks' overall scoring methods and rearrange the models by geometric and harmonic mean.
We analyze several popular benchmarks including GLUE, SuperGLUE, XGLUE, and XTREME.
arXiv Detail & Related papers (2021-12-02T15:40:52Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Preference Modeling with Context-Dependent Salient Features [12.403492796441434]
We consider the problem of estimating a ranking on a set of items from noisy pairwise comparisons given item features.
Our key observation is that two items compared in isolation from other items may be compared based on only a salient subset of features.
arXiv Detail & Related papers (2020-02-22T04:05:16Z)
- On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods [27.27230441498167]
We take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment.
In particular, we demonstrate that all existing scores can hardly be used to compare results across different datasets.
We show that this leads to various problems in the interpretation of results, which may support misleading conclusions.
arXiv Detail & Related papers (2020-02-17T12:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.