Does the evaluation stand up to evaluation? A first-principle approach
to the evaluation of classifiers
- URL: http://arxiv.org/abs/2302.12006v1
- Date: Tue, 21 Feb 2023 09:55:19 GMT
- Title: Does the evaluation stand up to evaluation? A first-principle approach
to the evaluation of classifiers
- Authors: K. Dyrland, A. S. Lundervold, P.G.L. Porta Mana
- Abstract summary: It is shown that popular metrics such as precision, balanced accuracy, Matthews Correlation Coefficient, Fowlkes-Mallows index, F1-measure, and Area Under the Curve are never optimal: they always give rise to an avoidable fraction of incorrect evaluations.
This fraction is even larger than would be caused by the use of a decision-theoretic metric with moderately wrong coefficients.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can one meaningfully make a measurement, if the meter does not conform to
any standard and its scale expands or shrinks depending on what is measured? In
the present work it is argued that current evaluation practices for
machine-learning classifiers are affected by this kind of problem, leading to
negative consequences when classifiers are put to real use; consequences that
could have been avoided. It is proposed that evaluation be grounded in Decision
Theory, and the implications of such a foundation are explored. The main result
is that every evaluation metric must be a linear combination of
confusion-matrix elements, with coefficients - "utilities" - that depend on the
specific classification problem. For binary classification, the space of such
possible metrics is effectively two-dimensional. It is shown that popular
metrics such as precision, balanced accuracy, Matthews Correlation Coefficient,
Fowlkes-Mallows index, F1-measure, and Area Under the Curve are never optimal:
they always give rise to an in-principle avoidable fraction of incorrect
evaluations. This fraction is even larger than would be caused by the use of a
decision-theoretic metric with moderately wrong coefficients.
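To make the abstract's central claim concrete, here is a minimal sketch of a decision-theoretic metric: a linear combination of confusion-matrix counts weighted by problem-specific utilities. The confusion matrices and utility values below are illustrative assumptions, not taken from the paper; the point is only that positive rescaling and constant shifts of the utilities never change which classifier is preferred, which is the dimension-counting argument behind the "effectively two-dimensional" space of binary metrics.

```python
def utility_metric(cm, utilities):
    """Expected utility per instance: a linear combination of
    confusion-matrix counts weighted by problem-specific utilities."""
    total = sum(cm.values())
    return sum(utilities[k] * cm[k] for k in cm) / total

# Two candidate classifiers summarised by (assumed) confusion matrices.
cm_a = {"TP": 80, "FP": 30, "FN": 20, "TN": 870}
cm_b = {"TP": 95, "FP": 120, "FN": 5, "TN": 780}

# Illustrative utilities: here a missed positive (FN) is far costlier
# than a false alarm (FP). These numbers are assumptions, not the paper's.
u = {"TP": 1.0, "FP": -0.2, "FN": -5.0, "TN": 0.0}

for name, cm in (("A", cm_a), ("B", cm_b)):
    print(name, round(utility_metric(cm, u), 3))

# Rescaling by a > 0 and shifting by a constant b leaves the preference
# between classifiers unchanged, so of the four coefficients only two
# effectively matter -- the two-dimensional space described in the abstract.
u_affine = {k: 2.0 * v + 3.0 for k, v in u.items()}
for name, cm in (("A", cm_a), ("B", cm_b)):
    print(name, round(utility_metric(cm, u_affine), 3))
```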
Related papers
- Significativity Indices for Agreement Values [0.0]
Agreement measures, such as Cohen's kappa or intraclass correlation, gauge the matching between two or more classifiers.
Some quality scales have been proposed in the literature for Cohen's kappa, but they are mainly naive, and their boundaries are arbitrary.
This work proposes a general approach to evaluate the significativity of any agreement value between two classifiers.
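As a concrete reference point for the agreement values discussed above, the sketch below computes Cohen's kappa for two classifiers' label assignments using its standard definition (observed agreement corrected for chance agreement); the labels are made up.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each classifier's marginal label frequencies."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    classes = np.union1d(labels_a, labels_b)
    p_observed = np.mean(labels_a == labels_b)
    p_chance = sum(np.mean(labels_a == c) * np.mean(labels_b == c) for c in classes)
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical predictions from two classifiers on ten items.
clf1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
clf2 = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(round(cohens_kappa(clf1, clf2), 3))  # raw agreement 0.8, kappa ~0.58
```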
arXiv Detail & Related papers (2025-04-21T09:47:53Z) - A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice [6.091702876917282]
Classification systems are evaluated in a countless number of papers.
However, we find that evaluation practice is often nebulous.
Many works use so-called 'macro' metrics to rank systems but do not clearly specify what they would expect from such a metric.
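A small illustration of the macro/micro distinction mentioned above, on made-up imbalanced labels and assuming scikit-learn's standard f1_score: micro-averaged F1 pools counts over all classes, while macro-averaged F1 weights every class equally, so a poorly handled rare class drags it down.

```python
from sklearn.metrics import f1_score

# Made-up imbalanced labels: class 0 dominates, class 1 is rare and mostly missed.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 8 + [1] * 2

# Micro-F1 pools counts over classes (here it equals plain accuracy);
# macro-F1 averages per-class F1 with equal weight, so the rare class drags it down.
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```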
arXiv Detail & Related papers (2024-04-25T18:12:43Z) - $F_\beta$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
It is possible to indicate for a given pool of analyzed classifiers when a given model should be preferred depending on user requirements.
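As a reminder of what the $F_\beta$ family trades off, the sketch below evaluates two hypothetical operating points at several $\beta$ values using the standard definition $F_\beta = (1+\beta^2)PR/(\beta^2 P + R)$; the precision/recall numbers are invented, and the preferred model flips as $\beta$ shifts weight from precision toward recall.

```python
def f_beta(precision, recall, beta):
    """Standard F_beta: weights recall beta times as much as precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical operating points: A is precise, B has high recall.
A = (0.90, 0.60)  # (precision, recall)
B = (0.65, 0.90)

for beta in (0.5, 1.0, 2.0):
    fa, fb = f_beta(*A, beta), f_beta(*B, beta)
    print(f"beta={beta}: F(A)={fa:.3f}  F(B)={fb:.3f}  preferred: {'A' if fa > fb else 'B'}")
```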
arXiv Detail & Related papers (2024-04-11T18:07:57Z) - Revisiting Evaluation Metrics for Semantic Segmentation: Optimization
and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
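For orientation, this is the standard (coarse) mIoU computation from a segmentation confusion matrix, together with a worst-class summary; it is not the paper's fine-grained variant, and the matrix below is a made-up example in which a well-segmented background class pulls the mean well above the weaker classes.

```python
import numpy as np

def per_class_iou(conf):
    """IoU per class from a confusion matrix where conf[i, j] counts
    pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / (tp + fp + fn)

# Made-up 3-class matrix with a huge background class (class 0).
conf = np.array([
    [9500, 300, 200],
    [ 400, 500, 100],
    [ 300, 100, 600],
])
iou = per_class_iou(conf)
print("per-class IoU:", np.round(iou, 3))
print("mIoU:         ", round(iou.mean(), 3))
print("worst-class:  ", round(iou.min(), 3))
```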
arXiv Detail & Related papers (2023-10-30T03:45:15Z) - Monotonicity and Double Descent in Uncertainty Estimation with Gaussian
Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions.
We prove that by tuning hyperparameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension.
We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z) - Optimizing Partial Area Under the Top-k Curve: Theory and Practice [151.5072746015253]
We develop a novel metric named partial Area Under the top-k Curve (AUTKC).
AUTKC has a better discrimination ability, and its Bayes optimal score function could give a correct top-K ranking with respect to the conditional probability.
We present an empirical surrogate risk minimization framework to optimize the proposed metric.
arXiv Detail & Related papers (2022-09-03T11:09:13Z) - Benign Overfitting in Adversarially Robust Linear Classification [91.42259226639837]
"Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community.
We show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples.
arXiv Detail & Related papers (2021-12-31T00:27:31Z) - Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the minimum variance unbiased estimator (MVUE) among linear estimators in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for BCE arises in applications where multiple estimates of the same unknown are averaged for improved performance.
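For context on the Gauss-Markov statement above, here is a small numerical sketch: a weighted least squares fit under heteroscedastic noise, plus the averaging-of-estimates scenario given as the second motivation. All data are simulated and the setup is illustrative, not the paper's BCE method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated linear model y = X @ theta + noise with known heteroscedastic variances.
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
noise_var = rng.uniform(0.5, 2.0, size=n)
y = X @ theta_true + rng.normal(scale=np.sqrt(noise_var))

# Weighted least squares with weights equal to inverse noise variances.
W = np.diag(1.0 / noise_var)
theta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print("WLS estimate:     ", np.round(theta_wls, 3))

# Averaging several independent unbiased estimates of the same unknown
# stays unbiased while shrinking the variance -- the second motivation above.
estimates = theta_true + rng.normal(scale=0.3, size=(10, d))
print("averaged estimate:", np.round(estimates.mean(axis=0), 3))
```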
arXiv Detail & Related papers (2021-10-24T10:23:51Z) - Classifier uncertainty: evidence, potential impact, and probabilistic
treatment [0.0]
We present an approach to quantify the uncertainty of classification performance metrics based on a probability model of the confusion matrix.
We show that uncertainties can be surprisingly large and limit performance evaluation.
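One simple probabilistic treatment in this spirit (not necessarily the paper's exact model) is to place a Dirichlet posterior over the four cell probabilities of a binary confusion matrix and push posterior samples through the metrics of interest; the counts and flat prior below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Observed confusion-matrix counts (illustrative): TP, FP, FN, TN.
counts = np.array([40, 10, 15, 435])

# Dirichlet posterior over the four cell probabilities with a flat prior,
# then propagate the samples through accuracy and precision.
samples = rng.dirichlet(counts + 1.0, size=20000)
tp, fp, fn, tn = samples.T
accuracy = tp + tn
precision = tp / (tp + fp)

for name, m in (("accuracy", accuracy), ("precision", precision)):
    lo, hi = np.percentile(m, [2.5, 97.5])
    print(f"{name}: mean={m.mean():.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```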
arXiv Detail & Related papers (2020-06-19T12:49:19Z) - An Effectiveness Metric for Ordinal Classification: Formal Properties
and Experimental Results [9.602361044877426]
We propose a new metric for Ordinal Classification, the Closeness Evaluation Measure, rooted in Measurement Theory and Information Theory.
Our theoretical analysis and experimental results over both synthetic data and data from NLP shared tasks indicate that the proposed metric captures quality aspects from different traditional tasks simultaneously.
arXiv Detail & Related papers (2020-06-01T20:35:46Z) - Fractional norms and quasinorms do not help to overcome the curse of
dimensionality [62.997667081978825]
It is often claimed that use of the Manhattan distance and even fractional quasinorms $l_p$ (for $p$ less than 1) can help to overcome the curse of dimensionality in classification problems.
A systematic comparison shows that the difference in performance of kNN based on $l_p$ for $p = 2$, $1$, and $0.5$ is statistically insignificant.
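To ground the $l_p$ comparison, the sketch below computes the (quasi)norm distances by hand for $p = 2$, $1$, and $0.5$ and reads off a 1-NN decision on simulated data; the data and dimensionality are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def lp_distances(A, b, p):
    """Distances ||a - b||_p for every row a of A; for p < 1 this is only a quasinorm."""
    return np.sum(np.abs(A - b) ** p, axis=1) ** (1.0 / p)

# Simulated high-dimensional two-class data and one query point.
d = 50
X = np.vstack([rng.normal(loc=0.0, size=(100, d)),
               rng.normal(loc=0.3, size=(100, d))])
y = np.array([0] * 100 + [1] * 100)
query = rng.normal(loc=0.3, size=d)

# 1-NN decision under different p values; the decisions typically coincide,
# in line with the statistically insignificant differences reported above.
for p in (2.0, 1.0, 0.5):
    nearest = int(np.argmin(lp_distances(X, query, p)))
    print(f"p={p}: nearest neighbour has class {y[nearest]}")
```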
arXiv Detail & Related papers (2020-04-29T14:30:12Z) - On Model Evaluation under Non-constant Class Imbalance [0.0]
Many real-world classification problems are significantly class-imbalanced, to the detriment of the class of interest.
The usual assumption is that the test dataset imbalance equals the real-world imbalance.
We introduce methods focusing on evaluation under non-constant class imbalance.
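The dependence on imbalance can be made explicit with a tiny calculation: class-conditional rates (TPR, FPR) stay fixed, but prevalence-dependent metrics such as precision move sharply when the deployment prevalence departs from the test-set imbalance. The rates below are assumed values, not from the paper.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed class-conditional rates when the
    positive-class prevalence changes at deployment time."""
    tp = prevalence * tpr
    fp = (1.0 - prevalence) * fpr
    return tp / (tp + fp)

# Assumed classifier rates, e.g. measured on a balanced test set.
tpr, fpr = 0.85, 0.05

for prev in (0.5, 0.1, 0.01):
    print(f"prevalence={prev:.2f}: precision={precision_at_prevalence(tpr, fpr, prev):.3f}")
```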
arXiv Detail & Related papers (2020-01-15T21:52:24Z)