Absolute Evaluation Measures for Machine Learning: A Survey
        - URL: http://arxiv.org/abs/2507.03392v1
 - Date: Fri, 04 Jul 2025 08:53:08 GMT
 - Title: Absolute Evaluation Measures for Machine Learning: A Survey
 - Authors: Silvia Beddar-Wiesing, Alice Moallemy-Oureh, Marie Kempkes, Josephine M. Thomas
 - Abstract summary: This survey provides an overview of absolute evaluation metrics in Machine Learning. It is organized by the type of learning problem and covers clustering, regression, and ranking metrics. It aims to equip practitioners with the tools necessary to select appropriate metrics for their models.
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   Machine Learning is a diverse field applied across various domains such as computer science, social sciences, medicine, chemistry, and finance. This diversity results in varied evaluation approaches, making it difficult to compare models effectively. Absolute evaluation measures offer a practical solution by assessing a model's performance on a fixed scale, independent of reference models and data ranges, enabling explicit comparisons. However, many commonly used measures are not universally applicable, leading to a lack of comprehensive guidance on their appropriate use. This survey addresses this gap by providing an overview of absolute evaluation metrics in ML, organized by the type of learning problem. While classification metrics have been extensively studied, this work also covers clustering, regression, and ranking metrics. By grouping these measures according to the specific ML challenges they address, this survey aims to equip practitioners with the tools necessary to select appropriate metrics for their models. The provided overview thus improves individual model evaluation and facilitates meaningful comparisons across different models and applications. 
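As a rough, assumed illustration of what "absolute" means here (this sketch is ours, not code from the survey), the snippet below computes one commonly used measure with a fixed or bounded scale for each problem type the survey covers: classification, clustering, regression, and ranking. The specific metric choices are an assumption for illustration.

```python
# Minimal assumed sketch: absolute evaluation measures reported on fixed or
# bounded scales for four problem types, using scikit-learn.
import numpy as np
from sklearn.metrics import f1_score, silhouette_score, r2_score, ndcg_score

rng = np.random.default_rng(0)

# Classification: F1 lies in [0, 1] regardless of dataset size or label encoding.
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)
print("F1:", f1_score(y_true, y_pred))

# Clustering: the silhouette score lies in [-1, 1], independent of feature ranges.
X = rng.normal(size=(100, 2))
labels = rng.integers(0, 3, size=100)
print("Silhouette:", silhouette_score(X, labels))

# Regression: R^2 is bounded above by 1, with 1 meaning a perfect fit.
y = rng.normal(size=100)
y_hat = y + rng.normal(scale=0.5, size=100)
print("R^2:", r2_score(y, y_hat))

# Ranking: NDCG lies in [0, 1] and compares a predicted ordering to graded relevance.
relevance = np.asarray([[3, 2, 3, 0, 1, 2]])
scores = np.asarray([[0.9, 0.8, 0.1, 0.2, 0.7, 0.6]])
print("NDCG:", ndcg_score(relevance, scores))
```

Because each value lives on a known scale, such numbers can be read directly and compared across models and datasets, which is the practical appeal the abstract describes.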
 
       
      
        Related papers
        - Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks (arXiv, 2025-01-08)
          We show how statistical methodology can be used to quantify uncertainty in metrics that have been aggregated across multiple tasks. These techniques reveal insights such as the dominance of a specific model on certain types of tasks despite poor overall performance.
        - Benchmark for Evaluation and Analysis of Citation Recommendation Models (arXiv, 2024-12-10)
          We develop a benchmark specifically designed to analyze and compare citation recommendation models. The benchmark evaluates model performance on different features of the citation context, enabling meaningful comparisons and helping to identify promising approaches for further research and development in the field.
        - Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels (arXiv, 2024-12-09)
          Large multimodal models (LMMs) are increasingly deployed across diverse applications. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics. We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
        - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs (arXiv, 2024-11-22)
          Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
        - Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification (arXiv, 2024-08-19)
          This study investigates how consistently different metrics evaluate models across data of different prevalence. I find that evaluation metrics that are less influenced by prevalence offer more consistent evaluation of individual models and more consistent ranking of a set of models. (A minimal sketch of this prevalence effect appears below the list.)
        - Quantifying Variance in Evaluation Benchmarks (arXiv, 2024-06-14)
          We measure variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training. We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance at smaller scales. More involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
        - Don't Make Your LLM an Evaluation Benchmark Cheater (arXiv, 2023-11-03)
          Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity. To assess model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risks and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
        - Learning Evaluation Models from Large Language Models for Sequence Generation (arXiv, 2023-08-08)
          We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
        - Variable Importance Matching for Causal Inference (arXiv, 2023-02-23)
          We describe a general framework called Model-to-Match that achieves these goals. Model-to-Match uses variable importance measurements to construct a distance metric. We operationalize the Model-to-Match framework with LASSO.
        - In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation (arXiv, 2023-02-06)
          This paper empirically investigates the strengths and weaknesses of different model selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators, and the data used for comparing them.
        - On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods (arXiv, 2020-02-17)
          We take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment. In particular, we demonstrate that all existing scores can hardly be used to compare results across different datasets. We show that this leads to various problems in the interpretation of results, which may support misleading conclusions.
        This list is automatically generated from the titles and abstracts of the papers on this site.
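The prevalence point made in the "Area under the ROC Curve" entry above can be illustrated with a small assumed sketch (ours, not code from that paper): one fixed classifier is scored on test sets with different positive-class prevalence, and a prevalence-sensitive measure such as F1 shifts noticeably while ROC AUC stays comparatively stable. The data-generating setup and metric pairing are assumptions for illustration.

```python
# Minimal assumed sketch: evaluate one fixed classifier on test sets with
# different positive-class prevalence and compare how much a prevalence-
# sensitive measure (F1) moves versus ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(42)

def sample(prevalence, n=5000):
    """Two shifted 1-D Gaussians; `prevalence` is the positive-class rate."""
    y = (rng.random(n) < prevalence).astype(int)
    X = rng.normal(loc=1.5 * y, scale=1.0).reshape(-1, 1)
    return X, y

# Train once on a balanced sample, then evaluate under varying prevalence.
X_train, y_train = sample(0.5)
model = LogisticRegression().fit(X_train, y_train)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    X_test, y_test = sample(p)
    probs = model.predict_proba(X_test)[:, 1]
    f1 = f1_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, probs)
    print(f"prevalence={p:.1f}  F1={f1:.3f}  AUC={auc:.3f}")
```

In this toy setup F1 rises sharply as the positive class becomes more common, while AUC remains close to its balanced-data value, which mirrors the consistency argument summarized in that entry.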