How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation
- URL: http://arxiv.org/abs/2510.06448v1
- Date: Tue, 07 Oct 2025 20:38:12 GMT
- Title: How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation
- Authors: Prabhant Singh, Sibylle Hess, Joaquin Vanschoren
- Abstract summary: Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed.
- Score: 11.33816414982401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
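As context for the abstract's claim, the usual benchmark protocol scores every candidate pre-trained model with the transferability metric on the target data and reports a rank correlation (commonly weighted Kendall's tau) against the accuracies obtained by actually fine-tuning each model. The sketch below illustrates that protocol alongside the kind of dataset-agnostic heuristic the authors warn about; all model names, numbers, and the `imagenet_accuracy` baseline are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch of the common evaluation protocol for transferability metrics:
# rank models by the metric, then compare that ranking against fine-tuned accuracy.
# All values below are made up for illustration.
import numpy as np
from scipy.stats import weightedtau

models = ["resnet50", "densenet121", "vit_b16", "mobilenet_v3"]

# Ground truth: target-task accuracy after fully fine-tuning each model.
finetuned_accuracy = {"resnet50": 0.81, "densenet121": 0.83,
                      "vit_b16": 0.88, "mobilenet_v3": 0.74}

# Scores from some transferability estimation metric computed on frozen features
# of the target data (no fine-tuning, no access to the source dataset).
metric_scores = {"resnet50": 1.12, "densenet121": 1.20,
                 "vit_b16": 1.35, "mobilenet_v3": 0.95}

# A dataset-agnostic heuristic: ignore the target task and rank models by their
# published ImageNet accuracy, i.e. exploit a static performance hierarchy.
imagenet_accuracy = {"resnet50": 0.761, "densenet121": 0.748,
                     "vit_b16": 0.810, "mobilenet_v3": 0.742}

def rank_correlation(candidate_scores):
    """Weighted Kendall's tau between a candidate ranking and fine-tuned accuracy."""
    gt = np.array([finetuned_accuracy[m] for m in models])
    pred = np.array([candidate_scores[m] for m in models])
    return weightedtau(gt, pred).correlation

print("transferability metric tau  =", rank_correlation(metric_scores))
print("dataset-agnostic baseline tau =", rank_correlation(imagenet_accuracy))
```

If the benchmark's model pool has an essentially fixed winner across target tasks, the heuristic's correlation will match or exceed the metric's, which is exactly the inflation the paper attributes to unrealistic model spaces and static performance hierarchies.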
Related papers
- Reranking-based Generation for Unbiased Perspective Summarization [10.71668103641552]
We develop a test set for benchmarking metric reliability using human annotations. We show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
arXiv Detail & Related papers (2025-06-19T00:01:43Z)
- Position: All Current Generative Fidelity and Diversity Metrics are Flawed [58.815519650465774]
We show that all current generative fidelity and diversity metrics are flawed. Our aim is to convince the research community to spend more effort in developing metrics, instead of models.
arXiv Detail & Related papers (2025-05-28T15:10:33Z)
- Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
- A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z)