Related papers: How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

URL: http://arxiv.org/abs/2106.05532v1
Date: Thu, 10 Jun 2021 06:47:35 GMT
Title: How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation
Authors: Swaroop Mishra, Anjana Arunkumar
Abstract summary: Models that top leaderboards often perform unsatisfactorily when deployed in real world applications. We introduce a task-agnostic method to probe leaderboards by weighting samples based on their difficulty level. We find that leaderboards can be adversarially attacked and top performing models may not always be the best models.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Models that top leaderboards often perform unsatisfactorily when deployed in real world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their `difficulty' level. We find that leaderboards can be adversarially attacked and top performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance -- thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.

Related papers

GRAM: A Generative Foundation Reward Model for Reward Generalization [48.63394690265176]
We develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning.<n>This model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning.
arXiv Detail & Related papers (2025-06-17T04:34:27Z)
Large Language Models Often Know When They Are Being Evaluated [0.015534429177540245]
We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment.<n>We construct a benchmark of 1,000 prompts and transcripts from 61 distinct datasets.<n>Our results indicate that frontier models already exhibit a substantial, though not yet, level of evaluation-awareness.
arXiv Detail & Related papers (2025-05-28T12:03:09Z)
Where is this coming from? Making groundedness count in the evaluation of Document VQA models [12.951716701565019]
We argue that common evaluation metrics do not account for the semantic and multimodal groundedness of a model's outputs. We propose a new evaluation methodology that accounts for the groundedness of predictions. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences.
arXiv Detail & Related papers (2025-03-24T20:14:46Z)
Model Hubs and Beyond: Analyzing Model Popularity, Performance, and Documentation [1.2888930658406668]
We evaluate a comprehensive set of 500 Sentiment Analysis models on Hugging Face. Our findings reveal that model popularity does not necessarily correlate with performance. About 88% of model authors overstate their models' performance in the model cards.
arXiv Detail & Related papers (2025-03-19T14:01:33Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings. We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z)
Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only. Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models. Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose? [11.451691772914055]
This paper presents the first investigation into task-level evaluation. We measure the accuracy of performance estimators in the few-shot setting. We examine the reasons for the failure of evaluators usually thought of as being robust.
arXiv Detail & Related papers (2023-07-06T02:31:38Z)
A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark. Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions. We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural Language Processing Leaderboards [5.919860270977038]
We argue that evaluation on a given test dataset is just one of many performance indications of the model. We propose a new paradigm of leaderboard systems that addresses these issues of current leaderboard system.
arXiv Detail & Related papers (2023-03-20T06:13:03Z)
MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels [3.5229503563299915]
This paper investigates the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances.
arXiv Detail & Related papers (2022-12-07T09:38:02Z)
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.