How Robust are Model Rankings: A Leaderboard Customization Approach for
Equitable Evaluation
- URL: http://arxiv.org/abs/2106.05532v1
- Date: Thu, 10 Jun 2021 06:47:35 GMT
- Title: How Robust are Model Rankings: A Leaderboard Customization Approach for
Equitable Evaluation
- Authors: Swaroop Mishra, Anjana Arunkumar
- Abstract summary: Models that top leaderboards often perform unsatisfactorily when deployed in real world applications.
We introduce a task-agnostic method to probe leaderboards by weighting samples based on their difficulty level.
We find that leaderboards can be adversarially attacked and top performing models may not always be the best models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models that top leaderboards often perform unsatisfactorily when deployed in
real world applications; this has necessitated rigorous and expensive
pre-deployment model testing. A hitherto unexplored facet of model performance
is: Are our leaderboards doing equitable evaluation? In this paper, we
introduce a task-agnostic method to probe leaderboards by weighting samples
based on their `difficulty' level. We find that leaderboards can be
adversarially attacked and top performing models may not always be the best
models. We subsequently propose alternate evaluation metrics. Our experiments
on 10 models show changes in model ranking and an overall reduction in
previously reported performance -- thus rectifying the overestimation of AI
systems' capabilities. Inspired by behavioral testing principles, we further
develop a prototype of a visual analytics tool that enables leaderboard
revamping through customization, based on an end user's focus area. This helps
users analyze models' strengths and weaknesses, and guides them in the
selection of a model best suited for their application scenario. In a user
study, members of various commercial product development teams, covering 5
focus areas, find that our prototype reduces pre-deployment development and
testing effort by 41% on average.
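As an illustration of the sample-weighting idea in the abstract, below is a minimal sketch (not the authors' released code; the per-sample difficulty values, model names, and predictions are hypothetical) of how a leaderboard could be re-ranked by difficulty-weighted accuracy instead of plain accuracy.

```python
# Minimal sketch of difficulty-weighted leaderboard scoring (not the authors'
# released code; difficulty scores, model names, and predictions below are
# hypothetical). Each sample counts in proportion to an assumed per-sample
# 'difficulty' value, and models are re-ranked by the weighted accuracy.
from typing import Dict, List, Tuple


def weighted_accuracy(correct: List[bool], difficulty: List[float]) -> float:
    """Accuracy in which each sample is weighted by its difficulty."""
    total = sum(difficulty)
    if total == 0:
        return 0.0
    return sum(d for c, d in zip(correct, difficulty) if c) / total


def rerank(models: Dict[str, List[bool]],
           difficulty: List[float]) -> List[Tuple[str, float]]:
    """Sort models by difficulty-weighted accuracy, highest first."""
    scores = {name: weighted_accuracy(preds, difficulty)
              for name, preds in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example: model_B answers more samples correctly overall, but
# model_A answers the hard ones, so the two metrics disagree on the ranking.
difficulty = [0.1, 0.2, 0.9, 0.8]            # assumed per-sample difficulty
models = {
    "model_A": [False, False, True, True],   # plain accuracy 0.50, weighted 0.85
    "model_B": [True, True, True, False],    # plain accuracy 0.75, weighted 0.60
}
print(rerank(models, difficulty))            # model_A now ranks above model_B
```

This is only one possible weighting scheme; the paper additionally proposes alternate evaluation metrics and a customizable leaderboard interface on top of this idea.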
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Towards Personalized Evaluation of Large Language Models with An
Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the relative performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit
for Purpose? [11.451691772914055]
This paper presents the first investigation into task-level evaluation.
We measure the accuracy of performance estimators in the few-shot setting.
We examine the reasons for the failure of evaluators usually thought of as being robust.
arXiv Detail & Related papers (2023-07-06T02:31:38Z) - A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z) - Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural
Language Processing Leaderboards [5.919860270977038]
We argue that evaluation on a given test dataset is just one of many indicators of a model's performance.
We propose a new paradigm of leaderboard systems that addresses these issues with current leaderboard systems.
arXiv Detail & Related papers (2023-03-20T06:13:03Z) - MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels [3.5229503563299915]
This paper investigates the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis.
Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances.
arXiv Detail & Related papers (2022-12-07T09:38:02Z) - Dynaboard: An Evaluation-As-A-Service Platform for Holistic
Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models (a minimal sketch of such an overstability check follows this list).
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
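The overstability check referenced in the essay-scoring entry above can be approximated with a short sketch; `score_essay` stands in for any AES model's scoring function, and the off-topic filler sentences are assumed to be supplied by the user (all names here are hypothetical, not from the cited paper).

```python
# Minimal sketch of an overstability probe for an AES model (assumptions:
# `score_essay` is any essay-scoring function; the off-topic filler sentences
# are supplied by the caller). Off-topic sentences amounting to roughly
# `fraction` of the essay's length are appended, and the change in predicted
# score is reported; a change near zero suggests the model is overstable.
import random
from typing import Callable, List


def inject_off_topic(essay: str, filler_sentences: List[str],
                     fraction: float = 0.25) -> str:
    """Append off-topic sentences amounting to roughly `fraction` of the essay."""
    sentences = [s for s in essay.split(". ") if s]
    n_extra = max(1, int(len(sentences) * fraction))
    extra = random.choices(filler_sentences, k=n_extra)
    return ". ".join(sentences + extra)


def overstability_gap(score_essay: Callable[[str], float], essay: str,
                      filler_sentences: List[str]) -> float:
    """Difference between the original and perturbed scores."""
    perturbed = inject_off_topic(essay, filler_sentences)
    return score_essay(essay) - score_essay(perturbed)
```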
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.