How Robust are Model Rankings: A Leaderboard Customization Approach for
Equitable Evaluation
- URL: http://arxiv.org/abs/2106.05532v1
- Date: Thu, 10 Jun 2021 06:47:35 GMT
- Title: How Robust are Model Rankings: A Leaderboard Customization Approach for
Equitable Evaluation
- Authors: Swaroop Mishra, Anjana Arunkumar
- Abstract summary: Models that top leaderboards often perform unsatisfactorily when deployed in real world applications.
We introduce a task-agnostic method to probe leaderboards by weighting samples based on their difficulty level.
We find that leaderboards can be adversarially attacked and top performing models may not always be the best models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Models that top leaderboards often perform unsatisfactorily when deployed in
real world applications; this has necessitated rigorous and expensive
pre-deployment model testing. A hitherto unexplored facet of model performance
is: Are our leaderboards doing equitable evaluation? In this paper, we
introduce a task-agnostic method to probe leaderboards by weighting samples
based on their `difficulty' level. We find that leaderboards can be
adversarially attacked and top performing models may not always be the best
models. We subsequently propose alternate evaluation metrics. Our experiments
on 10 models show changes in model ranking and an overall reduction in
previously reported performance -- thus rectifying the overestimation of AI
systems' capabilities. Inspired by behavioral testing principles, we further
develop a prototype of a visual analytics tool that enables leaderboard
revamping through customization, based on an end user's focus area. This helps
users analyze models' strengths and weaknesses, and guides them in the
selection of a model best suited for their application scenario. In a user
study, members of various commercial product development teams, covering 5
focus areas, find that our prototype reduces pre-deployment development and
testing effort by 41% on average.
Related papers
- Towards Personalized Evaluation of Large Language Models with An
Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
How to evaluate large vision-language models remains a major obstacle, hindering future model development.
Traditional benchmarks provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics.
Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias.
MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models.
arXiv Detail & Related papers (2023-07-12T16:23:09Z) - Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit
for Purpose? [11.451691772914055]
This paper presents the first investigation into task-level evaluation.
We measure the accuracy of performance estimators in the few-shot setting.
We examine the reasons for the failure of evaluators usually thought of as being robust.
arXiv Detail & Related papers (2023-07-06T02:31:38Z) - A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z) - Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural
Language Processing Leaderboards [5.919860270977038]
We argue that evaluation on a given test dataset is just one of many performance indications of the model.
We propose a new paradigm of leaderboard systems that addresses these issues of current leaderboard system.
arXiv Detail & Related papers (2023-03-20T06:13:03Z) - MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels [3.5229503563299915]
This paper investigates the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis.
Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances.
arXiv Detail & Related papers (2022-12-07T09:38:02Z) - Dynaboard: An Evaluation-As-A-Service Platform for Holistic
Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.