Dynaboard: An Evaluation-As-A-Service Platform for Holistic
Next-Generation Benchmarking
- URL: http://arxiv.org/abs/2106.06052v1
- Date: Fri, 21 May 2021 01:17:52 GMT
- Title: Dynaboard: An Evaluation-As-A-Service Platform for Holistic
Next-Generation Benchmarking
- Authors: Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu,
Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela
- Abstract summary: We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
- Score: 41.99715850562528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Dynaboard, an evaluation-as-a-service framework for hosting
benchmarks and conducting holistic model comparison, integrated with the
Dynabench platform. Our platform evaluates NLP models directly instead of
relying on self-reported metrics or predictions on a single dataset. Under this
paradigm, models are submitted to be evaluated in the cloud, circumventing the
issues of reproducibility, accessibility, and backwards compatibility that
often hinder benchmarking in NLP. This allows users to interact with uploaded
models in real time to assess their quality, and permits the collection of
additional metrics such as memory use, throughput, and robustness, which --
despite their importance to practitioners -- have traditionally been absent
from leaderboards. On each task, models are ranked according to the Dynascore,
a novel utility-based aggregation of these statistics, which users can
customize to better reflect their preferences, placing more/less weight on a
particular axis of evaluation or dataset. As state-of-the-art NLP models push
the limits of traditional benchmarks, Dynaboard offers a standardized solution
for a more diverse and comprehensive evaluation of model quality.
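The Dynascore mentioned above is a utility-based aggregation of several evaluation axes (accuracy plus efficiency- and robustness-style metrics) under user-adjustable weights. The snippet below is only a minimal sketch of that customizable weighted-scoring idea, with invented model names, metric values, and default weights; it is not the platform's implementation, and the paper's actual Dynascore performs a more careful utility-based conversion of each metric before aggregation.

```python
from typing import Dict, List

# Hypothetical per-model metrics; every axis is oriented so that higher is better
# (e.g. "memory_efficiency" rather than raw memory use). Values are invented.
MODELS: Dict[str, Dict[str, float]] = {
    "model_a": {"accuracy": 0.95, "throughput": 0.30, "memory_efficiency": 0.30, "robustness": 0.90},
    "model_b": {"accuracy": 0.80, "throughput": 0.90, "memory_efficiency": 0.90, "robustness": 0.75},
}

# Illustrative default weights; not the defaults used by the platform.
DEFAULT_WEIGHTS = {"accuracy": 0.7, "throughput": 0.1, "memory_efficiency": 0.05, "robustness": 0.15}


def dynascore_like(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the metric axes (a stand-in for the real Dynascore)."""
    total = sum(weights.values())
    return sum(weights[axis] * metrics[axis] for axis in weights) / total


def rank(models: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return model names sorted best-first under the chosen weighting."""
    return sorted(models, key=lambda name: dynascore_like(models[name], weights), reverse=True)


if __name__ == "__main__":
    print("default weighting :", rank(MODELS, DEFAULT_WEIGHTS))
    # A practitioner who cares mostly about deployment cost can re-weight the axes,
    # which can change the leaderboard order -- the point of Dynascore customization.
    efficiency_first = {"accuracy": 0.2, "throughput": 0.4, "memory_efficiency": 0.3, "robustness": 0.1}
    print("efficiency-first  :", rank(MODELS, efficiency_first))
```

With these made-up numbers, the default weighting favours model_a while the efficiency-first weighting favours model_b, illustrating the preference-dependent rankings the abstract describes.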
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis [14.526536510805755]
We present a comprehensive framework for predicting the effects of perturbations in single cells, designed to standardize benchmarking in this rapidly evolving field.
Our framework, PerturBench, includes a user-friendly platform, diverse datasets, metrics for fair model comparison, and detailed performance analysis.
arXiv Detail & Related papers (2024-08-20T07:40:20Z)
- An Optimism-based Approach to Online Evaluation of Generative Models [23.91197677628145]
We propose an online evaluation framework to find the generative model that maximizes a standard assessment score among a group of available models.
Specifically, we study the online assessment of generative models based on the Fréchet Inception Distance (FID) and Inception Score (IS) metrics. (A minimal optimism-style selection sketch appears after this list.)
arXiv Detail & Related papers (2024-06-11T16:57:48Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models using simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- Improving Label Quality by Jointly Modeling Items and Annotators [68.8204255655161]
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators.
Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic Dawid and Skene joint annotator-data model.
arXiv Detail & Related papers (2021-06-20T02:15:20Z)
- How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation [0.0]
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications.
We introduce a task-agnostic method to probe leaderboards by weighting samples based on their difficulty level. (A minimal difficulty-weighted re-ranking sketch appears after this list.)
We find that leaderboards can be adversarially attacked and top performing models may not always be the best models.
arXiv Detail & Related papers (2021-06-10T06:47:35Z)
- Dynabench: Rethinking Benchmarking in NLP [82.26699038776812]
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
arXiv Detail & Related papers (2021-04-07T17:49:17Z)
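The optimism-based online evaluation entry above frames model assessment as sequentially spending an evaluation budget to identify the best-scoring model among a group of candidates. As a rough illustration of that framing only, here is a generic upper-confidence-bound loop over candidate models; the noisy score oracle, the confidence constant, and the budget are all assumptions made for this sketch, and this is not the paper's FID/IS-specific algorithm.

```python
import math
import random
from typing import Callable, Dict, List


def optimistic_model_selection(
    models: List[str],
    sample_score: Callable[[str], float],  # noisy per-round score, higher is better (assumption)
    budget: int = 200,
    confidence: float = 2.0,
) -> str:
    """Generic UCB-style loop: evaluate each model a little, then keep spending the
    remaining budget on whichever model looks most promising under an optimistic bound."""
    counts: Dict[str, int] = {m: 0 for m in models}
    means: Dict[str, float] = {m: 0.0 for m in models}

    def pull(m: str) -> None:
        score = sample_score(m)
        counts[m] += 1
        means[m] += (score - means[m]) / counts[m]  # running-mean update

    for m in models:  # initialise every candidate once
        pull(m)
    for t in range(len(models), budget):
        ucb = {
            m: means[m] + confidence * math.sqrt(math.log(t + 1) / counts[m])
            for m in models
        }
        pull(max(ucb, key=ucb.get))  # optimism in the face of uncertainty

    return max(means, key=means.get)


if __name__ == "__main__":
    # Hypothetical generators whose true quality is only observable through noise.
    true_quality = {"gen_a": 0.55, "gen_b": 0.70, "gen_c": 0.62}
    noisy = lambda m: random.gauss(true_quality[m], 0.1)
    print("selected:", optimistic_model_selection(list(true_quality), noisy))
```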
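The leaderboard-customization entry above probes rankings by weighting test samples according to their difficulty. The sketch below shows the general flavour using a difficulty-weighted accuracy; the example predictions, difficulty scores, and weighting rule are invented for illustration and are not that paper's method.

```python
from typing import Dict, List

# Hypothetical per-example correctness (1 = correct) and difficulty scores in [0, 1].
# model_x aces the easy examples but fails a hard one; model_y is roughly the reverse.
PREDICTIONS: Dict[str, List[int]] = {
    "model_x": [1, 1, 1, 1, 1, 0],
    "model_y": [0, 1, 0, 1, 1, 1],
}
DIFFICULTY: List[float] = [0.1, 0.1, 0.2, 0.3, 0.9, 0.9]  # last examples are the hard ones


def weighted_accuracy(correct: List[int], weights: List[float]) -> float:
    """Accuracy where each example counts proportionally to its weight."""
    return sum(c * w for c, w in zip(correct, weights)) / sum(weights)


def leaderboard(predictions: Dict[str, List[int]], weights: List[float]) -> List[str]:
    """Best-first ranking under the chosen per-example weighting."""
    return sorted(predictions, key=lambda m: weighted_accuracy(predictions[m], weights), reverse=True)


if __name__ == "__main__":
    uniform = [1.0] * len(DIFFICULTY)
    print("plain accuracy ranking     :", leaderboard(PREDICTIONS, uniform))
    # Emphasising hard examples reorders this toy leaderboard, which is the kind of
    # ranking fragility the related paper probes.
    print("difficulty-weighted ranking:", leaderboard(PREDICTIONS, DIFFICULTY))
```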
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.