Dynabench: Rethinking Benchmarking in NLP
- URL: http://arxiv.org/abs/2104.14337v1
- Date: Wed, 7 Apr 2021 17:49:17 GMT
- Title: Dynabench: Rethinking Benchmarking in NLP
- Authors: Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger,
Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia,
Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp,
Robin Jia, Mohit Bansal, Christopher Potts, Adina Williams
- Abstract summary: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
- Score: 82.26699038776812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation
and model benchmarking. Dynabench runs in a web browser and supports
human-and-model-in-the-loop dataset creation: annotators seek to create
examples that a target model will misclassify, but that another person will
not. In this paper, we argue that Dynabench addresses a critical need in our
community: contemporary models quickly achieve outstanding performance on
benchmark tasks but nonetheless fail on simple challenge examples and falter in
real-world scenarios. With Dynabench, dataset creation, model development, and
model assessment can directly inform each other, leading to more robust and
informative benchmarks. We report on four initial NLP tasks, illustrating these
concepts and highlighting the promise of the platform, and address potential
objections to dynamic benchmarking as a new standard for the field.
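The abstract describes the core selection rule of human-and-model-in-the-loop data collection: an annotator writes an example with an intended label, the target model predicts on it, and the example is kept only if the model's prediction disagrees with the annotator's label and other humans confirm that label. Below is a minimal sketch of that loop, assuming hypothetical stand-ins for the annotator interface, the target model, and the human validation step; it is illustrative only and not Dynabench's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    text: str
    gold_label: str    # label intended by the annotator
    model_label: str   # label predicted by the target model in the loop


def collect_adversarial_examples(
    annotator: Callable[[], Tuple[str, str]],   # yields (text, gold_label) pairs
    target_model: Callable[[str], str],         # stand-in for the model in the loop
    validate: Callable[[str, str], bool],       # other humans confirm the gold label
    num_needed: int,
) -> List[Example]:
    """Keep only examples that fool the target model but not human validators."""
    kept: List[Example] = []
    while len(kept) < num_needed:
        text, gold = annotator()
        pred = target_model(text)
        if pred == gold:
            continue  # model classified it correctly; annotator tries again
        if not validate(text, gold):
            continue  # validators disagree with the intended label; discard
        kept.append(Example(text, gold, pred))
    return kept
```

In the paper, examples collected this way feed subsequent rounds in which stronger models are trained on the new data and placed back in the loop, which is what makes the resulting benchmark dynamic rather than static.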
Related papers
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing them into new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- Open World Object Detection in the Era of Foundation Models [53.683963161370585]
We introduce a new benchmark that includes five real-world application-driven datasets.
We introduce a novel method, Foundation Object detection Model for the Open world, or FOMO, which identifies unknown objects based on their shared attributes with the base known objects.
arXiv Detail & Related papers (2023-12-10T03:56:06Z)
- Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z)
- DynaSent: A Dynamic Benchmark for Sentiment Analysis [31.724648265584445]
We introduce DynaSent, a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis.
DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform.
It has a total of 121,634 sentences, each validated by five crowdworkers.
arXiv Detail & Related papers (2020-12-30T22:38:21Z)
- Benchmarking Robustness of Machine Reading Comprehension Models [29.659586787812106]
We construct AdvRACE, a new model-agnostic benchmark for evaluating the robustness of MRC models under four different types of adversarial attacks.
We show that state-of-the-art (SOTA) models are vulnerable to all of these attacks.
We conclude that there is substantial room for building more robust MRC models and our benchmark can help motivate and measure progress in this area.
arXiv Detail & Related papers (2020-04-29T08:05:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.