Dynabench: Rethinking Benchmarking in NLP
- URL: http://arxiv.org/abs/2104.14337v1
- Date: Wed, 7 Apr 2021 17:49:17 GMT
- Title: Dynabench: Rethinking Benchmarking in NLP
- Authors: Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger,
  Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia,
  Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp,
  Robin Jia, Mohit Bansal, Christopher Potts, Adina Williams
- Abstract summary: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
- Score: 82.26699038776812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   We introduce Dynabench, an open-source platform for dynamic dataset creation
and model benchmarking. Dynabench runs in a web browser and supports
human-and-model-in-the-loop dataset creation: annotators seek to create
examples that a target model will misclassify, but that another person will
not. In this paper, we argue that Dynabench addresses a critical need in our
community: contemporary models quickly achieve outstanding performance on
benchmark tasks but nonetheless fail on simple challenge examples and falter in
real-world scenarios. With Dynabench, dataset creation, model development, and
model assessment can directly inform each other, leading to more robust and
informative benchmarks. We report on four initial NLP tasks, illustrating these
concepts and highlighting the promise of the platform, and address potential
objections to dynamic benchmarking as a new standard for the field.
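
The acceptance criterion described in the abstract (an example is kept when the target model misclassifies it but another person does not) can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's actual implementation: the `Example` class, the function names, the toy keyword model, and the use of a strict majority vote among validators are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    text: str
    annotator_label: str  # label the annotator intended (hypothetical field)


def fools_model(example: Example, predict: Callable[[str], str]) -> bool:
    """True when the target model's prediction differs from the annotator's label."""
    return predict(example.text) != example.annotator_label


def humans_agree(example: Example, validator_labels: Sequence[str]) -> bool:
    """True when a strict majority of validators give the annotator's label.

    Majority voting is an assumption of this sketch; the abstract only says
    that another person should not misclassify the example.
    """
    votes = sum(1 for label in validator_labels if label == example.annotator_label)
    return votes * 2 > len(validator_labels)


def keep_example(
    example: Example,
    predict: Callable[[str], str],
    validator_labels: Sequence[str],
) -> bool:
    """Keep an example only if it fools the model and humans still agree on it."""
    return fools_model(example, predict) and humans_agree(example, validator_labels)


def toy_sentiment_model(text: str) -> str:
    """Stand-in target model: a naive keyword rule, not a real classifier."""
    return "positive" if "good" in text.lower() else "negative"


# The keyword rule misreads negation, so this example fools the toy model.
hard = Example("Not good at all.", "negative")
# The keyword rule gets this one right, so it is not a useful adversarial example.
easy = Example("This was good.", "positive")
```

In this sketch, `keep_example(hard, toy_sentiment_model, ["negative", "negative", "positive"])` accepts the example, while `easy` is rejected because the model already classifies it correctly.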
 
      
        Related papers
        - RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
 Reward models are used throughout the post-training of language models to capture nuanced signals from preference data.
The community has begun establishing best practices for evaluating reward models.
This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
 arXiv  Detail & Related papers  (2025-06-02T17:54:04Z)
- Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models [24.481028155002523]
 We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task.
ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation.
It is scalable to tasks and languages where collecting real-world data is costly or impractical.
 arXiv  Detail & Related papers  (2025-04-01T17:40:08Z)
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
 Existing robust benchmark datasets have two key limitations.
They generate only a limited range of perturbations for a single Information Extraction (IE) task.
Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench.
We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
 arXiv  Detail & Related papers  (2025-03-05T05:39:29Z)
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
 We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
 arXiv  Detail & Related papers  (2024-10-29T22:56:18Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
 We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
 arXiv  Detail & Related papers  (2024-10-14T17:51:23Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
 Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
 arXiv  Detail & Related papers  (2024-07-22T17:52:12Z)
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
 We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
 arXiv  Detail & Related papers  (2024-04-15T00:03:16Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
 This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
 arXiv  Detail & Related papers  (2024-02-18T03:40:06Z)
- Open World Object Detection in the Era of Foundation Models [53.683963161370585]
 We introduce a new benchmark that includes five real-world application-driven datasets.
We introduce a novel method, Foundation Object detection Model for the Open world, or FOMO, which identifies unknown objects based on their shared attributes with the base known objects.
 arXiv  Detail & Related papers  (2023-12-10T03:56:06Z)
- Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking [41.99715850562528]
 We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
 arXiv  Detail & Related papers  (2021-05-21T01:17:52Z)
- DynaSent: A Dynamic Benchmark for Sentiment Analysis [31.724648265584445]
 We introduce DynaSent, a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis.
DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform.
It has a total of 121,634 sentences, each validated by five crowdworkers.
 arXiv  Detail & Related papers  (2020-12-30T22:38:21Z)
- Benchmarking Robustness of Machine Reading Comprehension Models [29.659586787812106]
 We construct AdvRACE, a new model-agnostic benchmark for evaluating the robustness of MRC models under four different types of adversarial attacks.
We show that state-of-the-art (SOTA) models are vulnerable to all of these attacks.
We conclude that there is substantial room for building more robust MRC models and our benchmark can help motivate and measure progress in this area.
 arXiv  Detail & Related papers  (2020-04-29T08:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     