What Will it Take to Fix Benchmarking in Natural Language Understanding?
- URL: http://arxiv.org/abs/2104.02145v1
- Date: Mon, 5 Apr 2021 20:36:11 GMT
- Title: What Will it Take to Fix Benchmarking in Natural Language Understanding?
- Authors: Samuel R. Bowman and George E. Dahl
- Abstract summary: We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
- Score: 30.888416756627155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation for many natural language understanding (NLU) tasks is broken:
Unreliable and biased systems score so highly on standard benchmarks that there
is little room for researchers who develop better systems to demonstrate their
improvements. The recent trend to abandon IID benchmarks in favor of
adversarially-constructed, out-of-distribution test sets ensures that current
models will perform poorly, but ultimately only obscures the abilities that we
want our benchmarks to measure. In this position paper, we lay out four
criteria that we argue NLU benchmarks should meet. We argue most current
benchmarks fail at these criteria, and that adversarial data collection does
not meaningfully address the causes of these failures. Instead, restoring a
healthy evaluation ecosystem will require significant progress in the design of
benchmark datasets, the reliability with which they are annotated, their size,
and the ways they handle social bias.
Related papers
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z) - A Comparative Analysis on Ethical Benchmarking in Large Language Models [0.0]
This work contributes to the field of Machine Ethics (ME) benchmarking, which develops tests to assess whether intelligent systems accurately represent human values and act accordingly.
We identify three major issues with current ME benchmarks: limited ecological validity due to unrealistic ethical dilemmas, unstructured question generation without clear inclusion/exclusion criteria, and a lack of scalability due to reliance on human annotations.
We introduce two new ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both featuring real-world ethical dilemmas from the medical domain.
arXiv Detail & Related papers (2024-10-11T05:05:21Z) - Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
arXiv Detail & Related papers (2024-07-18T17:00:23Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z) - Rethinking Benchmark and Contamination for Language Models with
Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Benchmarks for Automated Commonsense Reasoning: A Survey [0.0]
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
arXiv Detail & Related papers (2023-02-09T16:34:30Z) - Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, an common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z) - Reliable Evaluations for Natural Language Inference based on a Unified
Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-datasets benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.