What Will it Take to Fix Benchmarking in Natural Language Understanding?
- URL: http://arxiv.org/abs/2104.02145v1
- Date: Mon, 5 Apr 2021 20:36:11 GMT
- Title: What Will it Take to Fix Benchmarking in Natural Language Understanding?
- Authors: Samuel R. Bowman and George E. Dahl
- Abstract summary: We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
- Score: 30.888416756627155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation for many natural language understanding (NLU) tasks is broken:
Unreliable and biased systems score so highly on standard benchmarks that there
is little room for researchers who develop better systems to demonstrate their
improvements. The recent trend to abandon IID benchmarks in favor of
adversarially-constructed, out-of-distribution test sets ensures that current
models will perform poorly, but ultimately only obscures the abilities that we
want our benchmarks to measure. In this position paper, we lay out four
criteria that we argue NLU benchmarks should meet. We argue most current
benchmarks fail at these criteria, and that adversarial data collection does
not meaningfully address the causes of these failures. Instead, restoring a
healthy evaluation ecosystem will require significant progress in the design of
benchmark datasets, the reliability with which they are annotated, their size,
and the ways they handle social bias.
Related papers
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.
This poses a significant challenge to ensuring their safe deployment.
We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications.
Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs [60.25940747590386]
We propose How2Bench, which is comprised of a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively.
We profiled 274 benchmarks released within the past decade and found concerning issues.
Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% did not even open source or only partially open source.
arXiv Detail & Related papers (2025-01-18T09:51:57Z) - BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z) - A Comparative Analysis on Ethical Benchmarking in Large Language Models [0.0]
This work contributes to the field of Machine Ethics (ME) benchmarking, which develops tests to assess whether intelligent systems accurately represent human values and act accordingly.
We identify three major issues with current ME benchmarks: limited ecological validity due to unrealistic ethical dilemmas, unstructured question generation without clear inclusion/exclusion criteria, and a lack of scalability due to reliance on human annotations.
We introduce two new ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both featuring real-world ethical dilemmas from the medical domain.
arXiv Detail & Related papers (2024-10-11T05:05:21Z) - Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
arXiv Detail & Related papers (2024-07-18T17:00:23Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.