Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
- URL: http://arxiv.org/abs/2407.13696v2
- Date: Thu, 12 Sep 2024 08:36:47 GMT
- Title: Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
- Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
- Abstract summary: We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers.
- Score: 15.565644819269803
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and upending the ability to properly choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research. BenchBench Package: github.com/IBM/BenchBench Leaderboard: hf.co/spaces/IBM/BenchBench
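The abstract names rank correlation as a typical agreement metric for BAT. As a rough illustration only, the sketch below computes Kendall's tau between two benchmarks over the models they share. It does not use the BenchBench package or its API; the function name `kendall_tau`, the model names, and all scores are hypothetical and chosen purely for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks, over the models both have scored.

    scores_a, scores_b: dicts mapping model name -> benchmark score.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both benchmarks order the two models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for illustration (not taken from the paper).
new_benchmark = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62, "model_d": 0.55}
reference_benchmark = {"model_a": 0.90, "model_b": 0.70, "model_c": 0.75, "model_d": 0.40}

tau = kendall_tau(new_benchmark, reference_benchmark)
print(f"Kendall tau over {len(new_benchmark)} shared models: {tau:.2f}")
# prints: Kendall tau over 4 shared models: 0.67
```

In this toy case five of the six model pairs are ordered the same way by both benchmarks and one pair is flipped, giving (5 - 1) / 6. The value depends on which models and which reference benchmark are included, so a single number like this is best read as one sample rather than a verdict.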
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following.
For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses.
Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - DEP: A Decentralized Large Language Model Evaluation Protocol [51.3646001384887]
Decentralized Evaluation Protocol (DEP) is a decentralized yet unified and standardized evaluation framework.
By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation.
We develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control.
arXiv Detail & Related papers (2026-03-01T16:10:16Z) - When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers.
Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age.
Expert-curated benchmarks resist saturation better than crowdsourced ones.
arXiv Detail & Related papers (2026-02-18T16:51:37Z) - Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts [49.99400612296149]
We find that models can ace many benchmarks without strong visual understanding.
This is especially problematic for vision-centric benchmarks that are meant to require visual inputs.
We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be gamed.
arXiv Detail & Related papers (2025-11-06T18:43:21Z) - Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks [34.09939383415074]
Benchmark Profiling decomposes benchmark performance into ten cognitively grounded abilities.
It explains why performance gains do not always translate into user-perceived competence.
arXiv Detail & Related papers (2025-09-23T15:32:47Z) - Deprecating Benchmarks: Criteria and Framework [2.6449913368815516]
We propose criteria to decide when to fully or partially deprecate benchmarks, and a framework for deprecating benchmarks.
Our work aims to advance the state of benchmarking towards rigorous and quality evaluations, especially for frontier models.
arXiv Detail & Related papers (2025-07-08T22:29:06Z) - LastingBench: Defend Benchmarks Against Knowledge Leakage [5.476393238638673]
The complexity of large language models (LLMs) raises concerns about their ability to "cheat" on standard Question Answering (QA) benchmarks by memorizing task-specific data.
This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage.
LastingBench is a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage.
arXiv Detail & Related papers (2025-06-21T13:01:04Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of so-called platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs [60.25940747590386]
We propose How2Bench, which is comprised of a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively.
We profiled 274 benchmarks released within the past decade and found concerning issues.
Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% were not open-sourced or were only partially open-sourced.
arXiv Detail & Related papers (2025-01-18T09:51:57Z) - BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z) - ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z) - How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark [60.72725673114168]
We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets.
We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark.
arXiv Detail & Related papers (2023-12-21T03:11:30Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.