How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
- URL: http://arxiv.org/abs/2501.10711v3
- Date: Mon, 17 Feb 2025 13:49:45 GMT
- Title: How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
- Authors: Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, Pinjia He, Shuai Wang, Zibin Zheng, Michael R. Lyu, Shing-Chi Cheung
- Abstract summary: We propose How2Bench, which comprises a 55-criteria checklist serving as a set of guidelines to govern the development of code-related benchmarks comprehensively.
We profiled 274 benchmarks released within the past decade and found concerning issues.
Nearly 70% of the benchmarks took no measures for data quality assurance; over 10% were not open-sourced at all, or were only partially open-sourced.
- Score: 60.25940747590386
- License:
- Abstract: Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which comprises a 55-criteria checklist serving as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures for data quality assurance; over 10% were not open-sourced at all, or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference code/tests/prompts, and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
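One of the loopholes highlighted above, duplicated samples, can be screened for mechanically before a benchmark is released. The snippet below is a minimal illustrative sketch (not from the paper): it hashes whitespace- and case-normalized prompt/solution pairs to flag exact and near-exact duplicates, assuming a hypothetical JSONL benchmark file whose records carry `id`, `prompt`, and `solution` fields.

```python
import hashlib
import json
from collections import defaultdict

def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting
    differences do not hide duplicated samples."""
    return " ".join(text.lower().split())

def find_duplicates(samples):
    """Group benchmark samples whose normalized prompt + reference
    solution hash to the same value. `samples` is an iterable of
    dicts with 'id', 'prompt', and 'solution' keys (assumed schema)."""
    buckets = defaultdict(list)
    for s in samples:
        key = hashlib.sha256(
            (normalize(s["prompt"]) + "\x00" + normalize(s["solution"])).encode("utf-8")
        ).hexdigest()
        buckets[key].append(s["id"])
    return [ids for ids in buckets.values() if len(ids) > 1]

if __name__ == "__main__":
    # Hypothetical benchmark file: one JSON object per line.
    with open("benchmark.jsonl", encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    for group in find_duplicates(samples):
        print("Possible duplicated samples:", group)
```

A check like this only catches textual duplicates; semantic near-duplicates would need embedding- or AST-based comparison, which is beyond this sketch.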
Related papers
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
- MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark [57.999567012489706]
We propose a contamination-free and more challenging benchmark called MMLU-CF.
This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage.
Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set.
arXiv Detail & Related papers (2024-12-19T18:58:04Z)
- Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers (a minimal rank-correlation sketch of the BAT idea appears after this list).
arXiv Detail & Related papers (2024-07-18T17:00:23Z)
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- The Fault in our Stars: Quality Assessment of Code Generation Benchmarks [0.5137309756089941]
We conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models.
We analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues.
arXiv Detail & Related papers (2024-04-15T22:02:58Z)
- TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs [12.839640915518443]
Benchmarking is the de facto standard for evaluating LLMs, due to its speed, replicability, and low cost.
Recent work has pointed out that the majority of the open source benchmarks available today have been contaminated or leaked into LLMs.
We propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
arXiv Detail & Related papers (2024-03-01T09:28:38Z)
- A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry [63.52264764099532]
We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases.
A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks.
arXiv Detail & Related papers (2023-05-05T07:44:23Z)
- Benchmarks for Automated Commonsense Reasoning: A Survey [0.0]
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
arXiv Detail & Related papers (2023-02-09T16:34:30Z)
- What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z)
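For context on the Benchmark Agreement Testing (BAT) entry above: the core idea is that two benchmarks agree to the extent that they rank the same set of models similarly, which can be quantified with a rank correlation such as Kendall's tau. The sketch below illustrates that idea only; it is not the BenchBench API, and the model names and scores are made up.

```python
from scipy.stats import kendalltau

def benchmark_agreement(scores_a, scores_b):
    """Kendall's tau between two benchmarks' scores for the same models.
    `scores_a` and `scores_b` map model name -> score; only models
    present in both benchmarks are compared."""
    shared = sorted(set(scores_a) & set(scores_b))
    tau, p_value = kendalltau(
        [scores_a[m] for m in shared],
        [scores_b[m] for m in shared],
    )
    return tau, p_value

# Illustrative (made-up) scores for three models on two benchmarks.
bench_x = {"model-a": 71.9, "model-b": 65.3, "model-c": 58.0}
bench_y = {"model-a": 80.1, "model-b": 74.4, "model-c": 61.2}
print(benchmark_agreement(bench_x, bench_y))  # tau near 1 => the benchmarks rank models alike
```

A tau near 1 means the two benchmarks order the models almost identically, while a tau near 0 means they provide little mutual corroboration; methodological choices such as which models are included can shift this value, which is the issue the BenchBench paper examines.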
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.