How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
- URL: http://arxiv.org/abs/2501.10711v3
- Date: Mon, 17 Feb 2025 13:49:45 GMT
- Title: How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
- Authors: Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, Pinjia He, Shuai Wang, Zibin Zheng, Michael R. Lyu, Shing-Chi Cheung
- Abstract summary: We propose How2Bench, which comprises a 55-criteria checklist serving as a set of guidelines to govern the development of code-related benchmarks comprehensively.
We profiled 274 benchmarks released within the past decade and found concerning issues.
Nearly 70% of the benchmarks took no measures for data quality assurance; over 10% were not open-sourced at all, or were only partially open-sourced.
- Score: 60.25940747590386
- License:
- Abstract: Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which comprises a 55-criteria checklist serving as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures for data quality assurance; over 10% were not open-sourced at all, or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference code/tests/prompts, and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
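One of the loopholes highlighted above, duplicated samples, can be screened for mechanically before a benchmark is released. The snippet below is a minimal illustrative sketch (not from the paper): it hashes whitespace- and case-normalized prompt/solution pairs to flag exact and near-exact duplicates, assuming a hypothetical JSONL benchmark file whose records carry `id`, `prompt`, and `solution` fields.

```python
import hashlib
import json
from collections import defaultdict

def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial formatting
    differences do not hide duplicated samples."""
    return " ".join(text.lower().split())

def find_duplicates(samples):
    """Group benchmark samples whose normalized prompt + reference
    solution hash to the same value. `samples` is an iterable of
    dicts with 'id', 'prompt', and 'solution' keys (assumed schema)."""
    buckets = defaultdict(list)
    for s in samples:
        key = hashlib.sha256(
            (normalize(s["prompt"]) + "\x00" + normalize(s["solution"])).encode("utf-8")
        ).hexdigest()
        buckets[key].append(s["id"])
    return [ids for ids in buckets.values() if len(ids) > 1]

if __name__ == "__main__":
    # Hypothetical benchmark file: one JSON object per line.
    with open("benchmark.jsonl", encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    for group in find_duplicates(samples):
        print("Possible duplicated samples:", group)
```

A check like this only catches textual duplicates; semantic near-duplicates would need embedding- or AST-based comparison, which is beyond this sketch.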
Related papers
- Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability.
Motivated by this gap in the evaluation of reliability, we propose the concept of platinum benchmarks.
We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z)
- MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark [57.999567012489706]
We propose a contamination-free and more challenging benchmark called MMLU-CF.
This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage.
Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set.
arXiv Detail & Related papers (2024-12-19T18:58:04Z)
- Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench [15.565644819269803]
We show how some overlooked methodological choices can significantly influence Benchmark Agreement Testing (BAT) results.
We introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers (a minimal rank-correlation sketch of the BAT idea appears after this list).
arXiv Detail & Related papers (2024-07-18T17:00:23Z)
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- The Fault in our Stars: Quality Assessment of Code Generation Benchmarks [0.5137309756089941]
We conduct the first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models.
We analyzed 3,566 prompts from 9 code generation benchmarks to identify quality issues.
arXiv Detail & Related papers (2024-04-15T22:02:58Z)
- TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs [12.839640915518443]
Benchmarking is the de facto standard for evaluating LLMs, due to its speed, replicability, and low cost.
Recent work has pointed out that the majority of the open source benchmarks available today have been contaminated or leaked into LLMs.
We propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
arXiv Detail & Related papers (2024-03-01T09:28:38Z)
- A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry [63.52264764099532]
We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases.
A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks.
arXiv Detail & Related papers (2023-05-05T07:44:23Z)
- Benchmarks for Automated Commonsense Reasoning: A Survey [0.0]
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
arXiv Detail & Related papers (2023-02-09T16:34:30Z)
- What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z)
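For context on the Benchmark Agreement Testing (BAT) entry above: the core idea is that two benchmarks agree to the extent that they rank the same set of models similarly, which can be quantified with a rank correlation such as Kendall's tau. The sketch below illustrates that idea only; it is not the BenchBench API, and the model names and scores are made up.

```python
from scipy.stats import kendalltau

def benchmark_agreement(scores_a, scores_b):
    """Kendall's tau between two benchmarks' scores for the same models.
    `scores_a` and `scores_b` map model name -> score; only models
    present in both benchmarks are compared."""
    shared = sorted(set(scores_a) & set(scores_b))
    tau, p_value = kendalltau(
        [scores_a[m] for m in shared],
        [scores_b[m] for m in shared],
    )
    return tau, p_value

# Illustrative (made-up) scores for three models on two benchmarks.
bench_x = {"model-a": 71.9, "model-b": 65.3, "model-c": 58.0}
bench_y = {"model-a": 80.1, "model-b": 74.4, "model-c": 61.2}
print(benchmark_agreement(bench_x, bench_y))  # tau near 1 => the benchmarks rank models alike
```

A tau near 1 means the two benchmarks order the models almost identically, while a tau near 0 means they provide little mutual corroboration; methodological choices such as which models are included can shift this value, which is the issue the BenchBench paper examines.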
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.