Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
- URL: http://arxiv.org/abs/2503.05860v2
- Date: Tue, 28 Oct 2025 10:40:52 GMT
- Title: Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
- Authors: Roham Koohestani, Philippe de Bekker, Begüm Koç, Maliheh Izadi
- Abstract summary: We conduct a review of 247 studies, identifying 273 AI4SE benchmarks since 2014. We categorize them, expose gaps in current practices, and introduce BenchScout, a semantic search tool for locating suitable benchmarks. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks since 2014. We categorize them, analyze limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout employs automated clustering with contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To improve benchmarking standards, we propose BenchFrame, a unified framework for enhancing benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, featuring corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed average pass-at-1 drops on HumanEvalNext of 31.22% and 19.94% relative to HumanEval and HumanEvalPlus, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. All review data, user study materials, and enhanced benchmarks are publicly released.
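The abstract outlines BenchScout's retrieval pipeline: contextual embeddings of benchmark-related studies, automated clustering, and dimensionality reduction for a browsable map. Below is a minimal sketch of such an embed-cluster-reduce-search pipeline; the embedding model, cluster count, example abstracts, and query are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of an embed -> cluster -> reduce -> search pipeline in the spirit
# of BenchScout. Model choice, cluster count, and data are assumptions.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "A benchmark for Python code generation from docstrings.",
    "A dataset of real bug-fix commits for automated program repair.",
    "A suite of code translation tasks across five languages.",
    "Unit-test-based evaluation of LLM-written patches.",
]

# 1) Contextual embeddings of benchmark-related studies.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emb = model.encode(abstracts, normalize_embeddings=True)

# 2) Automated clustering groups benchmarks by task family.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)

# 3) Dimensionality reduction to 2D for an explorable map.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(emb)

# Semantic search: rank benchmarks by similarity to a free-text query.
query = model.encode(["benchmark for fixing bugs in programs"], normalize_embeddings=True)
scores = cosine_similarity(query, emb)[0]
for i in scores.argsort()[::-1]:
    print(f"cluster={labels[i]}  xy=({coords[i,0]:.1f},{coords[i,1]:.1f})  "
          f"score={scores[i]:.3f}  {abstracts[i]}")
```

For context on the reported numbers, pass-at-1 is the fraction of problems a model solves with a single sampled completion, so the quoted drops are relative decreases in that solve rate on HumanEvalNext.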
Related papers
- When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Expert-curated benchmarks resist saturation better than crowdsourced ones.
arXiv Detail & Related papers (2026-02-18T16:51:37Z) - Benchmark^2: Systematic Evaluation of LLM Benchmarks [66.2731798872668]
We propose Benchmark^2, a comprehensive framework comprising three complementary metrics. We conduct experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction can achieve comparable evaluation performance.
arXiv Detail & Related papers (2026-01-07T14:59:03Z) - AI Benchmark Democratization and Carpentry [12.180796797521062]
Large language models often outpace static benchmarks, causing a gap between benchmark results and real-world performance. Current benchmarks often emphasize peak performance on top-tier hardware, offering limited guidance for diverse, real-world scenarios. Democratization requires both technical innovation and systematic education across levels, building sustained expertise in benchmark design and use.
arXiv Detail & Related papers (2025-12-12T14:20:05Z) - Fantastic Bugs and Where to Find Them in AI Benchmarks [28.604919035475188]
We introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions. Our approach builds on a core assumption commonly used in AI evaluations: that the mean score sufficiently summarizes model performance. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision.
arXiv Detail & Related papers (2025-11-20T22:49:21Z) - Deprecating Benchmarks: Criteria and Framework [2.6449913368815516]
We propose criteria to decide when to fully or partially deprecate benchmarks, and a framework for deprecating benchmarks. Our work aims to advance the state of benchmarking towards rigorous and quality evaluations, especially for frontier models.
arXiv Detail & Related papers (2025-07-08T22:29:06Z) - RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z) - Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks [47.40240774236047]
We compare four Chat Llama 2 models against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators.
Most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences.
arXiv Detail & Related papers (2025-02-24T01:01:02Z) - How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs [60.25940747590386]
We propose How2Bench, which comprises a 55-criteria checklist, a set of guidelines to govern the development of code-related benchmarks comprehensively.
We profiled 274 benchmarks released within the past decade and found concerning issues.
Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% did not open-source their artifacts at all, or only partially open-sourced them.
arXiv Detail & Related papers (2025-01-18T09:51:57Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find large quality differences across benchmarks and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Introducing v0.5 of the AI Safety Benchmark from MLCommons [101.98401637778638]
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group.
The benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models.
arXiv Detail & Related papers (2024-04-18T15:01:00Z) - ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - Benchmarks for Automated Commonsense Reasoning: A Survey [0.0]
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
arXiv Detail & Related papers (2023-02-09T16:34:30Z) - What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z)