Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models
- URL: http://arxiv.org/abs/2503.06643v1
- Date: Sun, 09 Mar 2025 14:41:18 GMT
- Title: Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models
- Authors: Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li
- Abstract summary: We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantic-preserving mutations to build a syntactically new yet semantically identical benchmark.
- Score: 19.06241383209599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input program with various semantic-preserving mutations to build a syntactically new yet semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the relative ranking of some models shifts dramatically, and (3) our dynamic benchmarks can resist the data contamination problem.
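The abstract does not list the concrete mutation operators, so the sketch below is only an assumed illustration of the general idea: a semantic-preserving transformation (here, consistent renaming of local identifiers with Python's `ast` module, Python 3.9+) that yields a syntactically new but behaviorally identical program.

```python
# Hedged sketch of one possible semantic-preserving mutation (identifier
# renaming). The mutation operators actually used in the paper are not stated
# in the abstract; this only illustrates the general idea.
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Consistently rename parameters and local variables to fresh identifiers."""

    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Map each original identifier to v0, v1, ... the first time it is seen.
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_arg(self, node):   # function parameters
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):  # variable loads and stores
        # Note: a full implementation would skip built-ins and imported names.
        node.id = self._fresh(node.id)
        return node


original = """
def total(prices):
    result = 0
    for price in prices:
        result += price
    return result
"""

mutated = ast.unparse(RenameIdentifiers().visit(ast.parse(original)))
print(mutated)  # syntactically different, semantically identical program
```

Evaluating a model on such mutated programs probes understanding of the original semantics while avoiding the memorized surface form.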
Related papers
- A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench [18.149327897427234]
We present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English).
arXiv Detail & Related papers (2025-07-11T11:16:01Z) - RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data. The community has begun establishing best practices for evaluating reward models. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation [16.889939234103153]
We propose to variabilize benchmarks and evaluate language models dynamically.
Specifically, we extract variables from each test case and define a value range for each variable.
For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time.
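A minimal sketch of this variabilization idea, assuming a simple arithmetic word-problem template (the template, variable names, and value ranges below are hypothetical, not taken from the VarBench paper):

```python
# Hedged sketch of dynamic variable perturbation: each test case becomes a
# template whose variables are re-sampled from declared ranges at evaluation
# time, so every run sees a fresh but equivalent instance. All names, ranges,
# and the template itself are illustrative assumptions.
import random

TEMPLATE = "{name} buys {count} apples at {price} dollars each. How much is spent in total?"
VALUE_RANGES = {
    "name": ["Alice", "Bob", "Chen"],
    "count": range(2, 20),
    "price": range(1, 10),
}


def sample_instance(seed=None):
    rng = random.Random(seed)
    values = {var: rng.choice(list(domain)) for var, domain in VALUE_RANGES.items()}
    question = TEMPLATE.format(**values)
    answer = values["count"] * values["price"]  # ground truth recomputed from sampled values
    return question, answer


question, answer = sample_instance(seed=0)
print(question, "->", answer)
```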
arXiv Detail & Related papers (2024-06-25T16:13:53Z) - Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models [10.482557806309174]
We introduce LexBench, a comprehensive evaluation suite designed to test language models (LMs) on semantic phrase processing tasks.
Using our benchmark, we assess the performance of 15 LMs across model architectures and parameter scales on classification, extraction, and interpretation tasks.
Our benchmarking findings can inform future research aiming to improve the general capability of LMs in semantic phrase comprehension.
arXiv Detail & Related papers (2024-05-05T09:20:38Z) - The Fault in our Stars: Quality Assessment of Code Generation Benchmarks [0.5137309756089941]
We conduct a first-of-its-kind study of the quality of prompts within benchmarks used to compare the performance of different code generation models.
We analyze 3,566 prompts from 9 code generation benchmarks to identify quality issues in them.
arXiv Detail & Related papers (2024-04-15T22:02:58Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - A Theory of Dynamic Benchmarks [24.170405353348592]
We study the benefits and practical limitations of dynamic benchmarking.
These results provide a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
arXiv Detail & Related papers (2022-10-06T18:56:46Z) - BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z) - Dynabench: Rethinking Benchmarking in NLP [82.26699038776812]
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
arXiv Detail & Related papers (2021-04-07T17:49:17Z) - Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
arXiv Detail & Related papers (2021-02-01T18:55:38Z) - Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)