Automating Benchmark Design
- URL: http://arxiv.org/abs/2510.25039v1
- Date: Tue, 28 Oct 2025 23:53:36 GMT
- Title: Automating Benchmark Design
- Authors: Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma,
- Abstract summary: We develop BeTaL, a framework that automates the process of dynamic benchmark design.<n>We create two new benchmarks and extend a popular agentic benchmark.<n>BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2%.
- Score: 17.34266257717423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress and widespread deployment of LLMs and LLM-powered agents has outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark $\tau$-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2% -- a 2-4x improvement over the baselines.
Related papers
- RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty [102.02839046225468]
RankLLM is a novel framework designed to quantify both question difficulty and model competency.<n>We evaluate 30 models on 35,550 questions across multiple domains.
arXiv Detail & Related papers (2026-02-12T21:28:46Z) - OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling [13.57588221678224]
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling.<n>The boundaries of their capabilities in automated formulation and problem solving remain poorly understood.<n>We propose OPT-ENGINE, a benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels.
arXiv Detail & Related papers (2026-01-09T09:22:33Z) - Benchmark^2: Systematic Evaluation of LLM Benchmarks [66.2731798872668]
We propose Benchmark2, a comprehensive framework comprising three complementary metrics.<n>We conduct experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains.<n>Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction can achieve comparable evaluation performance.
arXiv Detail & Related papers (2026-01-07T14:59:03Z) - PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code [1.1164117387254457]
Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI.<n>Key requirement for these systems is their ability to accurately follow user instructions.<n>We present PACIFIC, a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities.
arXiv Detail & Related papers (2025-12-11T14:49:56Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers.<n>Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching.<n>We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z) - NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset [16.676904484703]
We introduce NaturalGAIA, a novel benchmark engineered on the principle of Causal Pathways.<n>This paradigm structures complex tasks into a series of verifiable atomic steps, ensuring rigorous, fully automated, and reproducible standard for assessment.<n>We then utilize this dataset to perform Reinforcement FineTuning (RFT) on the Q2.5-VL-7B model.
arXiv Detail & Related papers (2025-08-02T11:53:41Z) - StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns [7.60350050736492]
Long-term memory is essential for large language models to achieve autonomous intelligence.<n>Existing benchmarks face challenges in evaluating knowledge retention and dynamic sequential reasoning.<n>We propose a novel benchmark framework based on interactive fiction games.
arXiv Detail & Related papers (2025-06-16T10:54:31Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.<n>Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM
Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs)
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.