Benchmarking that Matters: Rethinking Benchmarking for Practical Impact
- URL: http://arxiv.org/abs/2511.12264v1
- Date: Sat, 15 Nov 2025 15:42:15 GMT
- Title: Benchmarking that Matters: Rethinking Benchmarking for Practical Impact
- Authors: Anna V. Kononova, Niki van Stein, Olaf Mersmann, Thomas Bäck, Thomas Bartz-Beielstein, Tobias Glasmachers, Michael Hellwig, Sebastian Krey, Jakub Kůdela, Boris Naujoks, Leonard Papenmeier, Elena Raponi, Quentin Renau, Jeroen Rook, Lennart Schäpermeier, Diederick Vermetten, Daniela Zaharie
- Abstract summary: We propose a vision centered on curated real-world-inspired benchmarks, practitioner-accessible feature spaces and community-maintained performance databases. Real progress requires coordinated effort: A living benchmarking ecosystem that evolves with real-world insights and supports both scientific understanding and industrial use.
- Score: 2.952553461344481
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarking has driven scientific progress in Evolutionary Computation, yet current practices fall short of real-world needs. Widely used synthetic suites such as BBOB and CEC isolate algorithmic phenomena but poorly reflect the structure, constraints, and information limitations of continuous and mixed-integer optimization problems in practice. This disconnect leads to the misuse of benchmarking suites for competitions, automated algorithm selection, and industrial decision-making, despite these suites being designed for different purposes. We identify key gaps in current benchmarking practices and tooling, including limited availability of real-world-inspired problems, missing high-level features, and challenges in multi-objective and noisy settings. We propose a vision centered on curated real-world-inspired benchmarks, practitioner-accessible feature spaces and community-maintained performance databases. Real progress requires coordinated effort: A living benchmarking ecosystem that evolves with real-world insights and supports both scientific understanding and industrial use.
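The vision rests on three artifacts working together: curated problems, practitioner-accessible features, and shared performance data. As a purely hypothetical sketch of how such records might be structured (the class and field names below are illustrative assumptions, not a schema from the paper or any existing tool), consider:

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: illustrates the kind of structure a curated
# benchmark entry and a shared performance record could take. None of
# these names come from the paper or an existing tool.

@dataclass
class BenchmarkProblem:
    name: str                    # e.g. a real-world-inspired design task
    dimension: int
    variable_types: list[str]    # "continuous" or "integer" per variable
    n_objectives: int = 1        # > 1 for multi-objective settings
    noise_level: float = 0.0     # > 0 for noisy settings
    features: dict[str, float] = field(default_factory=dict)
    # ^ practitioner-accessible high-level features (e.g. multimodality)

@dataclass
class PerformanceRecord:
    problem: str
    algorithm: str
    budget: int                  # function evaluations consumed
    best_value: float
    seed: int                    # recorded so results stay reproducible

# Illustrative usage; all values are made up:
record = PerformanceRecord(problem="cooling_design", algorithm="CMA-ES",
                           budget=10_000, best_value=0.042, seed=1)
```

A community-maintained database of such records would let a practitioner query which algorithms performed well on problems whose feature values resemble their own task.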
Related papers
- BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [61.247730037229815]
We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes: resolution scope and knowledge scope. To investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
arXiv Detail & Related papers (2026-03-03T17:52:01Z) - Easy Data Unlearning Bench [53.1304932656586]
We introduce a unified benchmarking suite that simplifies the evaluation of unlearning algorithms. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods.
arXiv Detail & Related papers (2026-02-18T12:20:32Z) - Benchmarking Agents in Insurance Underwriting Environments [0.9728664856449597]
Existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts.
arXiv Detail & Related papers (2026-01-31T02:12:11Z) - Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering [19.584762693453893]
BEHELM is a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.
arXiv Detail & Related papers (2026-01-28T21:55:10Z) - InfoSynth: Information-Guided Benchmark Synthesis for LLMs [69.80981631587501]
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks.
arXiv Detail & Related papers (2026-01-02T05:26:27Z) - AI Benchmark Democratization and Carpentry [12.180796797521062]
Large language models are often evaluated on static benchmarks, causing a gap between benchmark results and real-world performance. Current benchmarks often emphasize peak performance on top-tier hardware, offering limited guidance for diverse, real-world scenarios. Democratization requires both technical innovation and systematic education across levels, building sustained expertise in benchmark design and use.
arXiv Detail & Related papers (2025-12-12T14:20:05Z) - SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z) - A Benchmark Suite for Multi-Objective Optimization in Battery Thermal Management System Design [0.0]
This study develops and presents a specialized benchmark suite for multi-objective optimization in Battery Thermal Management System (BTMS) design. The primary goal of this benchmark suite is to provide a practical and relevant testing ground for evolutionary algorithms and optimization methods.
arXiv Detail & Related papers (2025-10-29T06:48:22Z) - Metrics and evaluations for computational and sustainable AI efficiency [26.52588349722099]
Current approaches fail to provide a holistic view, making it difficult to compare and optimise systems. We propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics. Our framework provides pragmatic, carbon-aware evaluation by systematically measuring latency and throughput distributions, energy consumption, and location-adjusted carbon emissions.
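The location adjustment mentioned here reduces, at its core, to multiplying measured energy by the local grid's carbon intensity. A minimal sketch, with made-up region names and intensity values (none of these numbers come from the paper):

```python
# Illustrative grid carbon intensities in gCO2e per kWh; the values are
# placeholders, not real measurements.
GRID_INTENSITY_G_PER_KWH = {"low_carbon_grid": 50.0, "fossil_heavy_grid": 450.0}

def carbon_emissions_g(energy_kwh: float, region: str) -> float:
    """Location-adjusted emissions: energy consumed times grid intensity."""
    return energy_kwh * GRID_INTENSITY_G_PER_KWH[region]

# The same 2 kWh inference workload has a very different footprint by region:
print(carbon_emissions_g(2.0, "low_carbon_grid"))    # 100.0 g CO2e
print(carbon_emissions_g(2.0, "fossil_heavy_grid"))  # 900.0 g CO2e
```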
arXiv Detail & Related papers (2025-10-18T03:30:15Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection [40.174488947319645]
Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data; (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications; and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap.
arXiv Detail & Related papers (2025-03-30T14:11:46Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - AExGym: Benchmarks and Environments for Adaptive Experimentation [7.948144726705323]
We present a benchmark for adaptive experimentation based on real-world datasets.
We highlight prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity.
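To illustrate why batched/delayed feedback is one of the challenges named above: an adaptive design can only update its allocation after each batch completes, so it always acts on stale information. The sketch below is a generic epsilon-greedy illustration of that constraint, not code from AExGym; arm means and batch sizes are made up.

```python
import random

# Generic illustration of batched feedback in adaptive experimentation:
# the allocation rule only sees outcomes after a whole batch finishes,
# so it adapts on stale information. Not AExGym code.

TRUE_MEANS = [0.30, 0.50]          # hidden arm qualities (made up)
counts, sums = [0, 0], [0.0, 0.0]

def pick_arm(eps: float = 0.1) -> int:
    if random.random() < eps or 0 in counts:
        return random.randrange(2)                            # explore
    return max(range(2), key=lambda a: sums[a] / counts[a])   # exploit

for _ in range(20):                # 20 batches
    batch = [pick_arm() for _ in range(50)]                   # commit a full batch
    # Rewards arrive only after the batch completes (delayed feedback):
    for arm in batch:
        counts[arm] += 1
        sums[arm] += random.random() < TRUE_MEANS[arm]

print("allocation:", counts, "estimated means:",
      [round(sums[a] / counts[a], 2) for a in range(2)])
```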
arXiv Detail & Related papers (2024-08-08T15:32:12Z) - Benchopt: Reproducible, efficient and collaborative optimization benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
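For a flavor of the plugin pattern the Benchopt project documents, a solver is a class exposing a few hooks that the framework calls with increasing budgets. The ridge-regression sketch below follows that documented pattern, but hook names and signatures have changed across Benchopt releases, so treat the exact API as an assumption rather than a verified example:

```python
import numpy as np
from benchopt import BaseSolver  # requires the benchopt package

class Solver(BaseSolver):
    # Sketch of a Benchopt solver plugin; the hook names follow the pattern
    # in the project's docs but may differ by version (assumption).
    name = "gradient-descent"

    def set_objective(self, X, y, lmbd):
        # The benchmark's Objective hands the problem data to each solver.
        self.X, self.y, self.lmbd = X, y, lmbd

    def run(self, n_iter):
        # Benchopt re-runs this with growing n_iter to trace convergence.
        X, y = self.X, self.y
        beta = np.zeros(X.shape[1])
        step = 1.0 / (np.linalg.norm(X, ord=2) ** 2 + self.lmbd)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y) + self.lmbd * beta
            beta -= step * grad
        self.beta = beta

    def get_result(self):
        # Newer Benchopt versions expect a dict of named results (assumption).
        return dict(beta=self.beta)
```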
arXiv Detail & Related papers (2022-06-27T16:19:24Z) - Mapping global dynamics of benchmark creation and saturation in artificial intelligence [5.233652342195164]
We create maps of the global dynamics of benchmark creation and saturation.
We curated data for 1688 benchmarks covering the entire domains of computer vision and natural language processing.
arXiv Detail & Related papers (2022-03-09T09:16:49Z)