Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization
- URL: http://arxiv.org/abs/2505.21321v1
- Date: Tue, 27 May 2025 15:18:58 GMT
- Title: Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization
- Authors: Leonard Papenmeier, Luigi Nardi
- Abstract summary: Bencher is a modular benchmarking framework for black-box optimization. Each benchmark is isolated in its own virtual Python environment and accessed via a unified, version-agnostic remote procedure call (RPC) interface. Bencher can be deployed locally or remotely via Docker or on high-performance computing clusters via Singularity.
- Score: 5.703483582960509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Bencher, a modular benchmarking framework for black-box optimization that fundamentally decouples benchmark execution from optimization logic. Unlike prior suites that focus on combining many benchmarks in a single project, Bencher introduces a clean abstraction boundary: each benchmark is isolated in its own virtual Python environment and accessed via a unified, version-agnostic remote procedure call (RPC) interface. This design eliminates dependency conflicts and simplifies the integration of diverse, real-world benchmarks, which often have complex and conflicting software requirements. Bencher can be deployed locally or remotely via Docker or on high-performance computing (HPC) clusters via Singularity, providing a containerized, reproducible runtime for any benchmark. Its lightweight client requires minimal setup and supports drop-in evaluation of 80 benchmarks across continuous, categorical, and binary domains.
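The abstract describes a thin client that evaluates benchmarks running in isolated environments through a version-agnostic RPC interface. The sketch below illustrates that client/server split in plain Python; it is a minimal illustration of the pattern, not Bencher's actual API: the function names, the port, and the use of the standard-library XML-RPC module are assumptions, since the abstract does not specify the concrete protocol or client interface.

```python
# Minimal sketch of the decoupling pattern described in the abstract: each benchmark
# runs behind its own RPC server inside an isolated environment, and the optimizer
# talks to it through a thin client. All names, ports, and the use of XML-RPC are
# illustrative assumptions; Bencher's real protocol and client API are not shown here.

# --- benchmark side: would run inside the benchmark's own venv / container ---
from xmlrpc.server import SimpleXMLRPCServer

def evaluate(x):
    """Toy black-box objective (sphere function) standing in for a real benchmark."""
    return sum(xi * xi for xi in x)

def serve(port=8000):
    server = SimpleXMLRPCServer(("0.0.0.0", port), allow_none=True)
    server.register_function(evaluate, "evaluate")
    server.serve_forever()  # blocks; started once per benchmark container

# --- optimizer side: lightweight client with no benchmark dependencies ---
from xmlrpc.client import ServerProxy

def query_benchmark(x, host="localhost", port=8000):
    proxy = ServerProxy(f"http://{host}:{port}", allow_none=True)
    return proxy.evaluate(list(x))

if __name__ == "__main__":
    # Assumes serve() was started in a separate process/container beforehand.
    print(query_benchmark([0.5, -1.0, 2.0]))  # 5.25 for the toy sphere objective
```

Because the benchmark's dependencies live entirely behind the server process (in its own container or virtual environment), the optimizer-side client stays dependency-free, which is the decoupling the abstract emphasizes.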
Related papers
- MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation [17.461533973039064]
MultiKernelBench is a benchmark for the generation of deep learning kernels using large language models (LLMs). It spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms. We show significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies.
arXiv Detail & Related papers (2025-07-20T00:58:33Z)
- ConsumerBench: Benchmarking Generative AI Applications on End-User Devices [6.6246058403368595]
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices.
arXiv Detail & Related papers (2025-06-21T01:32:22Z)
- Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
- Employing Continuous Integration inspired workflows for benchmarking of scientific software -- a use case on numerical cut cell quadrature [0.3387808070669509]
This paper presents a proven approach that utilizes established Continuous Integration tools and practices to achieve high automation of benchmark execution and reporting. Our use case is the numerical integration (quadrature) on arbitrary domains, which are bounded by implicitly or parametrically defined curves or surfaces in 2D or 3D.
arXiv Detail & Related papers (2025-03-21T14:42:24Z)
- EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose SeBS-Flow, the first serverless workflow benchmarking suite. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
- RBoard: A Unified Platform for Reproducible and Reusable Recommender System Benchmarks [0.4312340306206883]
RBoard is a novel framework for benchmarking recommender systems.
It provides a comprehensive platform for benchmarking diverse recommendation tasks, including CTR prediction, Top-N recommendation, and others.
The framework evaluates algorithms across multiple datasets within each task, aggregating results for a holistic performance assessment.
arXiv Detail & Related papers (2024-09-09T11:35:35Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- A Multi-objective Optimization Benchmark Test Suite for Real-time Semantic Segmentation [22.707825213534125]
Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs).
We introduce a tailored streamline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs.
We present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset.
arXiv Detail & Related papers (2024-04-25T00:30:03Z)
- RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything [117.02741621686677]
This work explores a novel real-time segmentation setting called real-time multi-purpose segmentation. It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation. We present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM). It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding.
arXiv Detail & Related papers (2024-01-18T18:59:30Z)
- Dynabench: Rethinking Benchmarking in NLP [82.26699038776812]
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
arXiv Detail & Related papers (2021-04-07T17:49:17Z)
- AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite [26.820244556465333]
This paper proposes an agile domain-specific benchmarking methodology.
We identify ten important end-to-end application scenarios, from which sixteen representative AI tasks are distilled as AI component benchmarks.
We present the first end-to-end Internet service AI benchmark.
arXiv Detail & Related papers (2020-02-17T07:29:05Z)