Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization
- URL: http://arxiv.org/abs/2505.21321v1
- Date: Tue, 27 May 2025 15:18:58 GMT
- Title: Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization
- Authors: Leonard Papenmeier, Luigi Nardi
- Abstract summary: Bencher is a modular benchmarking framework for black-box optimization. Each benchmark is isolated in its own virtual Python environment and accessed via a unified, version-agnostic remote procedure call (RPC) interface. Bencher can be deployed locally or remotely via Docker or on high-performance computing clusters via Singularity.
- Score: 5.703483582960509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Bencher, a modular benchmarking framework for black-box optimization that fundamentally decouples benchmark execution from optimization logic. Unlike prior suites that focus on combining many benchmarks in a single project, Bencher introduces a clean abstraction boundary: each benchmark is isolated in its own virtual Python environment and accessed via a unified, version-agnostic remote procedure call (RPC) interface. This design eliminates dependency conflicts and simplifies the integration of diverse, real-world benchmarks, which often have complex and conflicting software requirements. Bencher can be deployed locally or remotely via Docker or on high-performance computing (HPC) clusters via Singularity, providing a containerized, reproducible runtime for any benchmark. Its lightweight client requires minimal setup and supports drop-in evaluation of 80 benchmarks across continuous, categorical, and binary domains.
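The abstract describes a thin client that evaluates benchmarks running in isolated environments through a version-agnostic RPC interface. The sketch below illustrates that client/server split in plain Python; it is a minimal illustration of the pattern, not Bencher's actual API: the function names, the port, and the use of the standard-library XML-RPC module are assumptions, since the abstract does not specify the concrete protocol or client interface.

```python
# Minimal sketch of the decoupling pattern described in the abstract: each benchmark
# runs behind its own RPC server inside an isolated environment, and the optimizer
# talks to it through a thin client. All names, ports, and the use of XML-RPC are
# illustrative assumptions; Bencher's real protocol and client API are not shown here.

# --- benchmark side: would run inside the benchmark's own venv / container ---
from xmlrpc.server import SimpleXMLRPCServer

def evaluate(x):
    """Toy black-box objective (sphere function) standing in for a real benchmark."""
    return sum(xi * xi for xi in x)

def serve(port=8000):
    server = SimpleXMLRPCServer(("0.0.0.0", port), allow_none=True)
    server.register_function(evaluate, "evaluate")
    server.serve_forever()  # blocks; started once per benchmark container

# --- optimizer side: lightweight client with no benchmark dependencies ---
from xmlrpc.client import ServerProxy

def query_benchmark(x, host="localhost", port=8000):
    proxy = ServerProxy(f"http://{host}:{port}", allow_none=True)
    return proxy.evaluate(list(x))

if __name__ == "__main__":
    # Assumes serve() was started in a separate process/container beforehand.
    print(query_benchmark([0.5, -1.0, 2.0]))  # 5.25 for the toy sphere objective
```

Because the benchmark's dependencies live entirely behind the server process (in its own container or virtual environment), the optimizer-side client stays dependency-free, which is the decoupling the abstract emphasizes.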
Related papers
- MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation [17.461533973039064]
MultiKernelBench is a benchmark for the generation of deep learning kernels using large language models (LLMs). It spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms. We show significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies.
arXiv Detail & Related papers (2025-07-20T00:58:33Z)
- ConsumerBench: Benchmarking Generative AI Applications on End-User Devices [6.6246058403368595]
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices.
arXiv Detail & Related papers (2025-06-21T01:32:22Z)
- Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
- Employing Continuous Integration inspired workflows for benchmarking of scientific software -- a use case on numerical cut cell quadrature [0.3387808070669509]
This paper presents a proven approach that utilizes established Continuous Integration tools and practices to achieve high automation of benchmark execution and reporting. Our use case is the numerical integration (quadrature) on arbitrary domains, which are bounded by implicitly or parametrically defined curves or surfaces in 2D or 3D.
arXiv Detail & Related papers (2025-03-21T14:42:24Z)
- EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
- SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose SeBS-Flow, the first serverless workflow benchmarking suite. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
- RBoard: A Unified Platform for Reproducible and Reusable Recommender System Benchmarks [0.4312340306206883]
RBoard is a novel framework for benchmarking recommender systems.
It provides a comprehensive platform for benchmarking diverse recommendation tasks, including CTR prediction, Top-N recommendation, and others.
The framework evaluates algorithms across multiple datasets within each task, aggregating results for a holistic performance assessment.
arXiv Detail & Related papers (2024-09-09T11:35:35Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- A Multi-objective Optimization Benchmark Test Suite for Real-time Semantic Segmentation [22.707825213534125]
Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs).
We introduce a tailored streamline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs.
We present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset.
arXiv Detail & Related papers (2024-04-25T00:30:03Z)
- RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything [117.02741621686677]
This work explores a novel real-time segmentation setting called real-time multi-purpose segmentation. It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation. We present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM). It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding.
arXiv Detail & Related papers (2024-01-18T18:59:30Z)
- Dynabench: Rethinking Benchmarking in NLP [82.26699038776812]
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
arXiv Detail & Related papers (2021-04-07T17:49:17Z)
- AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite [26.820244556465333]
This paper proposes an agile domain-specific benchmarking methodology.
We identify ten important end-to-end application scenarios, from which sixteen representative AI tasks are distilled as AI component benchmarks.
We present the first end-to-end Internet service AI benchmark.
arXiv Detail & Related papers (2020-02-17T07:29:05Z)