YAHPO Gym -- Design Criteria and a new Multifidelity Benchmark for
Hyperparameter Optimization
- URL: http://arxiv.org/abs/2109.03670v1
- Date: Wed, 8 Sep 2021 14:16:31 GMT
- Title: YAHPO Gym -- Design Criteria and a new Multifidelity Benchmark for
Hyperparameter Optimization
- Authors: Florian Pfisterer, Lennart Schneider, Julia Moosbauer, Martin Binder,
Bernd Bischl
- Abstract summary: We present a new surrogate-based benchmark suite for multifidelity HPO methods consisting of 9 benchmark collections that constitute over 700 multifidelity HPO problems in total.
All our benchmarks also allow for querying of multiple optimization targets, enabling the benchmarking of multi-objective HPO.
- Score: 1.0718353079920009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When developing and analyzing new hyperparameter optimization (HPO) methods,
it is vital to empirically evaluate and compare them on well-curated benchmark
suites. In this work, we list desirable properties and requirements for such
benchmarks and propose a new set of challenging and relevant multifidelity HPO
benchmark problems motivated by these requirements. For this, we revisit the
concept of surrogate-based benchmarks and empirically compare them to more
widely-used tabular benchmarks, showing that the latter may induce bias in
performance estimation and ranking of HPO methods. We present a new
surrogate-based benchmark suite for multifidelity HPO methods consisting of 9
benchmark collections that constitute over 700 multifidelity HPO problems in
total. All our benchmarks also allow for querying of multiple optimization
targets, enabling the benchmarking of multi-objective HPO. We examine and
compare our benchmark suite with respect to the defined requirements and show
that our benchmarks provide viable additions to existing suites.
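A minimal sketch of the surrogate idea, assuming nothing about YAHPO Gym's actual implementation: fit a regression model on logged (configuration, fidelity) -> target data, then answer queries at arbitrary configurations and fidelities. The random-forest surrogate, the toy hyperparameters, and the synthetic targets below are all illustrative assumptions.

```python
# Minimal sketch of a surrogate-based, multi-objective HPO benchmark.
# The random-forest surrogate and synthetic data are illustrative
# assumptions, not YAHPO Gym's actual implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Logged HPO runs: (learning_rate, num_layers, fidelity = epochs).
X = rng.uniform([1e-4, 1, 1], [1e-1, 8, 50], size=(500, 3))
val_error = 10 * X[:, 0] + 1.0 / X[:, 2] + 0.05 * rng.standard_normal(500)
train_time = 0.1 * X[:, 1] * X[:, 2] + 0.5 * rng.standard_normal(500)
Y = np.column_stack([val_error, train_time])  # two optimization targets

# A tabular benchmark can only look up rows of X; a surrogate answers
# *any* configuration at *any* fidelity after fitting a regression model.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(X, Y)

# Query an unseen configuration at an intermediate fidelity (epoch 25):
config = np.array([[3e-3, 4, 25]])
err, cost = surrogate.predict(config)[0]
print(f"predicted val_error={err:.3f}, train_time={cost:.1f}")
```

Because the surrogate interpolates, HPO methods can be compared over a continuous space rather than only on the discrete grid a table happens to contain, which is the bias the abstract attributes to tabular benchmarks; returning several targets per query is what enables the multi-objective benchmarking mentioned above.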
Related papers
- Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis [89.60263788590893]
Post-training quantization (PTQ) has been extensively adopted for compressing large language models (LLMs).
Existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth.
arXiv Detail & Related papers (2025-02-18T07:35:35Z)
- Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark [54.93461228053298]
We introduce our benchmark, Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline.
We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models.
arXiv Detail & Related papers (2024-12-23T08:15:34Z)
- LMEMs for post-hoc analysis of HPO Benchmarking [38.39259273088395]
We apply significance testing based on linear mixed-effects models (LMEMs) for post-hoc analysis of HPO benchmarking runs.
LMEMs allow flexible and expressive modeling of the entire experiment data, including information such as benchmark meta-features.
We demonstrate this through a case study on the PriorBand paper's experiment data, uncovering insights not reported in the original work.
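A rough illustration of that analysis, not the paper's actual code: treat the optimizer as a fixed effect and the benchmark as a random effect, so significance statements about optimizers account for per-benchmark variation. The results table below is synthetic, and the column names are assumptions.

```python
# Hedged sketch of LMEM-based post-hoc analysis of HPO runs; the data
# and column names (score, optimizer, benchmark) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for b in range(6):                        # six benchmark instances
    difficulty = 0.6 + 0.05 * b          # per-benchmark baseline score
    for optimizer, lift in [("BO", 0.03), ("RS", 0.0)]:
        for seed in range(5):            # five repetitions each
            rows.append({"benchmark": f"b{b}", "optimizer": optimizer,
                         "score": difficulty + lift
                                  + 0.01 * rng.standard_normal()})
df = pd.DataFrame(rows)

# Fixed effect: optimizer; random intercept per benchmark, so benchmark
# difficulty is modeled explicitly instead of inflating the residual noise.
lmem = smf.mixedlm("score ~ C(optimizer)", data=df, groups=df["benchmark"])
print(lmem.fit().summary())              # significance of the optimizer effect
```

Benchmark meta-features could enter the same formula as additional fixed effects, which is the kind of flexibility the summary above refers to.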
arXiv Detail & Related papers (2024-08-05T15:03:19Z)
- Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive combinatorial optimization precisely models many real-world applications, including energy cost-aware scheduling and budget allocation in advertising.
We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising.
Our study shows that PnO approaches outperform PtO on 7 out of 8 benchmarks, but no silver bullet is found among the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z)
- Obeying the Order: Introducing Ordered Transfer Hyperparameter Optimisation [10.761476482982077]
OTHPO is a version of transfer learning for HPO in which the tasks follow a sequential order.
We empirically show the importance of taking order into account using ten benchmarks.
We open source the benchmarks to foster future research on ordered transfer HPO.
arXiv Detail & Related papers (2023-06-29T13:08:36Z)
- FedHPO-B: A Benchmark Suite for Federated Hyperparameter Optimization [50.12374973760274]
We propose and implement FedHPO-B, a benchmark suite that incorporates comprehensive FL tasks, enables efficient function evaluations, and eases continuing extensions.
We also conduct extensive experiments based on FedHPO-B to benchmark a few HPO methods.
arXiv Detail & Related papers (2022-06-08T15:29:10Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO [30.89560505052524]
We propose HPOBench, which includes 7 existing and 5 new benchmark families, with in total more than 100 multi-fidelity benchmark problems.
HPOBench allows running this extendable set of multi-fidelity HPO benchmarks in a reproducible way by isolating and packaging the individual benchmarks in containers.
arXiv Detail & Related papers (2021-09-14T14:28:51Z)
- HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML [5.735035463793008]
We present HPO-B, a large-scale benchmark for comparing HPO algorithms.
Our benchmark is assembled and preprocessed from the OpenML repository.
We detail explicit experimental protocols, splits, and evaluation measures for comparing methods for both non-transfer and transfer learning HPO.
arXiv Detail & Related papers (2021-06-11T09:18:39Z)
- Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
arXiv Detail & Related papers (2021-02-01T18:55:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.