YAHPO Gym -- Design Criteria and a new Multifidelity Benchmark for
Hyperparameter Optimization
- URL: http://arxiv.org/abs/2109.03670v1
- Date: Wed, 8 Sep 2021 14:16:31 GMT
- Title: YAHPO Gym -- Design Criteria and a new Multifidelity Benchmark for
Hyperparameter Optimization
- Authors: Florian Pfisterer, Lennart Schneider, Julia Moosbauer, Martin Binder,
Bernd Bischl
- Abstract summary: We present a new surrogate-based benchmark suite for multifidelity HPO methods consisting of 9 benchmark collections that constitute over 700 multifidelity HPO problems in total.
All our benchmarks also allow for querying of multiple optimization targets, enabling the benchmarking of multi-objective HPO.
- Score: 1.0718353079920009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When developing and analyzing new hyperparameter optimization (HPO) methods,
it is vital to empirically evaluate and compare them on well-curated benchmark
suites. In this work, we list desirable properties and requirements for such
benchmarks and propose a new set of challenging and relevant multifidelity HPO
benchmark problems motivated by these requirements. For this, we revisit the
concept of surrogate-based benchmarks and empirically compare them to more
widely-used tabular benchmarks, showing that the latter may induce bias in
performance estimation and ranking of HPO methods. We present a new
surrogate-based benchmark suite for multifidelity HPO methods consisting of 9
benchmark collections that constitute over 700 multifidelity HPO problems in
total. All our benchmarks also allow for querying of multiple optimization
targets, enabling the benchmarking of multi-objective HPO. We examine and
compare our benchmark suite with respect to the defined requirements and show
that our benchmarks provide viable additions to existing suites.
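A minimal sketch of the surrogate idea, assuming nothing about YAHPO Gym's actual implementation: fit a regression model on logged (configuration, fidelity) -> target data, then answer queries at arbitrary configurations and fidelities. The random-forest surrogate, the toy hyperparameters, and the synthetic targets below are all illustrative assumptions.

```python
# Minimal sketch of a surrogate-based, multi-objective HPO benchmark.
# The random-forest surrogate and synthetic data are illustrative
# assumptions, not YAHPO Gym's actual implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Logged HPO runs: (learning_rate, num_layers, fidelity = epochs).
X = rng.uniform([1e-4, 1, 1], [1e-1, 8, 50], size=(500, 3))
val_error = 10 * X[:, 0] + 1.0 / X[:, 2] + 0.05 * rng.standard_normal(500)
train_time = 0.1 * X[:, 1] * X[:, 2] + 0.5 * rng.standard_normal(500)
Y = np.column_stack([val_error, train_time])  # two optimization targets

# A tabular benchmark can only look up rows of X; a surrogate answers
# *any* configuration at *any* fidelity after fitting a regression model.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(X, Y)

# Query an unseen configuration at an intermediate fidelity (epoch 25):
config = np.array([[3e-3, 4, 25]])
err, cost = surrogate.predict(config)[0]
print(f"predicted val_error={err:.3f}, train_time={cost:.1f}")
```

Because the surrogate interpolates, HPO methods can be compared over a continuous space rather than only on the discrete grid a table happens to contain, which is the bias the abstract attributes to tabular benchmarks; returning several targets per query is what enables the multi-objective benchmarking mentioned above.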
Related papers
- Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis [89.60263788590893]
Post-training quantization (PTQ) has been extensively adopted for compressing large language models (LLMs).
Existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth.
arXiv Detail & Related papers (2025-02-18T07:35:35Z)
- Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark [54.93461228053298]
We introduce our benchmark, Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline.
We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models.
arXiv Detail & Related papers (2024-12-23T08:15:34Z)
- LMEMs for post-hoc analysis of HPO Benchmarking [38.39259273088395]
We apply significance testing based on linear mixed-effects models (LMEMs) for post-hoc analysis of HPO benchmarking runs.
LMEMs allow flexible and expressive modeling of the entire experiment data, including information such as benchmark meta-features.
We demonstrate this through a case study on the PriorBand paper's experiment data, uncovering insights not reported in the original work.
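A rough illustration of that analysis, not the paper's actual code: treat the optimizer as a fixed effect and the benchmark as a random effect, so significance statements about optimizers account for per-benchmark variation. The results table below is synthetic, and the column names are assumptions.

```python
# Hedged sketch of LMEM-based post-hoc analysis of HPO runs; the data
# and column names (score, optimizer, benchmark) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for b in range(6):                        # six benchmark instances
    difficulty = 0.6 + 0.05 * b          # per-benchmark baseline score
    for optimizer, lift in [("BO", 0.03), ("RS", 0.0)]:
        for seed in range(5):            # five repetitions each
            rows.append({"benchmark": f"b{b}", "optimizer": optimizer,
                         "score": difficulty + lift
                                  + 0.01 * rng.standard_normal()})
df = pd.DataFrame(rows)

# Fixed effect: optimizer; random intercept per benchmark, so benchmark
# difficulty is modeled explicitly instead of inflating the residual noise.
lmem = smf.mixedlm("score ~ C(optimizer)", data=df, groups=df["benchmark"])
print(lmem.fit().summary())              # significance of the optimizer effect
```

Benchmark meta-features could enter the same formula as additional fixed effects, which is the kind of flexibility the summary above refers to.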
arXiv Detail & Related papers (2024-08-05T15:03:19Z)
- Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive combinatorial optimization precisely models many real-world applications, including energy cost-aware scheduling and budget allocation in advertising.
We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising.
Our study shows that PnO approaches outperform PtO on 7 out of 8 benchmarks, but no silver bullet is found among the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z)
- Obeying the Order: Introducing Ordered Transfer Hyperparameter Optimisation [10.761476482982077]
OTHPO is a version of transfer learning for HPO in which the tasks follow a sequential order.
We empirically show the importance of taking order into account using ten benchmarks.
We open source the benchmarks to foster future research on ordered transfer HPO.
arXiv Detail & Related papers (2023-06-29T13:08:36Z)
- FedHPO-B: A Benchmark Suite for Federated Hyperparameter Optimization [50.12374973760274]
We propose and implement FedHPO-B, a benchmark suite that incorporates comprehensive FL tasks, enables efficient function evaluations, and eases continuing extensions.
We also conduct extensive experiments based on FedHPO-B to benchmark a few HPO methods.
arXiv Detail & Related papers (2022-06-08T15:29:10Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO [30.89560505052524]
We propose HPOBench, which includes 7 existing and 5 new benchmark families, with in total more than 100 multi-fidelity benchmark problems.
HPOBench allows running this extendable set of multi-fidelity HPO benchmarks in a reproducible way by isolating and packaging the individual benchmarks in containers.
arXiv Detail & Related papers (2021-09-14T14:28:51Z)
- HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML [5.735035463793008]
We present HPO-B, a large-scale benchmark for comparing HPO algorithms.
Our benchmark is assembled and preprocessed from the OpenML repository.
We detail explicit experimental protocols, splits, and evaluation measures for comparing methods for both non-transfer and transfer learning HPO.
arXiv Detail & Related papers (2021-06-11T09:18:39Z)
- Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
arXiv Detail & Related papers (2021-02-01T18:55:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.