Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization
- URL: http://arxiv.org/abs/2405.15393v1
- Date: Fri, 24 May 2024 09:48:18 GMT
- Title: Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization
- Authors: Thomas Nagler, Lennart Schneider, Bernd Bischl, Matthias Feurer
- Abstract summary: We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance.
While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol.
- Score: 11.094232017583177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hyperparameter optimization is crucial for obtaining peak performance of machine learning models. The standard protocol evaluates various hyperparameter configurations using a resampling estimate of the generalization error to guide optimization and select a final hyperparameter configuration. Without much evidence, paired resampling splits, i.e., either a fixed train-validation split or a fixed cross-validation scheme, are often recommended. We show that, surprisingly, reshuffling the splits for every configuration often improves the final model's generalization performance on unseen data. Our theoretical analysis explains how reshuffling affects the asymptotic behavior of the validation loss surface and provides a bound on the expected regret in the limiting regime. This bound connects the potential benefits of reshuffling to the signal and noise characteristics of the underlying optimization problem. We confirm our theoretical results in a controlled simulation study and demonstrate the practical usefulness of reshuffling in a large-scale, realistic hyperparameter optimization experiment. While reshuffling leads to test performances that are competitive with using fixed splits, it drastically improves results for a single train-validation holdout protocol and can often make holdout become competitive with standard CV while being computationally cheaper.
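The central protocol change, drawing a fresh train-validation split for every hyperparameter configuration instead of reusing one fixed split, can be sketched as follows. This is a minimal illustration with scikit-learn; the dataset, model, and candidate grid are placeholder assumptions, not the paper's experimental setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
configs = [0.01, 0.1, 1.0, 10.0]  # candidate values of C (illustrative grid)

def holdout_score(C, seed):
    # Evaluate one configuration on a single train-validation holdout split.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_val, model.predict(X_val))

# Fixed-split protocol: every configuration is scored on the same split.
fixed_scores = {C: holdout_score(C, seed=42) for C in configs}

# Reshuffled protocol: each configuration gets a freshly shuffled split.
reshuffled_scores = {C: holdout_score(C, seed=i) for i, C in enumerate(configs)}

best_fixed = max(fixed_scores, key=fixed_scores.get)
best_reshuffled = max(reshuffled_scores, key=reshuffled_scores.get)
print(best_fixed, best_reshuffled)
```

Under reshuffling, no single unlucky split can systematically favor one configuration, which is the mechanism the paper's regret analysis formalizes.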
Related papers
- Tune without Validation: Searching for Learning Rate and Weight Decay on Training Sets [0.0]
Tune without validation (Twin) is a pipeline for tuning learning rate and weight decay.
We run extensive experiments on 20 image classification datasets and train several families of deep networks.
We demonstrate proper HP selection when training from scratch and fine-tuning, emphasizing small-sample scenarios.
arXiv Detail & Related papers (2024-03-08T18:57:00Z)
- Stability-Adjusted Cross-Validation for Sparse Linear Regression [5.156484100374059]
Cross-validation techniques like k-fold cross-validation substantially increase the computational cost of sparse regression.
We propose selecting hyperparameters that minimize a weighted sum of a cross-validation metric and a model's output stability.
Our confidence adjustment procedure reduces test set error by 2%, on average, on 13 real-world datasets.
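A selection rule of this shape, minimizing a weighted sum of a cross-validation error and an instability term, might be sketched like this. The stability measure (spread of per-fold coefficients), the weight, and the Lasso model are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
alphas = [0.01, 0.1, 1.0, 10.0]  # candidate regularization strengths
weight = 0.5                     # error/instability trade-off (assumed value)

def cv_error_and_instability(alpha):
    errors, coef_list = [], []
    for tr, val in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = Lasso(alpha=alpha, max_iter=10000).fit(X[tr], y[tr])
        errors.append(np.mean((model.predict(X[val]) - y[val]) ** 2))
        coef_list.append(model.coef_)
    # Instability: average distance of per-fold coefficients from their mean.
    coefs = np.array(coef_list)
    instability = np.mean(np.linalg.norm(coefs - coefs.mean(axis=0), axis=1))
    return np.mean(errors), instability

scores = {}
for a in alphas:
    err, inst = cv_error_and_instability(a)
    scores[a] = err + weight * inst

best_alpha = min(scores, key=scores.get)
print(best_alpha)
```

Penalizing instability biases the selection toward hyperparameters whose fitted models do not swing wildly between folds.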
arXiv Detail & Related papers (2023-06-26T17:02:45Z)
- HyperTime: Hyperparameter Optimization for Combating Temporal Distribution Shifts [26.205660967039087]
We use the lexicographic priority order on average validation loss and worst-case validation loss over chronological validation sets.
We show the strong empirical performance of the proposed method on multiple machine learning tasks with temporal distribution shifts.
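A lexicographic rule of this kind, comparing configurations first on average validation loss and breaking near-ties on worst-case loss over chronological validation sets, could look roughly like this. The tolerance and the candidate losses are illustrative assumptions, not values from the paper:

```python
# Each configuration's validation losses over chronological validation sets
# (illustrative numbers, not from the paper).
candidates = {
    "config_a": [0.28, 0.30, 0.45],  # good average, bad worst case
    "config_b": [0.31, 0.33, 0.36],  # similar average, stable worst case
    "config_c": [0.45, 0.46, 0.47],  # clearly worse average
}
tolerance = 0.02  # averages within this margin count as tied (assumed)

def avg(losses):
    return sum(losses) / len(losses)

best_avg = min(avg(l) for l in candidates.values())

# Lexicographic step: keep configs whose average loss is within tolerance of
# the best, then pick the one with the smallest worst-case loss.
finalists = {k: v for k, v in candidates.items() if avg(v) <= best_avg + tolerance}
winner = min(finalists, key=lambda k: max(finalists[k]))
print(winner)  # → config_b
```

Here config_a and config_b have nearly equal averages, so the worst-case loss over the chronological sets decides, favoring the temporally stable configuration.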
arXiv Detail & Related papers (2023-05-28T19:41:23Z)
- Hyperparameter Optimization through Neural Network Partitioning [11.6941692990626]
We propose a simple and efficient way to optimize hyperparameters in neural networks.
Our method partitions the training data and a neural network model into $K$ data shards and parameter partitions.
We demonstrate that this objective can be applied to optimize a variety of hyperparameters in a single training run.
arXiv Detail & Related papers (2023-04-28T11:24:41Z)
- Toward Theoretical Guidance for Two Common Questions in Practical Cross-Validation based Hyperparameter Selection [72.76113104079678]
We show the first theoretical treatments of two common questions in cross-validation based hyperparameter selection.
We show that these generalizations can, respectively, perform at least as well as always retraining or never retraining.
arXiv Detail & Related papers (2023-01-12T16:37:12Z)
- Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are made through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Provably tuning the ElasticNet across instances [53.0518090093538]
We consider the problem of tuning the regularization parameters of Ridge regression, LASSO, and the ElasticNet across multiple problem instances.
Our results are the first general learning-theoretic guarantees for this important class of problems.
arXiv Detail & Related papers (2022-07-20T21:22:40Z)
- Sample-Efficient Optimisation with Probabilistic Transformer Surrogates [66.98962321504085]
This paper investigates the feasibility of employing state-of-the-art probabilistic transformers in Bayesian optimisation.
We observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation.
We introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser that trades off accuracy against input sensitivity to filter favourable stationary points for improved predictive performance.
arXiv Detail & Related papers (2022-05-27T11:13:17Z)
- Overfitting in Bayesian Optimization: an empirical study and early-stopping solution [41.782410830989136]
We propose the first problem-adaptive and interpretable criterion to early stop BO.
We show that our approach can substantially reduce compute time with little to no loss of test accuracy.
arXiv Detail & Related papers (2021-04-16T15:26:23Z)
- How much progress have we made in neural network training? A New Evaluation Protocol for Benchmarking Optimizers [86.36020260204302]
We propose a new benchmarking protocol to evaluate both end-to-end efficiency and data-addition training efficiency.
A human study is conducted to show that our evaluation protocol matches human tuning behavior better than random search.
We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining.
arXiv Detail & Related papers (2020-10-19T21:46:39Z)
- Rethinking the Hyperparameters for Fine-tuning [78.15505286781293]
Fine-tuning from pre-trained ImageNet models has become the de-facto standard for various computer vision tasks.
Current practice for fine-tuning typically involves an ad-hoc choice of hyperparameters.
This paper re-examines several common practices for setting hyperparameters during fine-tuning.
arXiv Detail & Related papers (2020-02-19T18:59:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.