Hyperparameter Loss Surfaces Are Simple Near their Optima
- URL: http://arxiv.org/abs/2510.02721v1
- Date: Fri, 03 Oct 2025 04:52:27 GMT
- Title: Hyperparameter Loss Surfaces Are Simple Near their Optima
- Authors: Nicholas Lourie, He He, Kyunghyun Cho
- Abstract summary: We develop a technique based on random search to uncover the complex loss surface. Within this regime, the best scores from random search take on a new distribution we discover. From these features, we derive a new law for random search that can explain and extrapolate its convergence. These new tools enable new analyses, such as confidence intervals for the best possible performance.
- Score: 50.74035795378814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hyperparameters greatly impact models' capabilities; however, modern models are too large for extensive search. Instead, researchers design recipes that train well across scales based on their understanding of the hyperparameters. Despite this importance, few tools exist for understanding the hyperparameter loss surface. We discover novel structure in it and propose a new theory yielding such tools. The loss surface is complex, but as you approach the optimum simple structure emerges. It becomes characterized by a few basic features, like its effective dimension and the best possible loss. To uncover this asymptotic regime, we develop a novel technique based on random search. Within this regime, the best scores from random search take on a new distribution we discover. Its parameters are exactly the features defining the loss surface in the asymptotic regime. From these features, we derive a new asymptotic law for random search that can explain and extrapolate its convergence. These new tools enable new analyses, such as confidence intervals for the best possible performance or determining the effective number of hyperparameters. We make these tools available at https://github.com/nicholaslourie/opda .
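As a rough illustration of what "the best scores from random search" means, the sketch below runs random search on a toy loss and estimates the expected best loss after n trials from the empirical distribution of scores. It is not the paper's method and does not use the opda library: the loss function `toy_loss`, the two hyperparameters and their ranges, and the helper names are all hypothetical. The only fact it relies on is the standard identity for the minimum of n i.i.d. draws.

```python
import random


def toy_loss(lr_exp, dropout):
    """Hypothetical loss surface: a quadratic bowl with optimum 0.5 at (-3, 0.1)."""
    return 0.5 + (lr_exp + 3.0) ** 2 + 4.0 * (dropout - 0.1) ** 2


def random_search(n_trials, rng):
    """Sample hyperparameters uniformly at random and return every trial's loss."""
    return [
        toy_loss(rng.uniform(-6.0, 0.0), rng.uniform(0.0, 0.5))
        for _ in range(n_trials)
    ]


def expected_best(losses, n):
    """Expected minimum of n i.i.d. draws from the empirical loss distribution.

    With m sorted losses (0-indexed), the minimum of n draws with replacement
    equals the i-th smallest with probability
    ((m - i) / m) ** n - ((m - i - 1) / m) ** n.
    """
    ys = sorted(losses)
    m = len(ys)
    return sum(
        y * (((m - i) / m) ** n - ((m - i - 1) / m) ** n)
        for i, y in enumerate(ys)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
losses = random_search(200, rng)
budgets = (1, 2, 4, 8, 16, 32, 64)
curve = {n: expected_best(losses, n) for n in budgets}
for n, value in curve.items():
    print(f"n={n:3d}  expected best loss={value:.4f}")
```

The printed curve falls toward the smallest observed loss as n grows, which is the convergence behavior the abstract's asymptotic law is meant to describe and extrapolate.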
Related papers
- Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [59.369484219304866]
We conduct an unprecedented empirical investigation training over 3,700 Large Language Models (LLMs) from scratch across 100 trillion tokens. We establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law. Our estimated optima deviate from the global best performance found via exhaustive search by merely 0.094% on the test set.
arXiv Detail & Related papers (2025-03-06T18:58:29Z)
- Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function [24.457000214575245]
We introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance. This can be used to show that the learning theoretic complexity of the corresponding family of utility functions is bounded.
arXiv Detail & Related papers (2025-01-23T15:10:51Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers and four parameterizations.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections [59.839926875976225]
We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections.
In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters.
arXiv Detail & Related papers (2024-05-30T17:26:02Z)
- Should We Learn Most Likely Functions or Parameters? [51.133793272222874]
We investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data.
We find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.
arXiv Detail & Related papers (2023-11-27T16:39:55Z)
- Proximity to Losslessly Compressible Parameters [0.0]
In neural networks, an identical function can be implemented with fewer hidden units.
In the setting of single-hidden-layer hyperbolic tangent networks, we define the rank of a parameter as the minimum number of hidden units.
We show that the problem of tightly bounding the proximate rank of a parameter is NP-complete.
arXiv Detail & Related papers (2023-06-05T12:29:34Z)
- On the Effectiveness of Parameter-Efficient Fine-Tuning [79.6302606855302]
Currently, many research works propose to only fine-tune a small portion of the parameters while keeping most of the parameters shared across different tasks.
We show that all of the methods are actually sparse fine-tuned models and conduct a novel theoretical analysis of them.
Despite the effectiveness of sparsity grounded by our theory, it still remains an open problem of how to choose the tunable parameters.
arXiv Detail & Related papers (2022-11-28T17:41:48Z)
- Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks [38.153825455980645]
Recent empirical evidence indicates that the practice of overparameterization not only benefits training large models, but also assists - perhaps counterintuitively - building lightweight models.
This paper sheds light on these empirical findings by theoretically characterizing the high-dimensional asymptotics of model pruning.
We analytically identify regimes in which, even if the location of the most informative features is known, we are better off fitting a large model and then pruning.
arXiv Detail & Related papers (2020-12-16T05:13:30Z)
- Efficient hyperparameter optimization by way of PAC-Bayes bound minimization [4.191847852775072]
We present an alternative objective that is equivalent to a Probably Approximately Correct-Bayes (PAC-Bayes) bound on the expected out-of-sample error.
We then devise an efficient gradient-based algorithm to minimize this objective.
arXiv Detail & Related papers (2020-08-14T15:54:51Z)
- Weighted Random Search for Hyperparameter Optimization [0.0]
We introduce an improved version of Random Search (RS), used here for hyperparameter optimization of machine learning algorithms.
Unlike standard RS, we generate new values for each hyperparameter with a probability of change.
Within the same computational budget, our method yields better results than the standard RS.
arXiv Detail & Related papers (2020-04-03T15:41:22Z)