On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width
- URL: http://arxiv.org/abs/2312.12226v2
- Date: Sat, 8 Jun 2024 08:45:12 GMT
- Title: On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width
- Authors: Satoki Ishikawa, Ryo Karakida
- Abstract summary: We identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner.
Inspired by a maximal update parameterization, we consider a one-step update of the gradient.
Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo.
- Score: 5.217870815854702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even if the network width increases significantly. Inspired by a maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer the hyperparameters across models with different widths.
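As a rough, hedged illustration of what such a parameterization buys in practice, the sketch below rescales hyperparameters tuned on a narrow proxy model to wider models via width-dependent power laws. The helper `scale_hyperparams`, the base values, and the exponents are placeholders invented for illustration; the correct scales for SGD, K-FAC, and Shampoo are the ones derived in the paper.

```python
def scale_hyperparams(base, base_width, width, exponents):
    """Rescale hyperparameters tuned at base_width to a new width.

    exponents maps each hyperparameter to the power of (width / base_width)
    it is multiplied by; the values used below are placeholders, not the
    paper's derived prescription for SGD, K-FAC, or Shampoo.
    """
    ratio = width / base_width
    return {name: value * ratio ** exponents[name] for name, value in base.items()}

# Hyperparameters tuned once on a narrow proxy model (hypothetical values).
base = {"init_std": 0.05, "learning_rate": 0.3, "damping": 1e-3}
# Placeholder width-scaling exponents; -1.0 means "shrink linearly with width".
exponents = {"init_std": -0.5, "learning_rate": -1.0, "damping": -1.0}

for width in (256, 1024, 4096, 16384):
    hp = scale_hyperparams(base, base_width=256, width=width, exponents=exponents)
    print(width, {name: round(value, 8) for name, value in hp.items()})
```

Under such a scheme the same base values transfer across widths; the paper's contribution is identifying which scales make that transfer work for second-order optimizers.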
Related papers
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers and four parameterizations.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- Trajectory-Based Multi-Objective Hyperparameter Optimization for Model Retraining [8.598456741786801]
We present a novel trajectory-based multi-objective Bayesian optimization algorithm.
Our algorithm outperforms state-of-the-art multi-objective optimization methods in both locating better trade-offs and tuning efficiency.
arXiv Detail & Related papers (2024-05-24T07:43:45Z)
- Principled Architecture-aware Scaling of Hyperparameters [69.98414153320894]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings in benchmarks can be easily changed by training the networks better.
arXiv Detail & Related papers (2024-02-27T11:52:49Z)
- Hybrid GRU-CNN Bilinear Parameters Initialization for Quantum Approximate Optimization Algorithm [7.502733639318316]
We propose a hybrid optimization approach that integrates Gated Recurrent Units (GRU), Convolutional Neural Networks (CNN), and a bilinear strategy as an innovative alternative to conventional approximations for predicting optimal parameters of QAOA circuits.
We employ the bilinear strategy to initialize QAOA circuit parameters at greater depths, with reference parameters obtained from GRU-CNN optimization.
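The bilinear strategy itself is not reproduced here; as a loose, hypothetical stand-in for the idea of initializing a deeper circuit from reference parameters of a shallower one, the sketch below simply re-interpolates an optimized depth-4 angle schedule onto a larger depth. The helper `extend_schedule` and the reference angles are made up for illustration.

```python
import numpy as np

def extend_schedule(angles, new_depth):
    """Initialize a deeper QAOA angle schedule from a shallower reference one.

    Plain linear re-interpolation is used here only as a stand-in for the
    paper's bilinear strategy; it is not that method.
    """
    old_grid = np.linspace(0.0, 1.0, len(angles))
    new_grid = np.linspace(0.0, 1.0, new_depth)
    return np.interp(new_grid, old_grid, angles)

# Hypothetical depth-4 reference parameters (e.g. produced by a parameter
# predictor such as the GRU-CNN model; hard-coded here).
gammas = np.array([0.20, 0.45, 0.70, 0.90])
betas = np.array([0.80, 0.55, 0.35, 0.15])

print("depth-6 gammas:", extend_schedule(gammas, 6).round(3))
print("depth-6 betas: ", extend_schedule(betas, 6).round(3))
```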
arXiv Detail & Related papers (2023-11-14T03:00:39Z)
- Improving Multi-fidelity Optimization with a Recurring Learning Rate for Hyperparameter Tuning [7.591442522626255]
We propose Multi-fidelity Optimization with a Recurring Learning rate (MORL).
MORL incorporates the optimization process of CNNs into multi-fidelity optimization.
It alleviates the slow-starter problem and achieves a more precise low-fidelity approximation.
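MORL's exact schedule is not reproduced here; as a minimal sketch of what a "recurring" learning rate looks like, the function below implements a cosine schedule that restarts every fixed number of steps. The period, bounds, and cosine shape are arbitrary illustrative choices, not the paper's.

```python
import math

def recurring_lr(step, period=100, lr_max=0.1, lr_min=0.001):
    """Cosine learning-rate schedule that restarts every `period` steps.

    One generic way to realize a recurring learning rate; the constants and
    the cosine shape are illustrative, not MORL's actual schedule.
    """
    phase = (step % period) / period  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * phase))

# The rate decays within each cycle and then restarts at lr_max.
print([round(recurring_lr(s), 4) for s in (0, 50, 99, 100, 150)])
```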
arXiv Detail & Related papers (2022-09-26T08:16:31Z)
- Towards Learning Universal Hyperparameter Optimizers with Transformers [57.35920571605559]
We introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction.
Our experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, and its performance can be further improved via its function uncertainty estimates.
arXiv Detail & Related papers (2022-05-26T12:51:32Z)
- Meta-Learning to Improve Pre-Training [38.75981465367226]
Pre-training (PT) followed by fine-tuning (FT) is an effective method for training neural networks.
PT can incorporate various design choices such as task and data reweighting strategies, augmentation policies, and noise models.
We propose an efficient, gradient-based algorithm to meta-learn PT hyperparameters.
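As a toy, hedged illustration of gradient-based hyperparameter meta-learning (not the paper's algorithm), the sketch below hand-derives the hypergradient of a validation loss with respect to an L2 regularization weight through a single inner gradient step of ridge regression and descends on it. The data, the inner problem, and all constants are synthetic stand-ins for pre-training design choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "pre-training" and "validation" splits for a linear model.
X, Xv = rng.normal(size=(50, 5)), rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=50)
yv = Xv @ w_true + 0.1 * rng.normal(size=20)

def train_grad(w, lam):
    # Gradient of (1/2n)||Xw - y||^2 + (lam/2)||w||^2 with respect to w.
    return X.T @ (X @ w - y) / len(y) + lam * w

w0 = 0.5 * np.ones(5)                # fixed starting point of the inner run
eta, lam, meta_lr = 0.1, 0.5, 10.0   # inner step size, initial lam, meta step size
for _ in range(50):
    w1 = w0 - eta * train_grad(w0, lam)           # one inner training step
    val_grad = Xv.T @ (Xv @ w1 - yv) / len(yv)    # grad of validation loss at w1
    # Chain rule: dw1/dlam = -eta * w0, so dL_val/dlam = val_grad . (-eta * w0).
    lam = max(lam - meta_lr * (val_grad @ (-eta * w0)), 0.0)
print("meta-learned regularization weight:", round(float(lam), 4))
```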
arXiv Detail & Related papers (2021-11-02T17:26:50Z)
- Automatic prior selection for meta Bayesian optimization with a case study on tuning deep neural network optimizers [47.013395100497775]
We propose a principled approach to solve such expensive hyperparameter tuning problems efficiently.
Key to the performance of Bayesian optimization (BO) is specifying and refining a distribution over functions, which is used to reason about the optima of the underlying function being optimized.
We verify our approach in realistic model training setups by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets.
arXiv Detail & Related papers (2021-09-16T20:46:26Z)
- Optimizing Large-Scale Hyperparameters via Automated Learning Algorithm [97.66038345864095]
We propose a new hyperparameter optimization method with zeroth-order hyper-gradients (HOZOG).
Specifically, we first formulate hyperparameter optimization as an A-based constrained optimization problem, where A is a black-box training algorithm.
Then, we use the average zeroth-order hyper-gradients to update the hyperparameters.
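As a toy illustration of the zeroth-order idea only (not HOZOG itself), the sketch below treats "train a model, then measure validation loss" as a black box and estimates the hyper-gradient of a ridge regularization weight by averaging random finite-difference probes. The data, the estimator, and all constants are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
X, Xv = rng.normal(size=(60, 8)), rng.normal(size=(30, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=60)
yv = Xv @ w_true + 0.1 * rng.normal(size=30)

def val_loss_after_training(lam):
    """Black-box pipeline: train ridge regression with weight lam, then report
    the validation loss; only this scalar is visible to the tuner."""
    w = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
    return 0.5 * np.mean((Xv @ w - yv) ** 2)

def zeroth_order_hypergrad(lam, mu=1e-2, samples=16):
    """Average random finite-difference probes of the black box; a generic
    zeroth-order estimator, not the HOZOG algorithm itself."""
    probes = []
    for _ in range(samples):
        u = rng.choice([-1.0, 1.0])  # random perturbation direction (1-D case)
        probes.append(u * (val_loss_after_training(lam + mu * u)
                           - val_loss_after_training(lam)) / mu)
    return float(np.mean(probes))

lam = 1.0
for _ in range(30):
    lam = max(lam - 5.0 * zeroth_order_hypergrad(lam), 1e-6)  # keep lam positive
print("tuned regularization weight:", round(lam, 4))
```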
arXiv Detail & Related papers (2021-02-17T21:03:05Z)
- Online hyperparameter optimization by real-time recurrent learning [57.01871583756586]
Our framework takes advantage of the analogy between hyperparameter optimization and parameter learning in recurrent neural networks (RNNs).
It adapts a well-studied family of online learning algorithms for RNNs to tune hyperparameters and network parameters simultaneously.
This procedure yields systematically better generalization performance compared to standard methods, at a fraction of wallclock time.
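The paper's RTRL-based machinery is not reproduced here; as a much simpler illustration of tuning a hyperparameter and the parameters simultaneously during training, the sketch below adapts the learning rate online by hypergradient descent (using the inner product of consecutive gradients) on a toy quadratic. All constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.diag([1.0, 5.0, 10.0])      # toy objective 0.5 * w^T A w

def grad(w):
    return A @ w

w = rng.normal(size=3)
lr, beta = 0.01, 1e-4              # initial learning rate, hyper step size
prev_g = np.zeros(3)
for _ in range(300):
    g = grad(w)
    # Online hypergradient: dL(w_t)/dlr = -g_t . g_{t-1}, so increase lr when
    # consecutive gradients agree and decrease it when they oscillate.
    lr = max(lr + beta * float(g @ prev_g), 1e-5)
    w, prev_g = w - lr * g, g
print("adapted lr:", round(lr, 4), "final loss:", round(float(0.5 * w @ A @ w), 8))
```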
arXiv Detail & Related papers (2021-02-15T19:36:18Z)
- Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step for constructing self-tuning quadratics.
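Only loosely in the spirit of the summary above (the paper's gradient filtering is not reproduced), the sketch below uses gradients and Hessian-vector products to build a local quadratic model along the gradient direction and takes the step size that minimizes it. The objective is a standard toy function, and the Hessian-vector product is approximated by finite differences rather than computed exactly per sample.

```python
import numpy as np

def f(w):  # 2-D Rosenbrock, a standard toy objective
    return (1.0 - w[0]) ** 2 + 100.0 * (w[1] - w[0] ** 2) ** 2

def grad(w):
    return np.array([
        -2.0 * (1.0 - w[0]) - 400.0 * w[0] * (w[1] - w[0] ** 2),
        200.0 * (w[1] - w[0] ** 2),
    ])

def hvp(w, v, eps=1e-5):
    # Hessian-vector product via finite differences of the gradient
    # (a stand-in for the exact per-sample products used in the paper).
    return (grad(w + eps * v) - grad(w)) / eps

w = np.array([-1.0, 1.5])
for _ in range(100):
    g = grad(w)
    curvature = float(g @ hvp(w, g))          # g^T H g along the gradient
    # Minimize the quadratic model m(a) = f(w) - a * g.g + 0.5 * a^2 * g^T H g.
    alpha = float(g @ g) / curvature if curvature > 0 else 1e-3
    w = w - alpha * g
print("final point:", w.round(4), "objective:", round(float(f(w)), 6))
```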
arXiv Detail & Related papers (2020-11-09T22:07:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.