Principled Architecture-aware Scaling of Hyperparameters
- URL: http://arxiv.org/abs/2402.17440v1
- Date: Tue, 27 Feb 2024 11:52:49 GMT
- Title: Principled Architecture-aware Scaling of Hyperparameters
- Authors: Wuyang Chen, Junru Wu, Zhangyang Wang, Boris Hanin
- Abstract summary: Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings can be easily changed by better training networks in benchmarks.
- Score: 69.98414153320894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a high-quality deep neural network requires choosing suitable
hyperparameters, which is a non-trivial and expensive process. Current works
try to automatically optimize or design principles of hyperparameters, such
that they can generalize to diverse unseen scenarios. However, most designs or
optimization methods are agnostic to the choice of network structures, and thus
largely ignore the impact of neural architectures on hyperparameters. In this
work, we precisely characterize the dependence of initializations and maximal
learning rates on the network architecture, which includes the network depth,
width, convolutional kernel size, and connectivity patterns. By pursuing every
parameter to be maximally updated with the same mean squared change in
pre-activations, we can generalize our initialization and learning rates across
MLPs (multi-layer perception) and CNNs (convolutional neural network) with
sophisticated graph topologies. We verify our principles with comprehensive
experiments. More importantly, our strategy further sheds light on advancing
current benchmarks for architecture design. A fair comparison of AutoML
algorithms requires accurate network rankings. However, we demonstrate that
network rankings can be easily changed by better training networks in
benchmarks with our architecture-aware learning rates and initialization.
Related papers
- Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally [2.645067871482715]
In machine learning tasks, one searches for an optimal function within a certain functional space.
This way forces the evolution of the function during training to lie within the realm of what is expressible with the chosen architecture.
We show that the information about desirable architectural changes, due to expressivity bottlenecks can be extracted from %the backpropagation.
arXiv Detail & Related papers (2024-05-30T08:23:56Z) - Neuroevolution of Recurrent Architectures on Control Tasks [3.04585143845864]
We implement a massively parallel evolutionary algorithm and run experiments on all 19 OpenAI Gym state-based reinforcement learning control tasks.
We find that dynamic agents match or exceed the performance of gradient-based agents while utilizing orders of magnitude fewer parameters.
arXiv Detail & Related papers (2023-04-03T16:29:18Z) - Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalized from the sample-wise analysis into the real batch setting, NIO is able to automatically look for a better initialization with negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - Edge of chaos as a guiding principle for modern neural network training [19.419382003562976]
We study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram.
In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset.
arXiv Detail & Related papers (2021-07-20T12:17:55Z) - Differentiable Neural Architecture Learning for Efficient Neural Network
Design [31.23038136038325]
We introduce a novel emph architecture parameterisation based on scaled sigmoid function.
We then propose a general emphiable Neural Architecture Learning (DNAL) method to optimize the neural architecture without the need to evaluate candidate neural networks.
arXiv Detail & Related papers (2021-03-03T02:03:08Z) - Firefly Neural Architecture Descent: a General Approach for Growing
Neural Networks [50.684661759340145]
Firefly neural architecture descent is a general framework for progressively and dynamically growing neural networks.
We show that firefly descent can flexibly grow networks both wider and deeper, and can be applied to learn accurate but resource-efficient neural architectures.
In particular, it learns networks that are smaller in size but have higher average accuracy than those learned by the state-of-the-art methods.
arXiv Detail & Related papers (2021-02-17T04:47:18Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture method for initializing neural networks.
It is based on a simple agnostic; the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Dynamic Hierarchical Mimicking Towards Consistent Optimization
Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability.
Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network.
Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient combined nonvolutionity renders learning susceptible to novel problems.
We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.