Automatic Gradient Descent: Deep Learning without Hyperparameters
- URL: http://arxiv.org/abs/2304.05187v1
- Date: Tue, 11 Apr 2023 12:45:52 GMT
- Title: Automatic Gradient Descent: Deep Learning without Hyperparameters
- Authors: Jeremy Bernstein and Chris Mingard and Kevin Huang and Navid Azizan
and Yisong Yue
- Abstract summary: The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology.
The paper builds a new framework for deriving optimisation algorithms that explicitly leverage neural architecture: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture.
- Score: 35.350274248478804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The architecture of a deep neural network is defined explicitly in terms of
the number of layers, the width of each layer and the general network topology.
Existing optimisation frameworks neglect this information in favour of implicit
architectural information (e.g. second-order methods) or architecture-agnostic
distance functions (e.g. mirror descent). Meanwhile, the most popular optimiser
in practice, Adam, is based on heuristics. This paper builds a new framework
for deriving optimisation algorithms that explicitly leverage neural
architecture. The theory extends mirror descent to non-convex composite
objective functions: the idea is to transform a Bregman divergence to account
for the non-linear structure of neural architecture. Working through the
details for deep fully-connected networks yields automatic gradient descent: a
first-order optimiser without any hyperparameters. Automatic gradient descent
trains both fully-connected and convolutional networks out-of-the-box and at
ImageNet scale. A PyTorch implementation is available at
https://github.com/jxbz/agd and also in Appendix B. Overall, the paper supplies
a rigorous theoretical foundation for a next-generation of
architecture-dependent optimisers that work automatically and without
hyperparameters.
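The abstract's "without any hyperparameters" claim rests on replacing the hand-tuned learning rate with a step size computed from the gradients themselves, with per-layer updates scaled to the layer dimensions. Below is a minimal, hedged PyTorch sketch of such an update for a bias-free fully-connected network: the orthogonal initialisation, the sqrt(d_out / d_in) scaling and the closed-form learning rate reflect one reading of the paper's Algorithm 1 and should be treated as assumptions, not a restatement of the official code; the authoritative implementation is at https://github.com/jxbz/agd and in Appendix B of the paper.

```python
import math
import torch
from torch import nn

# Hedged sketch of an AGD-style update for a bias-free fully-connected network.
# The initialisation, the sqrt(d_out / d_in) scaling and the closed-form learning
# rate are assumptions here; defer to https://github.com/jxbz/agd for the real thing.

def agd_init_(model):
    """Orthogonal initialisation, rescaled per layer (assumed scaling)."""
    for layer in model:
        if isinstance(layer, nn.Linear):
            d_out, d_in = layer.weight.shape
            nn.init.orthogonal_(layer.weight)
            layer.weight.data *= math.sqrt(d_out / d_in)

@torch.no_grad()
def agd_step(model):
    """One update: normalise each layer's gradient and apply an automatic step size."""
    linears = [m for m in model if isinstance(m, nn.Linear)]
    L = len(linears)
    # "Gradient summary": dimension-weighted average of per-layer gradient norms.
    G = sum(math.sqrt(m.weight.shape[0] / m.weight.shape[1]) * m.weight.grad.norm().item()
            for m in linears) / L
    eta = math.log((1 + math.sqrt(1 + 4 * G)) / 2)  # automatic learning rate (assumed form)
    for m in linears:
        d_out, d_in = m.weight.shape
        g = m.weight.grad
        m.weight -= (eta / L) * math.sqrt(d_out / d_in) * g / (g.norm() + 1e-12)

# Toy usage on random data: no learning rate, schedule or weight decay to choose.
model = nn.Sequential(nn.Linear(32, 64, bias=False), nn.ReLU(), nn.Linear(64, 10, bias=False))
agd_init_(model)
X, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
for _ in range(10):
    model.zero_grad()
    nn.functional.cross_entropy(model(X), y).backward()
    agd_step(model)
```

The point of the sketch is the shape of the method rather than its exact constants: per-layer gradient normalisation plus an automatically computed step size, with no user-supplied hyperparameters.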
Related papers
- Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally [2.645067871482715]
In machine learning tasks, one searches for an optimal function within a certain functional space.
This forces the evolution of the function during training to stay within what is expressible with the chosen architecture.
We show that information about desirable architectural changes due to expressivity bottlenecks can be extracted from backpropagation.
arXiv Detail & Related papers (2024-05-30T08:23:56Z) - Principled Architecture-aware Scaling of Hyperparameters [69.98414153320894]
Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process.
In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture.
We demonstrate that network rankings can be easily changed by better training networks in benchmarks.
arXiv Detail & Related papers (2024-02-27T11:52:49Z) - DepGraph: Towards Any Structural Pruning [68.40343338847664]
We study general structural pruning of arbitrary architecture like CNNs, RNNs, GNNs and Transformers.
We propose a general and fully automatic method, Dependency Graph (DepGraph), to explicitly model the dependency between layers and comprehensively group parameters for pruning.
In this work, we extensively evaluate our method on several architectures and tasks, including ResNe(X)t, DenseNet, MobileNet and Vision transformer for images, GAT for graph, DGCNN for 3D point cloud, alongside LSTM for language, and demonstrate that, even with a
arXiv Detail & Related papers (2023-01-30T14:02:33Z) - Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalizing the sample-wise analysis to the real batch setting, the resulting method (NIO) automatically finds a better initialization with negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - Towards Disentangling Information Paths with Coded ResNeXt [11.884259630414515]
We take a novel approach to enhance the transparency of the function of the whole network.
We propose a neural network architecture for classification, in which the information that is relevant to each class flows through specific paths.
arXiv Detail & Related papers (2022-02-10T21:45:49Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - On the Implicit Biases of Architecture & Gradient Descent [46.34988166338264]
This paper finds that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin.
New technical tools suggest a nuanced portrait of generalisation involving both the implicit biases of architecture and gradient descent.
arXiv Detail & Related papers (2021-10-08T17:36:37Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value (a toy sketch of this idea appears after this list).
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - A Semi-Supervised Assessor of Neural Architectures [157.76189339451565]
We employ an auto-encoder to discover meaningful representations of neural architectures.
A graph convolutional neural network is introduced to predict the performance of architectures.
arXiv Detail & Related papers (2020-05-14T09:02:33Z)
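As flagged in the GradInit entry above, here is a toy, hedged sketch of the stated idea: adjust each layer's scale so that a single simulated optimizer step yields the smallest possible loss. This is not the official GradInit algorithm; the two-layer model, the plain-SGD inner step, the log-space scales and the step sizes are assumptions made to keep the sketch runnable, and further details of the real method are omitted.

```python
import torch
from torch import nn

# Toy illustration (not the official GradInit code): learn one scale per layer so
# that one simulated SGD step on the scaled weights gives the lowest loss.
torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 3, (256,))
layers = [nn.Linear(20, 64), nn.Linear(64, 3)]             # fixed reference initialisation
log_scales = torch.zeros(len(layers), requires_grad=True)  # per-layer scales, in log space
loss_fn = nn.CrossEntropyLoss()
inner_lr, scale_lr = 0.1, 0.05                             # assumed step sizes

def forward(params):
    """Functional forward pass through the (scaled or stepped) parameters."""
    h = X
    for i, (W, b) in enumerate(params):
        h = h @ W.t() + b
        if i < len(params) - 1:
            h = torch.relu(h)
    return h

for _ in range(100):
    scales = log_scales.exp()
    params = [(l.weight * s, l.bias * s) for l, s in zip(layers, scales)]
    flat = [p for pair in params for p in pair]            # [W0, b0, W1, b1]
    loss_before = loss_fn(forward(params), y)
    grads = torch.autograd.grad(loss_before, flat, create_graph=True)
    # Simulate one SGD step on the scaled parameters, then measure the new loss.
    stepped = [(W - inner_lr * gW, b - inner_lr * gb)
               for (W, b), gW, gb in zip(params, grads[0::2], grads[1::2])]
    loss_after = loss_fn(forward(stepped), y)
    grad_scales, = torch.autograd.grad(loss_after, log_scales)
    with torch.no_grad():
        log_scales -= scale_lr * grad_scales               # nudge the layer scales
print(log_scales.exp())                                    # learned per-layer scales
```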
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.