Resolving learning rates adaptively by locating Stochastic Non-Negative
Associated Gradient Projection Points using line searches
- URL: http://arxiv.org/abs/2001.05113v1
- Date: Wed, 15 Jan 2020 03:08:07 GMT
- Title: Resolving learning rates adaptively by locating Stochastic Non-Negative
Associated Gradient Projection Points using line searches
- Authors: Dominic Kafka and Daniel N. Wilke
- Abstract summary: Learning rates in neural network training are currently determined a priori to training using expensive manual or automated tuning.
This study proposes gradient-only line searches to resolve the learning rate for neural network training algorithms.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning rates in stochastic neural network training are currently determined
a priori to training, using expensive manual or automated iterative tuning.
This study proposes gradient-only line searches to resolve the learning rate
for neural network training algorithms. Stochastic sub-sampling during training
decreases computational cost and allows the optimization algorithms to progress
over local minima. However, it also results in discontinuous cost functions.
Minimization line searches are not effective in this context, as they rely on a
vanishing derivative (a first-order optimality condition) that often does not
exist in a discontinuous cost function, and they therefore converge to
discontinuities rather than to the minima of the underlying data trends. Instead, we base
candidate solutions along a search direction purely on gradient information, in
particular by a directional derivative sign change from negative to positive (a
Non-negative Associated Gradient Projection Point (NN-GPP)). Only considering
a sign change from negative to positive always indicates a minimum, thus
NN-GPPs contain second order information. Conversely, a vanishing gradient is
purely a first order condition, which may indicate a minimum, maximum or saddle
point. This insight allows the learning rate of an algorithm to be reliably
resolved as the step size along a search direction, increasing convergence
performance and eliminating an otherwise expensive hyperparameter.
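To make the mechanism concrete, below is a minimal Python sketch of a gradient-only line search that brackets and bisects on the sign of the directional derivative to locate an approximate NN-GPP along a descent direction. This is an illustrative simplification under stated assumptions, not the authors' implementation: the names (`grad_fn`, `alpha_max`, `tol`) are hypothetical, the gradient oracle is treated as deterministic, and the paper's actual handling of resampled mini-batch (stochastic) gradients and bracket growth differs.

```python
import numpy as np

def directional_derivative(grad_fn, x, d, alpha):
    """Directional derivative g(x + alpha*d) . d along the search direction d."""
    return float(np.dot(grad_fn(x + alpha * d), d))

def gradient_only_line_search(grad_fn, x, d, alpha_max=1.0, tol=1e-3, max_iter=50):
    """Resolve the step size as an approximate NN-GPP: the point along d where
    the directional derivative changes sign from negative to positive.
    Deterministic bisection sketch for illustration only."""
    lo, hi = 0.0, alpha_max
    # Grow the bracket until the directional derivative turns non-negative.
    while directional_derivative(grad_fn, x, d, hi) < 0.0 and hi < 1e6:
        lo, hi = hi, 2.0 * hi
    # Bisect on the sign change (negative -> positive).
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if directional_derivative(grad_fn, x, d, mid) < 0.0:
            lo = mid   # still descending along d: the NN-GPP lies further out
        else:
            hi = mid   # non-negative derivative: the NN-GPP lies at or behind mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)  # learning rate resolved as the step size

if __name__ == "__main__":
    # Smoke test on a smooth quadratic f(x) = ||x||^2, so grad(x) = 2x.
    grad_fn = lambda x: 2.0 * x
    x0 = np.array([3.0, -2.0])
    d = -grad_fn(x0)                                   # steepest-descent direction
    print(gradient_only_line_search(grad_fn, x0, d))   # ~0.5 for this quadratic
```

On this convex quadratic the located NN-GPP coincides with the exact line-search minimizer (step size 0.5); the point of the sign-change criterion is that it remains well defined for the discontinuous, mini-batch-sampled loss functions targeted by the paper, where a vanishing derivative often does not exist.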
Related papers
- Efficient line search for optimizing Area Under the ROC Curve in gradient descent [2.094821665776961]
We study the piecewise linear/constant nature of the Area Under Min (AUM) of false positive and false negative rates.
We propose new efficient path-following algorithms for choosing the learning rate which is optimal for each step of descent.
arXiv Detail & Related papers (2024-10-11T08:59:06Z) - An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness [15.656614304616006]
This paper investigates a class of bilevel optimization problems where the upper-level function is nonconvex with unbounded smoothness and the lower-level problem is strongly convex.
These problems have significant applications in data learning, such as text classification using neural networks.
arXiv Detail & Related papers (2024-09-28T02:30:44Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm.
We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions.
In addition, we are able to combine the ordering with mini-batch and further apply it to more complex neural networks.
arXiv Detail & Related papers (2023-06-28T00:46:58Z) - Unsupervised Learning of Initialization in Deep Neural Networks via
Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find good initialization for input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - Sequential Gradient Descent and Quasi-Newton's Method for Change-Point
Analysis [0.348097307252416]
One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points.
This paper introduces a new sequential method (SE) that can be coupled with gradient descent (SeGD) and quasi-Newton's method (SeN) to find the cost value effectively.
arXiv Detail & Related papers (2022-10-21T20:30:26Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Online estimation and control with optimal pathlength regret [52.28457815067461]
A natural goal when designing online learning algorithms is to bound the regret of the algorithm in terms of the temporal variation of the input sequence.
Data-dependent "pathlength" regret bounds have recently been obtained for a wide variety of online learning problems, including OCO and bandits.
arXiv Detail & Related papers (2021-10-24T22:43:15Z) - Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs.
The experiments on several benchmark datasets and neural architectures illustrate that the binary network learned using our method achieves the state-of-the-art accuracy.
arXiv Detail & Related papers (2021-03-01T08:25:26Z) - Convergence rates for gradient descent in the training of
overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
It is still unclear why randomly initialized gradient descent algorithms succeed in training such networks.
arXiv Detail & Related papers (2021-02-23T18:17:47Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.