Random Function Descent
- URL: http://arxiv.org/abs/2305.01377v2
- Date: Mon, 10 Jun 2024 07:57:24 GMT
- Title: Random Function Descent
- Authors: Felix Benning, Leif Döring
- Abstract summary: We use a 'stochastic Taylor approximation' to rediscover gradient descent.
This rediscovery yields a step size schedule we call Random Function Descent (RFD).
We propose a statistical procedure for estimating the RFD step size schedule and validate this theory with a case study on the MNIST dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. We establish a connection between Bayesian Optimization (i.e. average case optimization theory) and classical optimization using a 'stochastic Taylor approximation' to rediscover gradient descent. This rediscovery yields a step size schedule we call Random Function Descent (RFD), which, in contrast to classical derivations, is scale invariant. Furthermore, our analysis of RFD step sizes yields a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup. We finally propose a statistical procedure for estimating the RFD step size schedule and validate this theory with a case study on the MNIST dataset.
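As a rough, hypothetical sketch of the behaviour described above (not the paper's actual RFD schedule, which is derived from a random-function model and estimated statistically), gradient descent with a bounded, gradient-norm-dependent step size reproduces the clipping-like behaviour that RFD is said to explain; the constants `a` and `b` below are placeholders:
```python
import numpy as np

def rfd_like_step_size(grad_norm, a=0.1, b=1.0):
    # Hypothetical schedule: roughly constant for small gradients and
    # clipping-like for large ones (the resulting step length never
    # exceeds `a`). `a` and `b` are placeholder constants, not the
    # quantities the paper estimates.
    return a / np.sqrt(b + grad_norm ** 2)

def descend(grad_fn, x0, n_steps=100):
    # Plain gradient descent driven by the schedule above.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_fn(x)
        x = x - rfd_like_step_size(np.linalg.norm(g)) * g
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_min = descend(lambda x: x, x0=[5.0, -3.0])
```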
Related papers
- ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential-rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the step size $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
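The summary above does not spell out the update rule; one generic way to realize "situational awareness" is to grow the step size $\alpha$ exponentially while steps keep improving the loss and to shrink it after an overshoot. The sketch below illustrates that generic idea only and is not ELRA itself:
```python
import numpy as np

def adaptive_gd(f, grad_f, x0, alpha=1e-2, grow=1.1, shrink=0.5, n_steps=200):
    # Generic exponential step-size adaptation (illustrative, not ELRA):
    # multiply alpha by `grow` after an accepted step and by `shrink`
    # after a rejected one.
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(n_steps):
        candidate = x - alpha * grad_f(x)
        f_candidate = f(candidate)
        if f_candidate < fx:      # step helped: accept it and speed up
            x, fx = candidate, f_candidate
            alpha *= grow
        else:                     # step overshot: reject it and slow down
            alpha *= shrink
    return x

x_min = adaptive_gd(lambda x: np.sum(x ** 2), lambda x: 2 * x, x0=[3.0, -4.0])
```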
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular because they minimize the number of cache misses and exhibit good cache locality.
This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering.
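For context, "shuffle SGD" (random reshuffling) visits every sample once per epoch in a freshly permuted order instead of sampling with replacement; a minimal sketch, with a hypothetical per-sample gradient interface `grad_i(x, i)`:
```python
import numpy as np

def shuffle_sgd(grad_i, x0, n_samples, lr=0.01, n_epochs=10, seed=0):
    # Random-reshuffling SGD: each epoch processes all samples exactly
    # once, in a new random order.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_epochs):
        for i in rng.permutation(n_samples):
            x = x - lr * grad_i(x, i)
    return x

# Toy usage: least squares, with per-sample gradient 2 * a_i * (a_i . x - b_i).
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 3)), rng.standard_normal(100)
x_hat = shuffle_sgd(lambda x, i: 2 * A[i] * (A[i] @ x - b[i]), np.zeros(3), n_samples=100)
```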
arXiv Detail & Related papers (2023-05-30T17:47:27Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experiments suggest that clipped training exhibits a fairly special behavior that improves the order of complexity.
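Gradient clipping itself is standard: rescale the gradient whenever its norm exceeds a threshold. A minimal, framework-agnostic sketch (not this paper's variance-reduced variant):
```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    # Rescale g so that its Euclidean norm is at most max_norm.
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

def clipped_sgd_step(x, grad, lr=0.1, max_norm=1.0):
    # One SGD step using the clipped gradient.
    return x - lr * clip_by_norm(np.asarray(grad, dtype=float), max_norm)
```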
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Deterministic Nonsmooth Nonconvex Optimization [94.01526844386977]
We show that randomization is necessary to obtain a dimension-free algorithm.
Our algorithm yields the first deterministic dimension-free algorithm for optimizing ReLU networks.
arXiv Detail & Related papers (2023-02-16T13:57:19Z) - Target-based Surrogates for Stochastic Optimization [26.35752393302125]
We consider minimizing functions for which it is expensive to compute the (possibly stochastic) gradient.
Such functions are prevalent in reinforcement learning, imitation learning, and adversarial training.
Our framework allows the use of standard optimization algorithms to construct surrogates which can be minimized efficiently.
arXiv Detail & Related papers (2023-02-06T08:08:34Z) - Finding Global Minima via Kernel Approximations [90.42048080064849]
We consider the global minimization of smooth functions based solely on function evaluations.
In this paper, we consider an approach that jointly models the function to be approximated and finds its global minimum.
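The basic mechanism can be pictured as fitting a kernel surrogate to the observed evaluations and then minimizing that surrogate. The sketch below uses plain kernel ridge regression and a candidate grid, a much cruder stand-in than the paper's joint modelling approach:
```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def kernel_surrogate_minimum(X, y, candidates, reg=1e-6):
    # Fit kernel ridge regression to the evaluations (X, y) and return
    # the candidate with the smallest predicted value (illustrative only).
    alpha = np.linalg.solve(rbf_kernel(X, X) + reg * np.eye(len(X)), y)
    preds = rbf_kernel(candidates, X) @ alpha
    return candidates[np.argmin(preds)]

# Toy usage: locate the minimum of f(x) = ||x||^2 from 30 random evaluations.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 2))
y = (X ** 2).sum(axis=1)
x_best = kernel_surrogate_minimum(X, y, rng.uniform(-2, 2, size=(500, 2)))
```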
arXiv Detail & Related papers (2020-12-22T12:59:30Z) - Non-local Optimization: Imposing Structure on Optimization Problems by
Relaxation [0.0]
In evolutionary computation and reinforcement learning, the optimization of a function $f: \Omega \to \mathbb{R}$ is often addressed through optimizing a so-called relaxation $\theta \in \Theta \mapsto \mathbb{E}_\theta(f)$ of $f$.
We investigate the structure of such relaxations by means of measure theory and Fourier analysis, enabling us to shed light on the success of many associated optimization methods.
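One concrete way to optimize such a relaxation, assuming $\Theta$ parametrizes Gaussian means with a fixed scale, is the score-function (evolution-strategies style) gradient estimator; a minimal sketch of that standard estimator, not of the paper's analysis:
```python
import numpy as np

def optimize_relaxation(f, theta0, sigma=0.5, lr=0.05, n_samples=64, n_iters=200, seed=0):
    # Minimize theta -> E_{x ~ N(theta, sigma^2 I)}[f(x)] using the
    # score-function gradient estimator E[f(theta + sigma * eps) * eps] / sigma,
    # with the sample mean of f subtracted as a variance-reducing baseline.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        eps = rng.standard_normal((n_samples, theta.size))
        fx = np.array([f(theta + sigma * e) for e in eps])
        grad = ((fx - fx.mean())[:, None] * eps).mean(axis=0) / sigma
        theta = theta - lr * grad
    return theta

# Toy usage: the relaxed objective is minimized near the minimizer of f.
theta_opt = optimize_relaxation(lambda x: np.sum((x - 3.0) ** 2), theta0=np.zeros(2))
```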
arXiv Detail & Related papers (2020-11-11T20:45:47Z) - SGB: Stochastic Gradient Bound Method for Optimizing Partition Functions [15.33098084159285]
This paper addresses the problem of optimizing partition functions in a learning setting.
We propose a variant of the bound majorization algorithm that relies on upper-bounding the partition function with a quadratic surrogate.
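To make the quadratic-surrogate idea concrete: the log-partition (log-sum-exp) function has Hessian $\mathrm{diag}(p) - pp^\top \preceq I$, so it admits a quadratic upper bound whose exact minimization gives a simple majorize-minimize update. The sketch below applies this generic bound to a toy objective; it is not the paper's sharper surrogate or learning setup:
```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def mm_step(theta, s):
    # One majorize-minimize step for f(theta) = logsumexp(theta) - theta . s.
    # Since the Hessian of logsumexp is diag(p) - p p^T with eigenvalues <= 1,
    # f(t) <= f(theta) + grad(theta) . (t - theta) + 0.5 * ||t - theta||^2,
    # and minimizing this quadratic bound is a unit gradient step.
    return theta - (softmax(theta) - s)

# Toy usage: drive softmax(theta) toward the target statistics s.
s = np.array([0.5, 0.3, 0.2])
theta = np.zeros(3)
for _ in range(200):
    theta = mm_step(theta, s)
```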
arXiv Detail & Related papers (2020-11-03T04:42:51Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we study gradient-free minimax optimization in the nonconvex-strongly-concave setting.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
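"Zeroth-order" means gradients are estimated from function values alone; the standard two-point (random-direction) estimator below is a generic building block of such methods, not the paper's full variance-reduced algorithm:
```python
import numpy as np

def two_point_gradient(f, x, mu=1e-3, n_directions=20, seed=0):
    # Estimate grad f(x) from function evaluations only, by averaging
    # directional finite differences along random Gaussian directions.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for _ in range(n_directions):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_directions

g_est = two_point_gradient(lambda x: np.sum(x ** 2), np.array([1.0, 2.0]))
```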
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Incorporating Expert Prior in Bayesian Optimisation via Space Warping [54.412024556499254]
In big search spaces, the algorithm goes through several low function value regions before reaching the optimum of the function.
One approach to shorten this cold start phase is to use prior knowledge that can accelerate the optimisation.
In this paper, we represent the prior knowledge about the function optimum through a prior distribution.
The prior distribution is then used to warp the search space so that it expands around the high-probability region of the function optimum and shrinks around the low-probability regions.
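One simple way to picture such a warp in one dimension is the probability integral transform: optimise over $u \in (0, 1)$ and map it through the inverse CDF of the prior on the optimum's location, so that evenly spread values of $u$ concentrate the search where the prior puts mass. This is a simplified stand-in under a Gaussian-prior assumption, not the paper's construction:
```python
import numpy as np
from scipy.stats import norm

def warp(u, prior_mean=0.0, prior_std=1.0):
    # Map u in (0, 1) to x via the inverse CDF of a Gaussian prior on the
    # optimum's location: high-prior regions of x occupy a larger share of
    # u-space, so a search that covers u evenly samples them more densely.
    return norm.ppf(u, loc=prior_mean, scale=prior_std)

# Evenly spaced u-values cluster (in x) around the prior mean:
xs = warp(np.linspace(0.01, 0.99, 9), prior_mean=2.0, prior_std=0.5)
```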
arXiv Detail & Related papers (2020-03-27T06:18:49Z)