Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models
via Reparameterisation and Smoothing
- URL: http://arxiv.org/abs/2402.11752v2
- Date: Tue, 20 Feb 2024 02:58:38 GMT
- Title: Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models
via Reparameterisation and Smoothing
- Authors: Dominik Wagner, Basim Khajwal, C.-H. Luke Ong
- Abstract summary: We introduce a simple framework to define non-differentiable functions piecewisely and present a systematic approach to obtain smoothings.
Our main contribution is a novel variant of SGD, Diagonalisation Gradient Descent, which progressively enhances the accuracy of the smoothed approximation.
Our approach is simple, fast stable and attains orders of magnitude reduction in work-normalised variance.
- Score: 1.6114012813668932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well-known that the reparameterisation gradient estimator, which
exhibits low variance in practice, is biased for non-differentiable models.
This may compromise correctness of gradient-based optimisation methods such as
stochastic gradient descent (SGD). We introduce a simple syntactic framework to
define non-differentiable functions piecewisely and present a systematic
approach to obtain smoothings for which the reparameterisation gradient
estimator is unbiased. Our main contribution is a novel variant of SGD,
Diagonalisation Stochastic Gradient Descent, which progressively enhances the
accuracy of the smoothed approximation during optimisation, and we prove
convergence to stationary points of the unsmoothed (original) objective. Our
empirical evaluation reveals benefits over the state of the art: our approach
is simple, fast, stable and attains orders of magnitude reduction in
work-normalised variance.
Related papers
- Directional Smoothness and Gradient Methods: Convergence and Adaptivity [16.779513676120096]
We develop new sub-optimality bounds for gradient descent that depend on the conditioning of the objective along the path of optimization.
Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective.
We prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness.
arXiv Detail & Related papers (2024-03-06T22:24:05Z) - Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation [0.8192907805418583]
We show that biased gradients converge to critical points for smooth non- functions.
We show how the effect of bias can be reduced by appropriate tuning.
arXiv Detail & Related papers (2024-02-05T10:17:36Z) - Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent [0.3831327965422187]
This paper proposes a novel approach to adaptive step sizes in gradient descent (SGD)
We use quantities that we have identified as numerically traceable -- the Lipschitz constant for gradients and a concept of the local variance in search directions.
arXiv Detail & Related papers (2023-11-28T17:03:56Z) - Fast and Correct Gradient-Based Optimisation for Probabilistic
Programming via Smoothing [0.0]
We study the foundations of variational inference, which frames posterior inference as an optimisation problem.
We endow our language with both a measurable and a smoothed (approximate) value semantics.
We show that our approach has a similar convergence as a key competitor, but is simpler, faster, and attains orders of magnitude reduction in work-normalised variance.
arXiv Detail & Related papers (2023-01-09T15:12:45Z) - The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded
Gradients and Affine Variance [46.15915820243487]
We show that AdaGrad-Norm exhibits an order optimal convergence of $mathcalOleft.
We show that AdaGrad-Norm exhibits an order optimal convergence of $mathcalOleft.
arXiv Detail & Related papers (2022-02-11T17:37:54Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and or binary weights the training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient
Clipping [69.9674326582747]
We propose a new accelerated first-order method called clipped-SSTM for smooth convex optimization with heavy-tailed distributed noise in gradients.
We prove new complexity that outperform state-of-the-art results in this case.
We derive the first non-trivial high-probability complexity bounds for SGD with clipping without light-tails assumption on the noise.
arXiv Detail & Related papers (2020-05-21T17:05:27Z) - On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization [80.03647903934723]
We prove adaptive gradient methods in expectation of gradient convergence methods.
Our analyses shed light on better adaptive gradient methods in optimizing non understanding gradient bounds.
arXiv Detail & Related papers (2018-08-16T20:25:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.