Derivatives of Stochastic Gradient Descent in parametric optimization
- URL: http://arxiv.org/abs/2405.15894v2
- Date: Wed, 20 Nov 2024 09:04:29 GMT
- Title: Derivatives of Stochastic Gradient Descent in parametric optimization
- Authors: Franck Iutzeler, Edouard Pauwels, Samuel Vaiter,
- Abstract summary: We investigate the behavior of the derivatives of the iterates of Gradient Descent (SGD)
We show that they are driven by an inexact SGD on a different objective function, perturbed by the convergence of the original SGD.
Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(log(k)2 / k)$ convergence rates.
- Score: 16.90974792716146
- License:
- Abstract: We consider stochastic optimization problems where the objective depends on some parameter, as commonly found in hyperparameter optimization for instance. We investigate the behavior of the derivatives of the iterates of Stochastic Gradient Descent (SGD) with respect to that parameter and show that they are driven by an inexact SGD recursion on a different objective function, perturbed by the convergence of the original SGD. This enables us to establish that the derivatives of SGD converge to the derivative of the solution mapping in terms of mean squared error whenever the objective is strongly convex. Specifically, we demonstrate that with constant step-sizes, these derivatives stabilize within a noise ball centered at the solution derivative, and that with vanishing step-sizes they exhibit $O(\log(k)^2 / k)$ convergence rates. Additionally, we prove exponential convergence in the interpolation regime. Our theoretical findings are illustrated by numerical experiments on synthetic tasks.
Related papers
- Optimal estimators of cross-partial derivatives and surrogates of functions [0.0]
This paper introduces surrogates of all the cross-partial derivatives of functions by evaluating such functions at $N$ randomized points.
The associated estimators, based on $NL$ model runs, reach the optimal rates of convergence.
Such results are used for i) computing the main and upper-bounds of sensitivity indices, and ii) deriving emulators of simulators or surrogates of functions thanks to the derivative-based ANOVA Simulations.
arXiv Detail & Related papers (2024-07-05T03:39:06Z) - Nonsmooth Implicit Differentiation: Deterministic and Stochastic Convergence Rates [34.81849268839475]
We study the problem of efficiently computing the derivative of the fixed-point of a parametric nondifferentiable contraction map.
We analyze two popular approaches: iterative differentiation (ITD) and approximate implicit differentiation (AID)
We establish rates for the convergence of NSID, encompassing the best available rates in the smooth setting.
arXiv Detail & Related papers (2024-03-18T11:37:53Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $mathcalO(K-(alpha - 1)/alpha)$, when the gradients have finite moments of order.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Curvature-Independent Last-Iterate Convergence for Games on Riemannian
Manifolds [77.4346324549323]
We show that a step size agnostic to the curvature of the manifold achieves a curvature-independent and linear last-iterate convergence rate.
To the best of our knowledge, the possibility of curvature-independent rates and/or last-iterate convergence has not been considered before.
arXiv Detail & Related papers (2023-06-29T01:20:44Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Single Trajectory Nonparametric Learning of Nonlinear Dynamics [8.438421942654292]
Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE)
We leverage recently developed information-theoretic methods to establish the optimality of the LSE for non hypotheses classes.
We specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS)
arXiv Detail & Related papers (2022-02-16T19:38:54Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD)
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - On the Estimation of Derivatives Using Plug-in Kernel Ridge Regression
Estimators [4.392844455327199]
We propose a simple plug-in kernel ridge regression (KRR) estimator in nonparametric regression.
We provide a non-asymotic analysis to study the behavior of the proposed estimator in a unified manner.
The proposed estimator achieves the optimal rate of convergence with the same choice of tuning parameter for any order of derivatives.
arXiv Detail & Related papers (2020-06-02T02:32:39Z) - Convergence rates and approximation results for SGD and its
continuous-time counterpart [16.70533901524849]
This paper proposes a thorough theoretical analysis of convex Gradient Descent (SGD) with non-increasing step sizes.
First, we show that the SGD can be provably approximated by solutions of inhomogeneous Differential Equation (SDE) using coupling.
Recent analyses of deterministic and optimization methods by their continuous counterpart, we study the long-time behavior of the continuous processes at hand and non-asymptotic bounds.
arXiv Detail & Related papers (2020-04-08T18:31:34Z) - SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for
Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features.
We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z) - Support recovery and sup-norm convergence rates for sparse pivotal
estimation [79.13844065776928]
In high dimensional sparse regression, pivotal estimators are estimators for which the optimal regularization parameter is independent of the noise level.
We show minimax sup-norm convergence rates for non smoothed and smoothed, single task and multitask square-root Lasso-type estimators.
arXiv Detail & Related papers (2020-01-15T16:11:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.