Corner Gradient Descent
- URL: http://arxiv.org/abs/2504.12519v1
- Date: Wed, 16 Apr 2025 22:39:41 GMT
- Title: Corner Gradient Descent
- Authors: Dmitry Yarotsky
- Abstract summary: We show that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory. We show that ideal corner algorithms can be efficiently approximated by finite-memory algorithms.
- Score: 13.794391803767617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider SGD-type optimization on infinite-dimensional quadratic problems with power-law spectral conditions. It is well-known that on such problems deterministic GD has loss convergence rates $L_t=O(t^{-\zeta})$, which can be improved to $L_t=O(t^{-2\zeta})$ by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly, no $O(t^{-2\zeta})$ algorithm is known. In this paper we show that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours that have a corner with external angle $\theta\pi$ accelerate the plain GD rate $O(t^{-\zeta})$ to $O(t^{-\theta\zeta})$. For deterministic GD, increasing $\theta$ makes it possible to achieve rates arbitrarily close to $O(t^{-2\zeta})$. However, in Stochastic GD, increasing $\theta$ also amplifies the sampling noise, so in general $\theta$ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is given by $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$, where $\nu,\zeta$ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of the power functions, we show that ideal corner algorithms can be efficiently approximated by finite-memory algorithms, and demonstrate their practical efficiency on a synthetic problem and MNIST.
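To make the setting concrete, below is a minimal Python sketch of the kind of problem the abstract describes: a truncated diagonal quadratic with power-law spectrum optimized by plain mini-batch SGD, i.e. the $O(t^{-\zeta})$ baseline that the corner algorithms are designed to accelerate. This is not the paper's corner algorithm; the dimension, exponents $\nu$ and $\zeta$, the coefficient parameterization, batch size, and learning rate are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of the power-law quadratic setting (not the corner
# algorithm): eigenvalues lambda_k ~ k^{-nu} (capacity condition) and target
# coefficients decaying so that the loss decays roughly like t^{-zeta} under
# plain (S)GD. All concrete numbers below are assumptions for illustration.

rng = np.random.default_rng(0)
d, nu, zeta = 1000, 1.5, 1.0
k = np.arange(1, d + 1, dtype=float)
lam = k ** (-nu)                          # spectrum: lambda_k = k^{-nu}
alpha = ((zeta - 1.0) * nu + 1.0) / 2.0   # one common way to realize the
w_star = k ** (-alpha)                    # source condition with exponent zeta
w = np.zeros(d)
batch, lr, steps = 8, 0.25, 10000

for t in range(1, steps + 1):
    # Features with E[x x^T] = diag(lam); mini-batch sampling injects the
    # noise that makes non-stationary schedules such as Jacobi HB diverge.
    x = rng.standard_normal((batch, d)) * np.sqrt(lam)
    grad = x.T @ (x @ (w - w_star)) / batch   # stochastic gradient
    w -= lr * grad
    if t % 2000 == 0:
        loss = 0.5 * np.sum(lam * (w - w_star) ** 2)   # population loss
        print(f"t={t:5d}  L_t={loss:.3e}")
```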
Related papers
- Global Optimization with A Power-Transformed Objective and Gaussian Smoothing [4.275224221939364]
We show that our method converges to a solution in the $\delta$-neighborhood of $f$'s global optimum point. The convergence rate is $O(d^2\sigma^4\varepsilon^{-2})$, which is faster than both the standard and single-loop homotopy methods.
arXiv Detail & Related papers (2024-12-06T17:33:43Z) - SGD with memory: fundamental properties and stochastic acceleration [15.25021403154845]
In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t\sim C_L t^{-\xi}$ is double that of plain GD. We show that in memory-1 algorithms we can make $C_L$ arbitrarily small while maintaining stability.
arXiv Detail & Related papers (2024-10-05T16:57:40Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD [38.221784575853796]
This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle.
We develop a technique that allows us to prove $\mathcal{O}\big(\frac{\mathrm{polylog}(T)\,\sigma}{\sqrt{T}}\big)$ convergence rates without assuming uniform bounds on the noise.
arXiv Detail & Related papers (2023-02-13T18:13:36Z) - Mind the gap: Achieving a super-Grover quantum speedup by jumping to the
end [114.3957763744719]
We present a quantum algorithm that has rigorous runtime guarantees for several families of binary optimization problems.
We show that the algorithm finds the optimal solution in time $O^*(2^{(0.5-c)n})$ for an $n$-independent constant $c$.
We also show that for a large fraction of random instances from the $k$-spin model and for any fully satisfiable or slightly frustrated $k$-CSP formula, statement (a) is the case.
arXiv Detail & Related papers (2022-12-03T02:45:23Z) - Differentially Private Stochastic Gradient Descent with Low-Noise [49.981789906200035]
Modern machine learning algorithms aim to extract fine-grained information from data to provide accurate predictions, which often conflicts with the goal of privacy protection.
This paper addresses the practical and theoretical importance of developing privacy-preserving machine learning algorithms that ensure good performance while preserving privacy.
arXiv Detail & Related papers (2022-09-09T08:54:13Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for an arbitrary delay $d_t$.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $\mathcal{O}(N^{-\sqrt{N}})$.
arXiv Detail & Related papers (2021-05-10T13:07:44Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that, for both problems, for dimensions of $\frac{1}{\lambda\epsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)