Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation
- URL: http://arxiv.org/abs/2504.01898v1
- Date: Wed, 02 Apr 2025 16:57:39 GMT
- Title: Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation
- Authors: Robert M. Gower, Guillaume Garrigos, Nicolas Loizou, Dimitris Oikonomou, Konstantin Mishchenko, Fabian Schaipp,
- Abstract summary: We provide a general convergence theorem of an idealized Polyak step size called SPS$*$.<n>We refer to SPS$*$ as idealized because it requires access to the loss for every training batch evaluated at a solution.<n>It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz function, and is the first Polyak step size to have an $O(1/sqrtt)$ anytime convergence in the smooth setting.
- Score: 17.943901563004275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide a general convergence theorem of an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we only assume a local expected gradient bound, that includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz function, and is the first Polyak step size to have an $O(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
Related papers
- Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization [10.913566070767596]
We study the problem of (fin sum) minimax optimization in the Differential Privacy (DP) model.<n>We show that it is possible to get an estimator whose Descent $l$-norm of the empirical risk function is upper bounded by $tO(n)(n)$, whered is the sample size.
arXiv Detail & Related papers (2025-03-24T03:51:27Z) - Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L0)$-smooth functions and derive new convergence guarantees for several existing methods.<n>In particular, we derive improved convergence rates for Gradient Descent with smoothnessed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Asymptotics of Language Model Alignment [27.37118975691123]
We show that the optimal KL-constrained RL solution satisfies a large deviation principle.
We also show that the rate of growth of the scaled cumulants of the reward is characterized by proper Renyi cross entropy.
arXiv Detail & Related papers (2024-04-02T08:40:07Z) - Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly non-color functions.
We generalize the $smallsfcolorgreen$ so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of optimal bound in the smooth non-color regime.
arXiv Detail & Related papers (2022-06-05T21:32:33Z) - KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal [70.15267479220691]
We consider and analyze the sample complexity of model reinforcement learning with a generative variance-free model.
Our analysis shows that it is nearly minimax-optimal for finding an $varepsilon$-optimal policy when $varepsilon$ is sufficiently small.
arXiv Detail & Related papers (2022-05-27T19:39:24Z) - Provably Efficient Convergence of Primal-Dual Actor-Critic with
Nonlinear Function Approximation [15.319335698574932]
We show the first efficient convergence result with primal-dual actor-critic with a convergence of $mathcalOleft ascent(Nright)Nright)$ under Polyian sampling.
Results on Open GymAI continuous control tasks.
arXiv Detail & Related papers (2022-02-28T15:16:23Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non- Hilbert optimization using first-order algorithms for which the gradient estimates may have tails.
We show that a combination of gradient, momentum, and normalized gradient descent convergence to critical points in high-probability with best-known iteration for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP)
We show that the number of samples needed to yield an entrywise $varepsilon$-accurate estimate of the Q-function is at most on the order of $frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Better Theory for SGD in the Nonconvex World [2.6397379133308214]
Large-scale non optimization problems are ubiquitous in modern machine learning.
We perform experiments on the effects of a wide array of synthetic minibatch sizes on the Gradient Descent (SG) problem.
arXiv Detail & Related papers (2020-02-09T09:56:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.