Function Value Learning: Adaptive Learning Rates Based on the Polyak
Stepsize and Function Splitting in ERM
- URL: http://arxiv.org/abs/2307.14528v1
- Date: Wed, 26 Jul 2023 22:12:31 GMT
- Title: Function Value Learning: Adaptive Learning Rates Based on the Polyak
Stepsize and Function Splitting in ERM
- Authors: Guillaume Garrigos, Robert M. Gower, Fabian Schaipp
- Abstract summary: We focus on solving a finite sum-of-terms problem, also known as empirical risk minimization.
We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values.
We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned.
- Score: 6.542289202349586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Here we develop variants of SGD (stochastic gradient descent) with an
adaptive step size that make use of the sampled loss values. In particular, we
focus on solving a finite sum-of-terms problem, also known as empirical risk
minimization. We first detail an idealized adaptive method called
$\texttt{SPS}_+$ that makes use of the sampled loss values and assumes
knowledge of the sampled loss at optimality. This $\texttt{SPS}_+$ is a minor
modification of the SPS (Stochastic Polyak Stepsize) method, where the step
size is enforced to be positive. We then show that $\texttt{SPS}_+$ achieves
the best known rates of convergence for SGD in the Lipschitz non-smooth
setting. We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where
the loss values at optimality are gradually learned, as opposed to being given.
We give three viewpoints of $\texttt{FUVAL}$, as a projection based method, as
a variant of the prox-linear method, and then as a particular online SGD
method. We then present a convergence analysis of $\texttt{FUVAL}$ and
experimental results. One shortcoming of our work is that the convergence
analysis of $\texttt{FUVAL}$ shows no advantage over SGD. Another shortcoming
is that currently only the full-batch version of $\texttt{FUVAL}$ shows a minor
advantage over GD (Gradient Descent) in terms of sensitivity to the step size.
The stochastic version shows no clear advantage over SGD. We conjecture that
large mini-batches are required to make $\texttt{FUVAL}$ competitive.
Currently the new $\texttt{FUVAL}$ method studied in this paper does not
offer any clear theoretical or practical advantage. We have chosen to make this
draft available online nonetheless because of some of the analysis techniques
we use, such as the non-smooth analysis of $\texttt{SPS}_+$, and also to show
an apparently interesting approach that currently does not work.
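As a concrete illustration of the step-size rule described above, here is a minimal Python sketch of an $\texttt{SPS}_+$-style update: a stochastic Polyak step whose numerator is clipped at zero so the resulting step size stays positive. The function name, the cap `gamma_max`, and the zero-gradient guard are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import numpy as np

def sps_plus_step(x, grad_i, loss_i, loss_i_star, gamma_max=1.0):
    """One SGD step with an SPS_+-style adaptive step size (illustrative sketch).

    x           : current iterate (numpy array)
    grad_i      : gradient of the sampled loss f_i at x
    loss_i      : sampled loss value f_i(x)
    loss_i_star : loss value of f_i at optimality, assumed known as in SPS_+
    gamma_max   : step-size cap, an assumed safeguard not taken from the paper
    """
    grad_norm_sq = float(np.dot(grad_i, grad_i))
    if grad_norm_sq == 0.0:
        return x  # this sample's gradient vanishes, so no step is taken
    # Polyak-type step size; the positive part keeps the step size non-negative.
    gamma = max(loss_i - loss_i_star, 0.0) / grad_norm_sq
    gamma = min(gamma, gamma_max)
    return x - gamma * grad_i
```

$\texttt{FUVAL}$ would replace the oracle value `loss_i_star` with an estimate that is gradually learned during training, which the paper analyzes from projection, prox-linear, and online-SGD viewpoints.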
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Two Sides of One Coin: the Limits of Untuned SGD and the Power of
Adaptive Methods [22.052459124774504]
We show that adaptive methods alleviate the issues of untuned SGD, enjoying both a smoothness advantage and an information advantage.
Our results provide theoretical justification for adaptive methods over untuned SGD in the absence of such exponential dependency.
arXiv Detail & Related papers (2023-05-21T14:40:43Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization [9.019243171993553]
Training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show better rates can be attained but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed-confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly non-convex functions.
We generalize the method so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of the optimal bound in the smooth non-convex regime.
arXiv Detail & Related papers (2022-06-05T21:32:33Z) - A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm [3.2958527541557525]
Such problems arise frequently in machine learning in the context of robust empirical risk minimization.
We consider the stochastic accelerated primal-dual (SAPD) algorithm as a robust method against gradient noise.
We show that our method improves upon SAPD both in practice and in theory.
arXiv Detail & Related papers (2022-02-19T22:12:30Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to the noise.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax
Optimization [104.96004056928474]
We propose a class of faster adaptive gradient descent ascent methods for nonconvex-strongly-concave minimax problems.
We show that our method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with the mini-batch size $O(\kappa)$.
arXiv Detail & Related papers (2021-06-30T14:47:09Z)