The Dimension Strikes Back with Gradients: Generalization of Gradient
Methods in Stochastic Convex Optimization
- URL: http://arxiv.org/abs/2401.12058v1
- Date: Mon, 22 Jan 2024 15:50:32 GMT
- Title: The Dimension Strikes Back with Gradients: Generalization of Gradient
Methods in Stochastic Convex Optimization
- Authors: Matan Schliserman and Uri Sherman and Tomer Koren
- Abstract summary: We study the generalization performance of gradient methods in the fundamental convex optimization setting.
We show that an application of the same construction technique provides a similar $\Omega(\sqrt{d})$ lower bound for the sample complexity of SGD to reach a non-trivial empirical error.
- Score: 30.26365073195728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the generalization performance of gradient methods in the
fundamental stochastic convex optimization setting, focusing on its dimension
dependence. First, for full-batch gradient descent (GD) we give a construction
of a learning problem in dimension $d=O(n^2)$, where the canonical version of
GD (tuned for optimal performance of the empirical risk) trained with $n$
training examples converges, with constant probability, to an approximate
empirical risk minimizer with $\Omega(1)$ population excess risk. Our bound
translates to a lower bound of $\Omega (\sqrt{d})$ on the number of training
examples required for standard GD to reach a non-trivial test error, answering
an open question raised by Feldman (2016) and Amir, Koren, and Livni (2021b)
and showing that a non-trivial dimension dependence is unavoidable.
Furthermore, for standard one-pass stochastic gradient descent (SGD), we show
that an application of the same construction technique provides a similar
$\Omega(\sqrt{d})$ lower bound for the sample complexity of SGD to reach a
non-trivial empirical error, despite achieving optimal test performance. This
again provides an exponential improvement in the dimension dependence compared
to previous work (Koren, Livni, Mansour, and Sherman, 2022), resolving an open
question left therein.
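The abstract contrasts two algorithms: full-batch GD, which repeatedly descends the empirical risk over all $n$ samples, and one-pass SGD, which takes a single step per fresh sample. A minimal sketch of that distinction on a simple stochastic convex problem (least-squares regression, chosen here for illustration and unrelated to the paper's lower-bound construction; all names and constants are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
w_star = rng.normal(size=d)                # ground-truth parameter
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)  # noisy labels

def empirical_risk(w):
    """Average squared loss over the n training examples."""
    return 0.5 * float(np.mean((X @ w - y) ** 2))

eta = 0.1  # step size, illustrative only

# Full-batch GD: every step uses the gradient of the full empirical risk,
# so it is "tuned for optimal performance of the empirical risk".
w_gd = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w_gd - y) / n
    w_gd -= eta * grad

# One-pass SGD: each of the n samples is touched exactly once,
# so each gradient is an unbiased estimate of the population gradient.
w_sgd = np.zeros(d)
for i in range(n):
    xi, yi = X[i], y[i]
    w_sgd -= eta * (xi @ w_sgd - yi) * xi

print("GD empirical risk: ", empirical_risk(w_gd))
print("SGD empirical risk:", empirical_risk(w_sgd))
```

On this benign problem both methods behave well; the paper's point is that in dimension as large as $d = O(n^2)$ one can construct convex problems where GD minimizes the empirical risk yet generalizes poorly, and where one-pass SGD generalizes well yet leaves a non-trivial empirical error.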
Related papers
- Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions.
Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Restricted Strong Convexity of Deep Learning Models with Smooth
Activations [31.003601717265006]
We study the problem of optimization of deep learning models with smooth activation functions.
We introduce a new analysis of optimization based on Restricted Strong Convexity (RSC).
Ours is the first result establishing geometric convergence of GD for deep learning models based on RSC.
arXiv Detail & Related papers (2022-09-29T21:24:26Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
Yet, the theoretical understanding of even the "vanilla" PG method remains limited.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - On Stochastic Moving-Average Estimators for Non-Convex Optimization [105.22760323075008]
In this paper, we demonstrate the power of a widely used stochastic estimator based on moving averages (SEMA) for non-convex optimization.
Building on the SEMA estimator, we present improved results for several problems.
arXiv Detail & Related papers (2021-04-30T08:50:24Z) - SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z) - Robust High Dimensional Expectation Maximization Algorithm via Trimmed
Hard Thresholding [24.184520829631587]
We study the problem of estimating latent variable models with arbitrarily corrupted samples in high dimensional space.
We propose a method called Trimmed (Gradient) Expectation Maximization which adds a trimming gradient step.
We show that the algorithm is robust to corruption and converges to the (near-)optimal statistical rate geometrically.
arXiv Detail & Related papers (2020-10-19T15:00:35Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate the empirical advantages of our algorithms over standard gradient methods on two popular deep learning tasks.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD) in non-convex problems.
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a broad range of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Better Theory for SGD in the Nonconvex World [2.6397379133308214]
Large-scale non-convex optimization problems are ubiquitous in modern machine learning.
We perform experiments on synthetic problems studying the effect of a wide array of minibatch sizes on stochastic gradient descent (SGD).
arXiv Detail & Related papers (2020-02-09T09:56:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.