Related papers: The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

The Sample Complexity of Gradient Descent in Stochastic Convex Optimization

URL: http://arxiv.org/abs/2404.04931v2
Date: Thu, 11 Apr 2024 02:32:43 GMT
Title: The Sample Complexity of Gradient Descent in Stochastic Convex Optimization
Authors: Roi Livni,
Abstract summary: We show that the generalization error of full-batch Gradient Descent can be $tilde Theta(d/m + 1/sqrtm)$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of emphworst-case empirical risk minimizers.
Score: 14.268363583731848
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We analyze the sample complexity of full-batch Gradient Descent (GD) in the setup of non-smooth Stochastic Convex Optimization. We show that the generalization error of GD, with common choice of hyper-parameters, can be $\tilde \Theta(d/m + 1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of \emph{worst-case} empirical risk minimizers. That means that, in contrast with other algorithms, GD has no advantage over naive ERMs. Our bound follows from a new generalization bound that depends on both the dimension as well as the learning rate and number of iterations. Our bound also shows that, for general hyper-parameters, when the dimension is strictly larger than number of samples, $T=\Omega(1/\epsilon^4)$ iterations are necessary to avoid overfitting. This resolves an open problem by Schlisserman et al.23 and Amir er Al.21, and improves over previous lower bounds that demonstrated that the sample size must be at least square root of the dimension.

Related papers

Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization [85.91302339486673]
We study robust Markov decision processes (RMDPs) with general policy parameterization under s-rectangular and non-rectangular uncertainty sets.<n>We prove novel Lipschitz and Lipschitz-smoothness properties for general policy parameterizations that extends to infinite state spaces.<n>We design a projected gradient descent algorithm for s-rectangular uncertainty and a Frank-Wolfe algorithm for non-rectangular uncertainty.
arXiv Detail & Related papers (2026-02-11T21:44:20Z)
All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension [14.982451024975733]
We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit.<n>We provide a novel generalization lower bound of $left(sqrtT/m1.5right)$ for Gradient Descent, where $$ is the learning rate, $T$ is the horizon and $m$ is the sample size.
arXiv Detail & Related papers (2026-02-09T07:33:01Z)
Parameter-free Algorithms for the Stochastically Extended Adversarial Model [59.81852138768642]
Existing approaches for the Extended Adversarial (SEA) model require prior knowledge of problem-specific parameters, such as the diameter of the domain $D$ and the Lipschitz constant of the loss functions $G$.<n>We develop parameter-free methods by leveraging the Optimistic Online Newton Step (OONS) algorithm to eliminate the need for these parameters.
arXiv Detail & Related papers (2025-10-06T10:53:37Z)
Group Distributionally Robust Optimization with Flexible Sample Queries [41.4457693520265]
Group distributionally robust optimization (GDRO) aims to develop models that perform well across $m$ distributions simultaneously.<n>Existing GDRO algorithms can only process a fixed number of samples per iteration, either 1 or $m$.<n>We develop a GDRO algorithm that allows an arbitrary and varying sample size per round, achieving a high-probability optimization error bound of $Oleft(frac1tsqrtsum_j=1t fracmr_jlog mright)$.
arXiv Detail & Related papers (2025-05-21T07:41:16Z)
Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness [14.98493572536424]
Recent results in non-stationary optimization demonstrate the convergence of popular adaptive algorithms.<n> complexity of convergence is a higher-order in terms of problem parameters like smoothness constants.
arXiv Detail & Related papers (2025-05-07T17:40:12Z)
Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $mathRd suffer from a curse of dimensionality. We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z)
Precise Asymptotics of Bagging Regularized M-estimators [5.165142221427928]
We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators. Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples. Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
arXiv Detail & Related papers (2024-09-23T17:48:28Z)
Stochastic First-Order Methods with Non-smooth and Non-Euclidean Proximal Terms for Nonconvex High-Dimensional Stochastic Optimization [2.0657831823662574]
When the non problem is by which the non problem is by whichity, the sample of first-order methods may depend linearly on the problem dimension, is for undesirable problems. Our algorithms allow for the estimate of complexity using the distance of. mathO (log d) / EuM4. We prove that DISFOM can sharpen variance employing $mathO (log d) / EuM4.
arXiv Detail & Related papers (2024-06-27T18:38:42Z)
Measuring quantum relative entropy with finite-size effect [53.64687146666141]
We study the estimation of relative entropy $D(rho|sigma)$ when $sigma$ is known. Our estimator attains the Cram'er-Rao type bound when the dimension $d$ is fixed.
arXiv Detail & Related papers (2024-06-25T06:07:20Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization [30.26365073195728]
We study the generalization performance of gradient methods in the fundamental convex optimization setting. We show that an application of the same construction technique provides a similar $Omega(sqrtd)$ lower bound for the sample complexity of SGD to reach a non-trivial empirical error.
arXiv Detail & Related papers (2024-01-22T15:50:32Z)
The Sample Complexity Of ERMs In Stochastic Convex Optimization [13.896417716930687]
We show that in fact $tildeO(fracdepsilon+frac1epsilon2)$ data points are also sufficient. We further generalize the result and show that a similar upper bound holds for all convex bodies.
arXiv Detail & Related papers (2023-11-09T14:29:25Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best identification in discounted linear Markov+Delta Decision in the fixed confidence setting under a generative model. The lower bound as the solution of an intricate non- optimization program can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
Robust High Dimensional Expectation Maximization Algorithm via Trimmed Hard Thresholding [24.184520829631587]
We study the problem of estimating latent variable models with arbitrarily corrupted samples in high dimensional space. We propose a method called Trimmed (Gradient) Expectation Maximization which adds a trimming gradient step. We show that the algorithm is corruption-proofing and converges to the (near) optimal statistical rate geometrically.
arXiv Detail & Related papers (2020-10-19T15:00:35Z)
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model. For loss factors, we prove that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)
Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = langle X,w* rangle + epsilon$ We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.