Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior
- URL: http://arxiv.org/abs/2205.13799v1
- Date: Fri, 27 May 2022 07:23:01 GMT
- Title: Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior
- Authors: Jian Li and Xuanyuan Luo
- Abstract summary: We show a new high probability generalization bound of order $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ for gradient Langevin Dynamics (GLD).
We can also obtain new bounds for certain variants of SGD.
- Score: 8.76346911214414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proving algorithm-dependent generalization error bounds for gradient-type
optimization methods has attracted significant attention recently in learning
theory. However, most existing trajectory-based analyses require either
restrictive assumptions on the learning rate (e.g., fast decreasing learning
rate), or continuous injected noise (such as the Gaussian noise in Langevin
dynamics). In this paper, we introduce a new discrete data-dependent prior to
the PAC-Bayesian framework, and prove a high probability generalization bound
of order $O(\frac{1}{n}\cdot
\sum_{t=1}^T(\gamma_t/\varepsilon_t)^2\left\|{\mathbf{g}_t}\right\|^2)$ for
Floored GD (i.e. a version of gradient descent with precision level
$\varepsilon_t$), where $n$ is the number of training samples, $\gamma_t$ is
the learning rate at step $t$, and $\mathbf{g}_t$ is roughly the difference between
the gradient computed using all samples and the gradient computed using only the
prior samples. The norm $\left\|{\mathbf{g}_t}\right\|$ is upper bounded by, and
typically much smaller than, the gradient norm $\left\|{\nabla f(W_t)}\right\|$. We remark that
our bound holds for nonconvex and nonsmooth scenarios. Moreover, our
theoretical results provide numerically favorable upper bounds of testing
errors (e.g., $0.037$ on MNIST). Using a similar technique, we can also obtain
new generalization bounds for certain variants of SGD. Furthermore, we study
the generalization bounds for gradient Langevin Dynamics (GLD). Using the same
framework with a carefully constructed continuous prior, we show a new high
probability generalization bound of order $O(\frac{1}{n} +
\frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ for GLD. The new $1/n^2$
rate is due to the concentration of the difference between the gradient of
training samples and that of the prior.
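To make the Floored GD update concrete: the abstract describes it only as gradient descent with precision level $\varepsilon_t$, so the sketch below takes the natural reading that each iterate is floored to a grid of spacing $\varepsilon_t$ after a standard gradient step. The loss, schedules, and rounding rule here are assumptions, not the paper's exact construction.

```python
import numpy as np

def floored_gd(grad, w0, gammas, epsilons):
    """Hypothetical reading of Floored GD: a plain gradient step whose
    result is floored to a grid of spacing eps_t (the "precision level").
    `grad`, `gammas`, and `epsilons` are user-supplied placeholders."""
    w = np.asarray(w0, dtype=float).copy()
    for gamma, eps in zip(gammas, epsilons):
        w = w - gamma * grad(w)        # standard GD step on the training loss
        w = np.floor(w / eps) * eps    # snap each coordinate down to the eps-grid
    return w

# Toy usage: quadratic loss f(w) = ||w||^2 / 2, so grad f(w) = w.
w_final = floored_gd(grad=lambda w: w,
                     w0=np.ones(3),
                     gammas=[0.1] * 50,
                     epsilons=[1e-3] * 50)
```

Flooring keeps the trajectory in a countable set, which is plausibly what makes the discrete data-dependent prior in the PAC-Bayesian argument applicable.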
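For comparison, here is a sketch of the GLD update that the $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ bound refers to, assuming the common parameterization in which step $t$ adds isotropic Gaussian noise of scale $\sigma_t$; the paper's exact noise scaling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def gld(grad, w0, gammas, sigmas):
    """Gradient Langevin Dynamics in one common parameterization:
    a gradient step plus isotropic Gaussian noise of scale sigma_t."""
    w = np.asarray(w0, dtype=float).copy()
    for gamma, sigma in zip(gammas, sigmas):
        w = w - gamma * grad(w) + sigma * rng.standard_normal(w.shape)
    return w
```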
Related papers
- Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping [21.865728815935665]
We provide the first convergence guarantee under heavy-tailed noise without gradient clipping.
We also establish the first $\mathcal{O}(T^{\frac{1-\mathfrak{p}}{3\mathfrak{p}-2}})$ convergence rate for the case where the tail index $\mathfrak{p}$ is unknown in advance.
arXiv Detail & Related papers (2024-12-27T08:46:46Z)
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z)
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
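For concreteness, a single-index target depends on its input only through one projection, as in the toy instance sketched below; the link $\sigma_*$, dimension, and data are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)        # unit index direction

def sigma_star(z):
    return z**3 - 3 * z               # an example degree-3 polynomial link

def f_star(x):
    """Single-index target: depends on x only through <x, theta>."""
    return sigma_star(x @ theta)

X = rng.standard_normal((10, d))
y = f_star(X)                         # labels a two-layer network would be trained on
```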
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
- Detection of Dense Subhypergraphs by Low-Degree Polynomials [72.4451045270967]
Detection of a planted dense subgraph in a random graph is a fundamental statistical and computational problem.
We consider detecting the presence of a planted $G^r(n^{\gamma}, n^{-\alpha})$ subhypergraph in a $G^r(n, n^{-\beta})$ hypergraph.
Our results are already new in the graph case $r=2$, as we consider the subtle log-density regime where hardness based on average-case reductions is not known.
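Spelled out, $G^r(n, p)$ is the $r$-uniform Erdős–Rényi hypergraph in which every $r$-subset of vertices is an edge independently with probability $p$; the planted instance overlays a denser copy on a hidden vertex subset. A small sampling sketch with illustrative parameter values:

```python
import itertools
import random

random.seed(0)

def gr(vertices, p, r):
    """G^r(|vertices|, p): each r-subset is an edge independently w.p. p."""
    return {e for e in itertools.combinations(sorted(vertices), r)
            if random.random() < p}

n, r = 40, 3
alpha, beta, gamma = 0.5, 1.0, 0.7                 # illustrative exponents
ambient = gr(range(n), n ** -beta, r)              # G^r(n, n^-beta)
hidden = random.sample(range(n), int(n ** gamma))  # n^gamma planted vertices
planted = gr(hidden, n ** -alpha, r)               # denser G^r(n^gamma, n^-alpha)
H = ambient | planted                              # the planted instance
```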
arXiv Detail & Related papers (2023-04-17T10:38:08Z)
- ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
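The summary does not spell out the estimator itself; for intuition, the classical Monte-Carlo estimator of the gradient of a Gaussian-smoothed function, based on Stein's identity, is sketched below. ReSQue is described as a reweighted refinement of this kind of query, so treat this only as background, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_grad(f, x, rho, m=1000):
    """Estimate grad of f_rho(x) = E[f(x + rho*Z)], Z ~ N(0, I), via
    Stein's identity: grad f_rho(x) = E[f(x + rho*Z) * Z] / rho.
    `f` must accept a batch of points of shape (m, d)."""
    z = rng.standard_normal((m, x.size))
    fx = f(x + rho * z)                    # shape (m,)
    return (fx[:, None] * z).mean(axis=0) / rho

# Toy check on f(x) = ||x||^2, whose smoothed gradient at x is 2x.
g = smoothed_grad(lambda p: (p ** 2).sum(axis=1), np.ones(5), rho=0.1)
```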
arXiv Detail & Related papers (2023-01-01T18:51:29Z)
- Making Progress Based on False Discoveries [14.505867475659274]
We consider the question of adaptive data analysis within the framework of convex optimization.
We show that for a general analyst (not necessarily gradient descent) $\Omega(1/\epsilon^3)$ samples are required.
Second, we show that, under certain assumptions on the oracle, in an interaction with gradient descent $\tilde{\Omega}(1/\epsilon^{2.5})$ samples are necessary.
arXiv Detail & Related papers (2022-04-19T11:17:10Z)
- A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $\min_x \max_y f(x) + \langle Ax, y\rangle - g^*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual steps.
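For reference, the base (non-adaptive) Condat-Vu iteration for this saddle-point problem alternates a primal gradient step with a dual proximal step, as in the sketch below; the cited paper's contribution is adapting the primal step size $\tau$ to local smoothness, which is not reproduced here.

```python
import numpy as np

def condat_vu(grad_f, prox_g_conj, A, x0, y0, tau, sigma, iters=500):
    """Base (non-adaptive) Condat-Vu iteration for
    min_x max_y f(x) + <Ax, y> - g*(y).  Fixed steps tau, sigma are
    used here; the cited paper chooses tau adaptively."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x_next = x - tau * (grad_f(x) + A.T @ y)                    # primal gradient step
        y = prox_g_conj(y + sigma * (A @ (2 * x_next - x)), sigma)  # dual prox step
        x = x_next
    return x, y
```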
arXiv Detail & Related papers (2021-10-28T14:19:30Z)
- What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z)
- Accelerating Optimization and Reinforcement Learning with Quasi-Stochastic Approximation [2.294014185517203]
This paper sets out to extend convergence theory to quasi-stochastic approximations.
It is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-09-30T04:44:45Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
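As a sketch of the setting, gradient descent on the empirical square loss of a single ReLU neuron looks like the following; the data, initialization, and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_single_neuron(X, y, lr=0.1, steps=1000):
    """(Sub)gradient descent on w -> mean((relu(<w, x>) - y)^2) / 2,
    the single-neuron square-loss objective of the paper's setting."""
    n, d = X.shape
    w = rng.standard_normal(d) / np.sqrt(d)         # nonzero start so ReLU is active
    for _ in range(steps):
        z = X @ w
        residual = np.maximum(z, 0.0) - y           # relu(<w, x>) - y
        w -= lr * (X.T @ (residual * (z > 0))) / n  # chain rule through relu
    return w

# Toy usage: realizable labels from a planted neuron.
X = rng.standard_normal((200, 10))
w_star = rng.standard_normal(10)
w_hat = fit_single_neuron(X, np.maximum(X @ w_star, 0.0))
```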
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
- Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)
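The exact DQ-GD construction is not given in the summary above; the sketch below shows generic error-feedback quantized gradient descent, which captures the "differential" idea of compensating past quantization errors, under an illustrative uniform quantizer.

```python
import numpy as np

def quantize(v, step):
    """Uniform scalar quantizer with cell size `step` (a stand-in for an
    R-bit quantizer; DQ-GD's actual quantizer is more refined)."""
    return np.round(v / step) * step

def quantized_gd(grad, w0, lr, step, iters=200):
    """Gradient descent that transmits quantized gradients and carries the
    quantization error forward (error feedback) -- a generic sketch in the
    spirit of DQ-GD, not the paper's exact scheme."""
    w = np.asarray(w0, dtype=float).copy()
    err = np.zeros_like(w)
    for _ in range(iters):
        g = grad(w) + err        # compensate the previous quantization error
        q = quantize(g, step)    # what is actually transmitted
        err = g - q              # carry the fresh error to the next round
        w -= lr * q
    return w
```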