Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior
- URL: http://arxiv.org/abs/2205.13799v1
- Date: Fri, 27 May 2022 07:23:01 GMT
- Title: Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior
- Authors: Jian Li and Xuanyuan Luo
- Abstract summary: We show a new high probability generalization bound of order $O(frac1n + fracL2n2sum_t=1T(gamma_t/varepsilon_t)2)$ for gradient Langevin Dynamics (GLD)
We can also obtain new bounds for certain variants of SGD.
- Score: 8.76346911214414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proving algorithm-dependent generalization error bounds for gradient-type
optimization methods has attracted significant attention recently in learning
theory. However, most existing trajectory-based analyses require either
restrictive assumptions on the learning rate (e.g., fast decreasing learning
rate), or continuous injected noise (such as the Gaussian noise in Langevin
dynamics). In this paper, we introduce a new discrete data-dependent prior to
the PAC-Bayesian framework, and prove a high probability generalization bound
of order $O(\frac{1}{n}\cdot
\sum_{t=1}^T(\gamma_t/\varepsilon_t)^2\left\|{\mathbf{g}_t}\right\|^2)$ for
Floored GD (i.e. a version of gradient descent with precision level
$\varepsilon_t$), where $n$ is the number of training samples, $\gamma_t$ is
the learning rate at step $t$, $\mathbf{g}_t$ is roughly the difference of the
gradient computed using all samples and that using only prior samples.
$\left\|{\mathbf{g}_t}\right\|$ is upper bounded by and and typical much
smaller than the gradient norm $\left\|{\nabla f(W_t)}\right\|$. We remark that
our bound holds for nonconvex and nonsmooth scenarios. Moreover, our
theoretical results provide numerically favorable upper bounds of testing
errors (e.g., $0.037$ on MNIST). Using a similar technique, we can also obtain
new generalization bounds for certain variants of SGD. Furthermore, we study
the generalization bounds for gradient Langevin Dynamics (GLD). Using the same
framework with a carefully constructed continuous prior, we show a new high
probability generalization bound of order $O(\frac{1}{n} +
\frac{L^2}{n^2}\sum_{t=1}^T(\gamma_t/\sigma_t)^2)$ for GLD. The new $1/n^2$
rate is due to the concentration of the difference between the gradient of
training samples and that of the prior.
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $widetildeO(n-4/5)$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $mathbfx*$ rather than for all $mathbfx*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Detection of Dense Subhypergraphs by Low-Degree Polynomials [72.4451045270967]
Detection of a planted dense subgraph in a random graph is a fundamental statistical and computational problem.
We consider detecting the presence of a planted $Gr(ngamma, n-alpha)$ subhypergraph in a $Gr(n, n-beta) hypergraph.
Our results are already new in the graph case $r=2$, as we consider the subtle log-density regime where hardness based on average-case reductions is not known.
arXiv Detail & Related papers (2023-04-17T10:38:08Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for BFG convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Making Progress Based on False Discoveries [14.505867475659274]
We consider the question of adaptive data analysis within the framework of convex optimization.
We show that for a general analyst (not necessarily gradient descent) $Omega (1/epsilon3)$ samples are required.
Second, we show that, under certain assumptions on the oracle, in an interaction with gradient descent $tilde Omega (1/epsilon2.5)$ samples are necessary.
arXiv Detail & Related papers (2022-04-19T11:17:10Z) - A first-order primal-dual method with adaptivity to local smoothness [64.62056765216386]
We consider the problem of finding a saddle point for the convex-concave objective $min_x max_y f(x) + langle Ax, yrangle - g*(y)$, where $f$ is a convex function with locally Lipschitz gradient and $g$ is convex and possibly non-smooth.
We propose an adaptive version of the Condat-Vu algorithm, which alternates between primal gradient steps and dual steps.
arXiv Detail & Related papers (2021-10-28T14:19:30Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $eta-2$ steps, in contrast to the local analysis of Blanc et al. ( 2020) that is only valid for $eta-1.6$ steps and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - Accelerating Optimization and Reinforcement Learning with
Quasi-Stochastic Approximation [2.294014185517203]
This paper sets out to extend convergence theory to quasi-stochastic approximations.
It is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-09-30T04:44:45Z) - Fast Dimension Independent Private AdaGrad on Publicly Estimated
Subspaces [24.52697154861819]
We show that noisy AdaGrad achieves a regret comparable to traditional AdaGrad plus a well-controlled term due to noise.
We operate with general convex functions in both constrained and unconstrained minimization.
arXiv Detail & Related papers (2020-08-14T20:46:38Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $maxsigma_mathrmGD, rhon 2-R$.
No algorithm within a certain class can converge faster than $maxsigma_mathrmGD, 2-R$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.