Benefit of deep learning with non-convex noisy gradient descent:
Provable excess risk bound and superiority to kernel methods
- URL: http://arxiv.org/abs/2012.03224v1
- Date: Sun, 6 Dec 2020 09:22:16 GMT
- Title: Benefit of deep learning with non-convex noisy gradient descent:
Provable excess risk bound and superiority to kernel methods
- Authors: Taiji Suzuki and Shunta Akiyama
- Abstract summary: We show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate.
The obtained excess risk bounds are so-called fast learning rates, which are faster than the $O(1/\sqrt{n})$ rate.
- Score: 41.60125423028092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Establishing a theoretical analysis that explains why deep learning can
outperform shallow learning such as kernel methods is one of the biggest issues
in the deep learning literature. Towards answering this question, we evaluate
excess risk of a deep learning estimator trained by a noisy gradient descent
with ridge regularization on a mildly overparameterized neural network, and
discuss its superiority to a class of linear estimators that includes neural
tangent kernel approach, random feature model, other kernel methods, $k$-NN
estimator and so on. We consider a teacher-student regression model, and
eventually show that any linear estimator can be outperformed by deep learning
in the sense of the minimax optimal rate, especially in a high-dimensional setting.
The obtained excess risk bounds are so-called fast learning rates, which are faster
than the $O(1/\sqrt{n})$ rate obtained by the usual Rademacher complexity analysis.
This discrepancy is induced by the non-convex geometry of the model, and the
noisy gradient descent used for neural network training provably reaches a
near-global optimal solution even though the loss landscape is highly non-convex.
Although the noisy gradient descent does not employ any explicit or implicit
sparsity-inducing regularization, it shows a preferable generalization
performance that dominates linear estimators.
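As a rough illustration of the training procedure analyzed above, the following sketch runs noisy gradient descent (gradient Langevin dynamics) with ridge regularization on a mildly overparameterized two-layer network in a teacher-student regression model. It is a minimal sketch only: the width, step size, temperature, and noise level are placeholder values chosen for readability, not the scalings required by the paper's theory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher-student regression model: the teacher is a small two-layer network,
# the student is a mildly overparameterized two-layer network of the same form.
d, m_teacher, m_student, n = 10, 5, 50, 200

# Fixed teacher parameters and noisy training data y = f_teacher(x) + noise.
W_t = rng.normal(size=(m_teacher, d))
a_t = rng.normal(size=m_teacher)
X = rng.normal(size=(n, d))
y = np.tanh(X @ W_t.T) @ a_t + 0.1 * rng.normal(size=n)

# Student parameters.
W = rng.normal(size=(m_student, d)) * 0.1
a = rng.normal(size=m_student) * 0.1

eta, lam, beta, steps = 1e-2, 1e-3, 1e4, 2000  # step size, ridge weight, inverse temperature

for _ in range(steps):
    h = np.tanh(X @ W.T)            # hidden activations, shape (n, m_student)
    pred = h @ a
    r = pred - y                    # residuals
    # Gradients of the ridge-regularized squared loss.
    grad_a = h.T @ r / n + lam * a
    grad_W = ((r[:, None] * (1 - h**2) * a).T @ X) / n + lam * W
    # Noisy (Langevin) gradient step: plain gradient descent plus Gaussian noise.
    a -= eta * grad_a + np.sqrt(2 * eta / beta) * rng.normal(size=a.shape)
    W -= eta * grad_W + np.sqrt(2 * eta / beta) * rng.normal(size=W.shape)

print("training MSE:", np.mean((np.tanh(X @ W.T) @ a - y) ** 2))
```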
Related papers
- Hebbian learning inspired estimation of the linear regression parameters
from queries [18.374824005225186]
We study a variation of the Hebbian learning rule to recover the regression vector in the linear regression model.
We prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
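For orientation, a minimal non-adaptive Hebbian-style estimator for this setting can be sketched as follows: with isotropic Gaussian queries, averaging the products of input and response recovers the regression vector, since E[y x] = beta. The paper's point is that an adaptive variant of such a rule, choosing queries based on past data, achieves considerably faster rates; the query distribution and sample size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

d, T = 20, 20000
beta_true = rng.normal(size=d)              # unknown regression vector
acc = np.zeros(d)

for t in range(T):
    x = rng.normal(size=d)                  # non-adaptive query, x ~ N(0, I_d)
    y = x @ beta_true + 0.1 * rng.normal()  # noisy linear response to the query
    acc += y * x                            # Hebbian accumulation: input times output

beta_hat = acc / T                          # E[y x] = beta_true for isotropic queries
print("estimation error:", np.linalg.norm(beta_hat - beta_true))
```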
arXiv Detail & Related papers (2023-09-26T19:00:32Z) - Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and allow one to trade off estimation accuracy against computational complexity.
arXiv Detail & Related papers (2023-06-06T19:02:57Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
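A small finite-dimensional sketch of that setting: gradient descent on a quadratic objective, stopped early, where the learning rate determines which eigendirections of the curvature operator have already converged. The three-point spectrum and iteration budget below are illustrative choices, not the paper's separable-Hilbert-space analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic objective f(x) = 0.5 * x^T A x - b^T x with a spread spectrum,
# standing in for a kernel least-squares problem.
eigvals = np.array([1.0, 0.1, 0.01])
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]
A = Q @ np.diag(eigvals) @ Q.T
b = Q @ np.array([1.0, 1.0, 1.0])
x_star = np.linalg.solve(A, b)

def gd(lr, steps):
    x = np.zeros(3)
    for _ in range(steps):
        x -= lr * (A @ x - b)
    return x

# With early stopping (fixed iteration budget), the learning rate decides which
# eigendirections have converged: along an eigenvalue lam, the residual after
# t steps shrinks by the factor (1 - lr * lam)^t.
for lr in [0.1, 1.0, 1.9]:
    x = gd(lr, steps=50)
    resid = Q.T @ (x - x_star)      # error expressed in the eigenbasis of A
    print(f"lr={lr}: per-eigendirection error = {np.round(np.abs(resid), 4)}")
```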
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Convergence rates for gradient descent in the training of
overparameterized artificial neural networks with biases [3.198144010381572]
In recent years, artificial neural networks have developed into a powerful tool for dealing with a multitude of problems for which classical solution approaches reach their limits.
However, it is still unclear why randomly initialized gradient descent algorithms succeed in training such networks.
arXiv Detail & Related papers (2021-02-23T18:17:47Z) - Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) showed that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
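As a toy illustration of why clipping helps on objectives whose smoothness constant grows with the iterate (the regime studied in this line of work), the sketch below compares vanilla and clipped gradient descent on f(x) = x^4/4; the step size, threshold, and objective are illustrative choices, not the paper's setting.

```python
import numpy as np

np.seterr(over="ignore", invalid="ignore")  # vanilla GD is expected to overflow here

def grad(x):
    # Gradient of f(x) = x**4 / 4, whose local smoothness grows with |x|.
    return x ** 3

def run(lr=0.1, steps=100, x0=5.0, clip=None):
    x = np.float64(x0)
    for _ in range(steps):
        g = grad(x)
        if clip is not None:
            # Clipping: shrink the step whenever the gradient norm exceeds the threshold.
            g *= min(1.0, clip / (abs(g) + 1e-12))
        x = x - lr * g
        if not np.isfinite(x):
            return x                 # vanilla GD blows up at this step size
    return x

print("vanilla GD :", run())           # diverges for this step size
print("clipped GD :", run(clip=1.0))   # stays stable and moves toward the minimizer x = 0
```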
arXiv Detail & Related papers (2020-10-05T14:36:59Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Learning Rates as a Function of Batch Size: A Random Matrix Theory
Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal learning rates of stochastic gradient descent and adaptive training regimens for smooth, non-convex deep neural networks.
We validate our claims on VGG/ResNet architectures and the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
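The paper's sample-analytic estimator itself is more involved; for context, a commonly used baseline for this problem is a straight-through-style estimator for a stochastic binary activation, sketched below in a sigmoid-slope variant (PyTorch is used here purely for the autograd plumbing; this is not the paper's proposed method).

```python
import torch

class StochasticBinary(torch.autograd.Function):
    """Stochastic binary activation with a straight-through-style gradient.

    Forward: sample z in {-1, +1} with P(z = +1) = sigmoid(x).
    Backward: pass the gradient through the mean E[z] = 2*sigmoid(x) - 1,
    ignoring the discrete sampling step.
    """

    @staticmethod
    def forward(ctx, x):
        p = torch.sigmoid(x)
        ctx.save_for_backward(p)
        return 2.0 * torch.bernoulli(p) - 1.0

    @staticmethod
    def backward(ctx, grad_out):
        (p,) = ctx.saved_tensors
        return grad_out * 2.0 * p * (1.0 - p)   # derivative of E[z] w.r.t. x

x = torch.randn(4, requires_grad=True)
z = StochasticBinary.apply(x)
z.sum().backward()
print(x.grad)
```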
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-convex functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.