Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels
- URL: http://arxiv.org/abs/2103.01210v1
- Date: Mon, 1 Mar 2021 18:54:13 GMT
- Title: Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels
- Authors: Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro
- Abstract summary: We show that gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing.
We show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.
- Score: 51.18614735359657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the relative power of learning with gradient descent on
differentiable models, such as neural networks, versus using the corresponding
tangent kernels. We show that under certain conditions, gradient descent
achieves small error only if a related tangent kernel method achieves a
non-trivial advantage over random guessing (a.k.a. weak learning), though this
advantage might be very small even when gradient descent can achieve
arbitrarily high accuracy. Complementing this, we show that without these
conditions, gradient descent can in fact learn with small error even when no
kernel method, in particular using the tangent kernel, can achieve a
non-trivial advantage over random guessing.
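As a concrete illustration of the two regimes being compared, here is a minimal numpy sketch (ours, not the paper's construction) that fits a toy two-layer tanh model in two ways from the same initialization: kernel regression with the tangent kernel at initialization, and gradient descent on the model itself. Model, data, and hyperparameters are all illustrative.

```python
# (a) tangent-kernel regression with K = J J^T, where J stacks the parameter
#     gradients of the model outputs at initialization, versus
# (b) plain gradient descent on the model parameters.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 32, 5, 64                                  # samples, inputs, width
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                                 # toy +/-1 labels

W0 = rng.standard_normal((m, d)) / np.sqrt(d)        # trained hidden weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)     # fixed output weights

f = lambda W, X: np.tanh(X @ W.T) @ a                # model output, shape (n,)

def jacobian(W, X):                                  # d f(x_i) / d W, flattened
    S = (1.0 - np.tanh(X @ W.T) ** 2) * a            # (n, m)
    return (S[:, :, None] * X[:, None, :]).reshape(len(X), -1)

# (a) tangent-kernel ridge regression at initialization
J = jacobian(W0, X)
alpha = np.linalg.solve(J @ J.T + 1e-6 * np.eye(n), y)
kernel_preds = (J @ J.T) @ alpha

# (b) gradient descent on the squared loss of the model itself
W, lr = W0.copy(), 0.2
for _ in range(2000):
    r = f(W, X) - y
    S = (1.0 - np.tanh(X @ W.T) ** 2) * a
    W -= lr * (S * r[:, None]).T @ X / n             # gradient of mean sq. loss

print(np.mean(np.sign(kernel_preds) != y), np.mean(np.sign(f(W, X)) != y))
```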
Related papers
- An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes [17.804065824245402]
In machine learning applications, each loss function is non-negative and can be expressed as the composition of a square and its real-valued square root.
We show how to apply the Gauss-Newton method or the Levenberg-Marquardt method to minimize the average of smooth but possibly non-convex functions.
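For intuition on where such stepsizes come from: writing a non-negative loss as f(x) = g(x)^2 and applying one Gauss-Newton step to the residual g gives an update expressed purely in terms of f. A sketch of that classical step (not the paper's algorithm):

```python
# Gauss-Newton on the residual g, where f = g**2, gives the step
# d = -g * grad_g / ||grad_g||^2, which rewritten in terms of f is
#   x <- x - (2 f(x) / ||grad f(x)||^2) grad f(x).
import numpy as np

def gn_step(x, f, grad_f, eps=1e-12):
    g = grad_f(x)
    return x - (2.0 * f(x) / (np.dot(g, g) + eps)) * g

# toy non-negative loss: f(x) = ||x - 1||^2, so g(x) = ||x - 1||
f = lambda x: np.sum((x - 1.0) ** 2)
grad_f = lambda x: 2.0 * (x - 1.0)

x = np.zeros(3)
for _ in range(5):
    x = gn_step(x, f, grad_f)
print(x)  # converges to the minimizer [1, 1, 1]
```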
arXiv Detail & Related papers (2024-07-05T08:53:06Z)
- How to guess a gradient [68.98681202222664]
We show that gradients are more structured than previously thought.
Exploiting this structure can significantly improve gradient-free optimization schemes.
We highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
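A minimal sketch of the baseline such work builds on: guess a random direction, estimate the directional derivative by finite differences, and step along the guess. Names and hyperparameters are illustrative, not the paper's method:

```python
import numpy as np

def guessed_gradient_step(x, f, lr=0.1, h=1e-4, rng=np.random.default_rng(0)):
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)                            # random unit direction
    deriv = (f(x + h * u) - f(x - h * u)) / (2 * h)   # directional derivative
    return x - lr * deriv * u                         # step along the guess

f = lambda x: np.sum(x ** 2)
x = np.ones(10)
for _ in range(2000):
    x = guessed_gradient_step(x, f)
print(f(x))  # decreases toward 0, slowly compared to exact gradients
```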
arXiv Detail & Related papers (2023-12-07T21:40:44Z)
- Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum [18.10812063219831]
We introduce Modified Spectrum Kernels (MSKs) to approximate kernels with desired eigenvalues.
We propose a preconditioned gradient descent method, which alters the trajectory of gradient descent.
Our method is both computationally efficient and simple to implement.
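A sketch of the basic operation, with an arbitrary eigenvalue map standing in for the paper's MSK construction:

```python
# Modify a PSD kernel matrix's spectrum: eigendecompose, remap eigenvalues,
# and reassemble with the same eigenvectors.
import numpy as np

def modify_spectrum(K, eig_map):
    vals, vecs = np.linalg.eigh(K)          # K symmetric PSD
    return (vecs * eig_map(vals)) @ vecs.T  # same eigenvectors, new eigenvalues

X = np.random.default_rng(0).standard_normal((50, 3))
K = np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # RBF kernel
K_mod = modify_spectrum(K, lambda v: np.sqrt(np.clip(v, 0.0, None)))  # slower decay
```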
arXiv Detail & Related papers (2023-07-26T22:39:47Z)
- Stochastic Marginal Likelihood Gradients using Neural Tangent Kernels [78.6096486885658]
We introduce lower bounds to the linearized Laplace approximation of the marginal likelihood.
These bounds are amenable to gradient-based optimization and make it possible to trade off estimation accuracy against computational complexity.
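For the regression case, the linearized Laplace marginal likelihood reduces to the Gaussian log evidence of the linearized model with the NTK as Gram matrix; a minimal sketch of that quantity (our simplification, assuming K is the NTK Gram matrix on the training inputs and the likelihood noise is Gaussian):

```python
# log Z = -1/2 [ y^T (K + s^2 I)^{-1} y + log det(K + s^2 I) + n log(2 pi) ]
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))  # stable solve & logdet
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (y @ alpha + logdet + n * np.log(2 * np.pi))
```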
arXiv Detail & Related papers (2023-06-06T19:02:57Z)
- On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates, first observed in deep learning, can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
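The mechanism is easy to see numerically: for a quadratic, gradient descent shrinks the residual along an eigendirection with eigenvalue lam by a factor (1 - lr*lam) per step, so the learning rate and the stopping time jointly select which part of the spectrum gets fit. A tiny illustration with an arbitrary spectrum:

```python
import numpy as np

lams = np.array([1.0, 0.1, 0.01])        # spectrum of the quadratic's Hessian
t = 50                                    # early-stopping time
for lr in (0.5, 1.9):                     # small vs large (both < 2 / max lam)
    residual = (1 - lr * lams) ** t       # per-mode residual after t GD steps
    print(lr, residual)                   # larger lr fits small eigenvalues more
```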
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
- High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points in high probability with the best-known iteration complexity for smooth losses.
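A minimal sketch combining those three ingredients; the hyperparameters and the exact combination are illustrative, not the paper's tuned algorithm:

```python
import numpy as np

def clipped_normalized_sgd(x, stoch_grad, steps=1000, lr=0.01,
                           beta=0.9, clip=1.0):
    m = np.zeros_like(x)
    for _ in range(steps):
        g = stoch_grad(x)
        norm = np.linalg.norm(g)
        if norm > clip:                   # clip heavy-tailed gradient estimates
            g = g * (clip / norm)
        m = beta * m + (1 - beta) * g     # momentum averages out the noise
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)  # normalized step
    return x

rng = np.random.default_rng(0)
stoch_grad = lambda x: 2 * x + rng.standard_cauchy(x.shape)  # heavy-tailed noise
print(clipped_normalized_sgd(np.ones(5), stoch_grad))
```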
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
- Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods [41.60125423028092]
We show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate.
The excess risk bounds are so-called fast learning rates, which are faster than $O(1/\sqrt{n})$ bounds.
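A generic sketch of the noisy gradient descent in the title, i.e. gradient Langevin dynamics with inverse temperature beta (our simplification, not the paper's analyzed variant):

```python
import numpy as np

def noisy_gd(x, grad_f, steps=1000, lr=1e-2, beta=1e3,
             rng=np.random.default_rng(0)):
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        x = x - lr * grad_f(x) + np.sqrt(2 * lr / beta) * noise  # Langevin step
    return x

print(noisy_gd(3.0 * np.ones(5), lambda x: 2 * x))  # near 0, with small jitter
```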
arXiv Detail & Related papers (2020-12-06T09:22:16Z)
- Tight Second-Order Certificates for Randomized Smoothing [106.06908242424481]
We show that there also exists a universal curvature-like bound for Gaussian random smoothing.
In addition to proving the correctness of this novel certificate, we show that SoS certificates are realizable and therefore tight.
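For context, randomized smoothing itself predicts the majority class of a base classifier under Gaussian input noise, and certificates bound how far that majority can move under perturbation. A minimal sketch of the prediction step:

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000,
                     rng=np.random.default_rng(0)):
    noise = rng.standard_normal((n_samples,) + x.shape) * sigma
    votes = np.array([base_classifier(x + eps) for eps in noise])
    classes, counts = np.unique(votes, return_counts=True)
    return classes[np.argmax(counts)]            # majority class under noise

base = lambda z: int(z.sum() > 0)                # toy base classifier
print(smoothed_predict(base, np.array([0.3, -0.1])))
```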
arXiv Detail & Related papers (2020-10-20T18:03:45Z)
- Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
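The distinction is easiest to see in one dimension: gradient descent drives f'(x) to zero, whereas a root-seeking iteration drives f(x) itself to zero. An illustrative contrast (not the paper's norm-adapted update):

```python
f  = lambda x: x ** 2 - 1.0    # roots at +/-1, stationary point at 0
df = lambda x: 2.0 * x

x_gd, x_root = 2.0, 2.0
for _ in range(50):
    x_gd   -= 0.1 * df(x_gd)             # -> 0.0 (stationary point of f)
    x_root -= f(x_root) / df(x_root)     # -> 1.0 (root of f, Newton's method)
print(x_gd, x_root)
```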
arXiv Detail & Related papers (2020-10-09T20:22:29Z)