Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel
- URL: http://arxiv.org/abs/2311.01762v1
- Date: Fri, 3 Nov 2023 07:43:53 GMT
- Title: Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel
- Authors: Oskar Allerbo
- Abstract summary: Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters.
Using the iterative approach opens up the possibility of changing the kernel during training, something that is investigated in this paper.
We propose an update scheme for the bandwidth of translation-invariant kernels, where we let the bandwidth decrease to zero during training.
- Score: 1.5229257192293204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kernel ridge regression, KRR, is a generalization of linear ridge regression
that is non-linear in the data, but linear in the parameters. The solution can
be obtained either as a closed-form solution, which includes a matrix
inversion, or iteratively through gradient descent. Using the iterative
approach opens up the possibility of changing the kernel during training, something
that is investigated in this paper. We theoretically address the effects this has on
model complexity and generalization. Based on our findings, we propose an
update scheme for the bandwidth of translation-invariant kernels, where we
let the bandwidth decrease to zero during training, thus circumventing the need
for hyper-parameter selection. We demonstrate on real and synthetic data how
decreasing the bandwidth during training outperforms using a constant
bandwidth, selected by cross-validation and marginal likelihood maximization.
We also show, theoretically and empirically, that with a decreasing bandwidth we
are able to achieve both zero training error in combination with good
generalization, and a double descent behavior, phenomena that do not occur for
KRR with a constant bandwidth but are known to appear for neural networks.
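For concreteness, here is a minimal Python/NumPy sketch of the setting described above: KRR fitted with gradient descent on a Gaussian kernel whose bandwidth is shrunk towards zero during training. The geometric decay rate, step-size heuristic, and iteration budget are illustrative assumptions, not the update scheme derived in the paper.

```python
# Minimal sketch (not the paper's exact algorithm): kernel ridge regression
# trained by gradient descent while the Gaussian-kernel bandwidth shrinks.
import numpy as np

def gaussian_kernel(X1, X2, bandwidth):
    """Gaussian (RBF) kernel matrix for the given bandwidth."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-sq_dists / (2 * bandwidth**2))

def krr_gd_decreasing_bandwidth(X, y, lam=1e-3, n_steps=500,
                                sigma0=1.0, decay=0.99):
    """Gradient descent on 0.5*||y - K a||^2 + 0.5*lam*a'K a, recomputing K
    as the bandwidth decreases each step (hypothetical geometric schedule)."""
    alpha = np.zeros(len(y))
    sigma = sigma0
    for _ in range(n_steps):
        K = gaussian_kernel(X, X, sigma)
        grad = K @ (K @ alpha - y) + lam * (K @ alpha)   # gradient w.r.t. alpha
        k_max = np.linalg.eigvalsh(K)[-1]                # largest kernel eigenvalue
        alpha -= grad / (k_max**2 + lam * k_max)         # step below 1/Lipschitz
        sigma *= decay                                   # let the bandwidth -> 0
    return alpha, sigma

def predict(X_train, X_test, alpha, sigma):
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
alpha, sigma = krr_gd_decreasing_bandwidth(X, y)
print("final bandwidth:", sigma)
print("training MSE:", np.mean((predict(X, X, alpha, sigma) - y) ** 2))
```

With a constant bandwidth the same loop converges to the ordinary KRR solution $(K + \lambda I)^{-1} y$; letting the bandwidth shrink during training is what removes the need to select it beforehand.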
Related papers
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices (a simplified dual-descent sketch is given after this list).
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Solving Kernel Ridge Regression with Gradient-Based Optimization Methods [1.5229257192293204]
Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters.
We show theoretically and empirically how the $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions.
arXiv Detail & Related papers (2023-06-29T10:29:29Z)
- Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
arXiv Detail & Related papers (2023-01-29T20:54:03Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control [1.5229257192293204]
We propose a closed-form, feather-light bandwidth selection method based on controlling the Jacobian.
We show on real and synthetic data that, compared to cross-validation and marginal likelihood, our method is on par in terms of model performance, but up to six orders of magnitude faster.
arXiv Detail & Related papers (2022-05-24T10:36:05Z)
- Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method.
Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose widths are quadratic in the sample size and linear in their depth, in a time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z)
- Optimal Rates of Distributed Regression with Imperfect Kernels [0.0]
We study distributed kernel regression via the divide-and-conquer approach.
We show that kernel ridge regression can achieve rates faster than $N^{-1}$ in the noise-free setting.
arXiv Detail & Related papers (2020-06-30T13:00:16Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
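As a companion to the stochastic dual descent entry above, the following hypothetical sketch runs randomized block gradient descent on the KRR/GP dual objective $\frac{1}{2}\alpha^\top(K + \lambda I)\alpha - y^\top\alpha$, whose minimizer is the closed-form solution $(K + \lambda I)^{-1} y$. The batch size, step size, regularization level, and iteration count are assumptions; the paper's actual algorithm adds momentum, iterate averaging, and other design choices not reproduced here.

```python
# Hypothetical simplification of dual descent for KRR / GP regression:
# randomized block gradient steps on L(a) = 0.5*a'(K + lam*I)a - y'a.
import numpy as np

def dual_descent(K, y, lam=1e-1, batch_size=8, n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Gradient of the dual objective restricted to the sampled block.
        grad_block = K[idx] @ alpha + lam * alpha[idx] - y[idx]
        # diag(K) = 1 for a Gaussian kernel, so the block Lipschitz constant
        # is at most batch_size + lam; stepping below its inverse is safe.
        alpha[idx] -= grad_block / (batch_size + lam)
    return alpha

# Usage: compare against the closed-form coefficients on toy data.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)
alpha_iterative = dual_descent(K, y)
alpha_closed_form = np.linalg.solve(K + 1e-1 * np.eye(100), y)
print("max coefficient gap:", np.max(np.abs(alpha_iterative - alpha_closed_form)))
```

The gap between the iterative and closed-form coefficients shrinks as more cheap block updates are taken, without ever forming or inverting the full $(K + \lambda I)$ system.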