Related papers: Optimal Rates and Saturation for Noiseless Kernel Ridge Regression

Optimal Rates and Saturation for Noiseless Kernel Ridge Regression

URL: http://arxiv.org/abs/2402.15718v2
Date: Fri, 11 Apr 2025 12:55:40 GMT
Title: Optimal Rates and Saturation for Noiseless Kernel Ridge Regression
Authors: Jihao Long, Xiaojun Peng, Lei Wu,
Abstract summary: We present a comprehensive study of Kernel ridge regression (KRR) in the noiseless regime.<n>KRR is a fundamental method for learning functions from finite samples.<n>We introduce a refined notion of degrees of freedom, which we believe has broader applicability in the analysis of kernel methods.
Score: 4.585021053685196
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Kernel ridge regression (KRR), also known as the least-squares support vector machine, is a fundamental method for learning functions from finite samples. While most existing analyses focus on the noisy setting with constant-level label noise, we present a comprehensive study of KRR in the noiseless regime -- a critical setting in scientific computing where data are often generated via high-fidelity numerical simulations. We establish that, up to logarithmic factors, noiseless KRR achieves minimax optimal convergence rates, jointly determined by the eigenvalue decay of the associated integral operator and the target function's smoothness. These rates are derived under Sobolev-type interpolation norms, with the $L^2$ norm as a special case. Notably, we uncover two key phenomena: an extra-smoothness effect, where the KRR solution exhibits higher smoothness than typical functions in the native reproducing kernel Hilbert space (RKHS), and a saturation effect, where the KRR's adaptivity to the target function's smoothness plateaus beyond a certain level. Leveraging these insights, we also derive a novel error bound for noisy KRR that is noise-level aware and achieves minimax optimality in both noiseless and noisy regimes. As a key technical contribution, we introduce a refined notion of degrees of freedom, which we believe has broader applicability in the analysis of kernel methods. Extensive numerical experiments validate our theoretical results and provide insights beyond existing theory.

Related papers

Optimal and Structure-Adaptive CATE Estimation with Kernel Ridge Regression [2.538209532048867]
We propose an optimal algorithm for estimating conditional average treatment effects (CATEs) when response functions lie in a reproducing kernel Hilbert space (RKHS)<n>We develop a unified two-stage kernel ridge regression (KRR) method that attains minimax rates governed by the complexity of the contrast function rather than the nuisance class.
arXiv Detail & Related papers (2026-02-21T21:29:31Z)
Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems. Such problems are encountered in medicine, physics, and machine learning. We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
Semi-Implicit Functional Gradient Flow for Efficient Sampling [30.32233517392456]
We propose a functional gradient ParVI method that uses perturbed particles with Gaussian noise as the approximation family. We show that the corresponding functional gradient flow, which can be estimated via denoising score matching with neural networks, exhibits strong theoretical convergence guarantees. In addition, we present an adaptive version of our method that automatically selects the appropriate noise magnitude during sampling.
arXiv Detail & Related papers (2024-10-23T15:00:30Z)
A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression [6.749750044497731]
This paper conducts a comprehensive study of the learning curves of kernel ridge regression (KRR) under minimal assumptions. We analyze the role of key properties of the kernel, such as its spectral eigen-decay, the characteristics of the eigenfunctions, and the smoothness of the kernel.
arXiv Detail & Related papers (2024-10-23T11:52:52Z)
On the Wasserstein Convergence and Straightness of Rectified Flow [54.580605276017096]
Rectified Flow (RF) is a generative model that aims to learn straight flow trajectories from noise to data. We provide a theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution. We present general conditions guaranteeing uniqueness and straightness of 1-RF, which is in line with previous empirical findings.
arXiv Detail & Related papers (2024-10-19T02:36:11Z)
Optimal Kernel Quantile Learning with Random Features [0.9208007322096533]
This paper presents a generalization study of kernel quantile regression with random features (KQR-RF) Our study establishes the capacity-dependent learning rates for KQR-RF under mild conditions on the number of RFs. By slightly modifying our assumptions, the capacity-dependent error analysis can also be applied to cases with Lipschitz continuous losses.
arXiv Detail & Related papers (2024-08-24T14:26:09Z)
Learning Analysis of Kernel Ridgeless Regression with Asymmetric Kernel Learning [33.34053480377887]
This paper enhances kernel ridgeless regression with Locally-Adaptive-Bandwidths (LAB) RBF kernels. For the first time, we demonstrate that functions learned from LAB RBF kernels belong to an integral space of Reproducible Kernel Hilbert Spaces (RKHSs)
arXiv Detail & Related papers (2024-06-03T15:28:12Z)
Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification. Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
Understanding the Generalization Ability of Deep Learning Algorithms: A Kernelized Renyi's Entropy Perspective [11.255943520955764]
We propose a novel information theoretical measure: kernelized Renyi's entropy. We establish the generalization error bounds for gradient/Langevin descent (SGD/SGLD) learning algorithms under kernelized Renyi's entropy. We show that our information-theoretical bounds depend on the statistics of the gradients, and are rigorously tighter than the current state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-05-02T01:17:15Z)
Statistical Optimality of Divide and Conquer Kernel-based Functional Linear Regression [1.7227952883644062]
This paper studies the convergence performance of divide-and-conquer estimators in the scenario that the target function does not reside in the underlying kernel space. As a decomposition-based scalable approach, the divide-and-conquer estimators of functional linear regression can substantially reduce the algorithmic complexities in time and memory.
arXiv Detail & Related papers (2022-11-20T12:29:06Z)
Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
convergence rate analysis of the mean field Langevin dynamics is presented. $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z)
Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD. We prove that with constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by gradient descent (SGD) We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting. We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level. We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z)
On the Role of Entropy-based Loss for Learning Causal Structures with Continuous Optimization [27.613220411996025]
A method with non-combinatorial directed acyclic constraint, called NOTEARS, formulates the causal structure learning problem as a continuous optimization problem using least-square loss. We show that the violation of the Gaussian noise assumption will hinder the causal direction identification. We propose a more general entropy-based loss that is theoretically consistent with the likelihood score under any noise distribution.
arXiv Detail & Related papers (2021-06-05T08:29:51Z)
Fast Statistical Leverage Score Approximation in Kernel Ridge Regression [12.258887270632869]
Nystr"om approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems. We propose a linear time (modulo poly-log terms) algorithm to accurately approximate the statistical leverage scores in the stationary- Kernel-based KRR.
arXiv Detail & Related papers (2021-03-09T05:57:08Z)
Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate. We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in gradient descent provides a crucial implicit regularization effect for training over parameterized models. We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
empirical optimization is central to modern machine learning, but its role in its success is still unclear. We show that it commonly arises in parameters of discrete multiplicative noise due to variance. A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate. We find that the learning rate tends to zero for a broad non- neural class functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
SLEIPNIR: Deterministic and Provably Accurate Feature Expansion for Gaussian Process Regression with Derivatives [86.01677297601624]
We propose a novel approach for scaling GP regression with derivatives based on quadrature Fourier features. We prove deterministic, non-asymptotic and exponentially fast decaying error bounds which apply for both the approximated kernel as well as the approximated posterior.
arXiv Detail & Related papers (2020-03-05T14:33:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.