Kernel Learning in Ridge Regression "Automatically" Yields Exact Low
Rank Solution
- URL: http://arxiv.org/abs/2310.11736v2
- Date: Mon, 27 Nov 2023 20:30:54 GMT
- Title: Kernel Learning in Ridge Regression "Automatically" Yields Exact Low
Rank Solution
- Authors: Yunlu Chen, Yang Li, Keli Liu, and Feng Ruan
- Abstract summary: We consider kernels of the form $(x,x') \mapsto \phi(\|x-x'\|^2_\Sigma)$ parametrized by $\Sigma$.
We find that the global minimizer of the finite sample kernel learning objective is also low rank with high probability.
- Score: 6.109362130047454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider kernels of the form $(x,x') \mapsto \phi(\|x-x'\|^2_\Sigma)$
parametrized by $\Sigma$. For such kernels, we study a variant of the kernel
ridge regression problem which simultaneously optimizes the prediction function
and the parameter $\Sigma$ of the reproducing kernel Hilbert space. The
eigenspace of the $\Sigma$ learned from this kernel ridge regression problem
can inform us which directions in covariate space are important for prediction.
Assuming that the covariates have nonzero explanatory power for the response
only through a low dimensional subspace (central mean subspace), we find that
the global minimizer of the finite sample kernel learning objective is also low
rank with high probability. More precisely, the rank of the minimizing $\Sigma$
is with high probability bounded by the dimension of the central mean subspace.
This phenomenon is interesting because the low rankness property is achieved
without using any explicit regularization of $\Sigma$, e.g., nuclear norm
penalization.
Our theory establishes a correspondence between the observed phenomenon and the
notion of low rank set identifiability from the optimization literature. The
low rankness property of the finite sample solutions exists because the
population kernel learning objective grows "sharply" when moving away from its
minimizers in any direction perpendicular to the central mean subspace.
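The following is a minimal NumPy/SciPy sketch of this setup, assuming the Gaussian-type choice $\phi(t)=e^{-t}$ and an unconstrained $\Sigma = LL^\top$ parametrization, neither of which is prescribed by the abstract above (the paper's exact normalization and constraint set for $\Sigma$ are not reproduced here). For a fixed $\Sigma$ the inner kernel ridge regression is solved in closed form, which leaves a profiled objective in $\Sigma$ alone; the eigenvalues of its minimizer can then be inspected for low rank.

```python
import numpy as np
from scipy.optimize import minimize

def kernel_matrix(X, Sigma, phi=lambda t: np.exp(-t)):
    """K_ij = phi(||x_i - x_j||_Sigma^2), where ||d||_Sigma^2 = d^T Sigma d."""
    diff = X[:, None, :] - X[None, :, :]                      # shape (n, n, p)
    sq_dist = np.einsum("ijk,kl,ijl->ij", diff, Sigma, diff)  # Mahalanobis-type distances
    return phi(sq_dist)

def profiled_objective(Sigma, X, y, lam):
    """Kernel learning objective with the prediction function profiled out.

    For fixed Sigma, kernel ridge regression
        min_f (1/n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
    is solved by alpha = (K + n*lam*I)^{-1} y, and the optimal value
    simplifies to lam * y^T (K + n*lam*I)^{-1} y.
    """
    n = len(y)
    K = kernel_matrix(X, Sigma)
    return lam * y @ np.linalg.solve(K + n * lam * np.eye(n), y)

def fit_sigma(X, y, lam, seed=0):
    """Minimize the profiled objective over Sigma = L L^T (PSD by construction).

    Uses finite-difference L-BFGS-B, so this is only practical for small p.
    """
    n, p = X.shape
    rng = np.random.default_rng(seed)
    L0 = 0.1 * rng.standard_normal((p, p))
    obj = lambda v: profiled_objective(v.reshape(p, p) @ v.reshape(p, p).T, X, y, lam)
    res = minimize(obj, L0.ravel(), method="L-BFGS-B")
    L = res.x.reshape(p, p)
    return L @ L.T

# The eigenvalues of the fitted Sigma reveal its numerical rank; the paper's
# result says this rank is, with high probability, at most the dimension of
# the central mean subspace -- even though no rank or nuclear-norm penalty
# appears anywhere in the objective above.
# eigvals = np.linalg.eigvalsh(fit_sigma(X, y, lam=0.1))
```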
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - The Optimality of Kernel Classifiers in Sobolev Space [3.3253452228326332]
This paper investigates the statistical performances of kernel classifiers.
We also propose a simple method to estimate the smoothness of $2\eta(x)-1$ and apply the method to real datasets.
arXiv Detail & Related papers (2024-02-02T05:23:34Z) - Optimal Rates of Kernel Ridge Regression under Source Condition in Large
Dimensions [15.988264513040903]
We study the large-dimensional behavior of kernel ridge regression (KRR) where the sample size $n \asymp d^\gamma$ for some $\gamma > 0$.
Our results illustrate that the convergence-rate curves, viewed as functions of $\gamma$, exhibit periodic plateau behavior and multiple descent behavior.
arXiv Detail & Related papers (2024-01-02T16:14:35Z) - Riemannian stochastic optimization methods avoid strict saddle points [68.80251170757647]
We show that the policies under study avoid strict saddle points / submanifolds with probability 1.
This result provides an important sanity check as it shows that, almost always, the limit state of an algorithm can only be a local minimizer.
arXiv Detail & Related papers (2023-11-04T11:12:24Z) - Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process.
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z) - Taming Nonconvexity in Kernel Feature Selection---Favorable Properties
of the Laplace Kernel [77.73399781313893]
A central challenge is how to formulate the objective function of kernel-based feature selection.
The gradient-based algorithms available for nonconvex optimization can only guarantee convergence to local minima.
arXiv Detail & Related papers (2021-06-17T11:05:48Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - RFN: A Random-Feature Based Newton Method for Empirical Risk
Minimization in Reproducing Kernel Hilbert Spaces [14.924672048447334]
Large-scale finite-sum problems can be solved using efficient variants of the Newton method, where the Hessian is approximated via sub-samples of data.
In this paper, we observe that for this class of problems, one can naturally use kernel approximation to speed up the Newton method; a generic sketch of this idea appears after this list.
We provide a novel second-order algorithm that enjoys local superlinear convergence and global linear convergence.
arXiv Detail & Related papers (2020-02-12T01:14:44Z)
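The RFN entry above observes that kernel approximation can speed up Newton-type methods for empirical risk minimization in an RKHS. The sketch below is a hedged, generic illustration of that idea and not the paper's actual RFN algorithm: a Gaussian kernel is approximated with random Fourier features, and Newton steps are then taken on an $\ell_2$-regularized logistic loss, so each Hessian solve is $D \times D$ (number of random features) rather than $n \times n$.

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Rahimi-Recht features approximating the Gaussian kernel exp(-gamma * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(p, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def newton_logistic(Z, y, lam, iters=20, tol=1e-8):
    """Newton's method for l2-regularized logistic regression in feature space.

    Each step costs O(n*D^2 + D^3), versus the O(n^3) of an exact kernel
    Newton step, which is the speed-up the kernel approximation buys.
    """
    n, D = Z.shape
    w = np.zeros(D)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ w))                  # predicted probabilities
        grad = Z.T @ (p - y) / n + lam * w
        if np.linalg.norm(grad) < tol:
            break
        H = (Z.T * (p * (1.0 - p))) @ Z / n + lam * np.eye(D)
        w -= np.linalg.solve(H, grad)                     # a line search could be added
    return w

# Usage sketch (labels y in {0, 1}):
# Z = random_fourier_features(X, D=200, gamma=0.5)
# w = newton_logistic(Z, y, lam=1e-3)
```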