Precise Learning Curves and Higher-Order Scaling Limits for Dot Product
Kernel Regression
- URL: http://arxiv.org/abs/2205.14846v3
- Date: Mon, 12 Jun 2023 13:22:21 GMT
- Title: Precise Learning Curves and Higher-Order Scaling Limits for Dot Product
Kernel Regression
- Authors: Lechao Xiao, Hong Hu, Theodor Misiakiewicz, Yue M. Lu, Jeffrey
Pennington
- Abstract summary: We focus on the problem of kernel ridge regression for dot-product kernels.
We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.
- Score: 41.48538038768993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As modern machine learning models continue to advance the computational
frontier, it has become increasingly important to develop precise estimates for
expected performance improvements under different model and data scaling
regimes. Currently, theoretical understanding of the learning curves that
characterize how the prediction error depends on the number of samples is
restricted to either large-sample asymptotics ($m\to\infty$) or, for certain
simple data distributions, to the high-dimensional asymptotics in which the
number of samples scales linearly with the dimension ($m\propto d$). There is a
wide gulf between these two regimes, including all higher-order scaling
relations $m\propto d^r$, which are the subject of the present paper. We focus
on the problem of kernel ridge regression for dot-product kernels and present
precise formulas for the mean of the test error, bias, and variance, for data
drawn uniformly from the sphere with isotropic random labels in the $r$th-order
asymptotic scaling regime $m\to\infty$ with $m/d^r$ held constant. We observe a
peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$,
leading to multiple sample-wise descent and nontrivial behavior at multiple
scales.
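The headline prediction above is a learning-curve peak whenever $m \approx d^r/r!$, which is roughly the number of degree-$r$ spherical harmonics in $d$ dimensions. The sketch below is only illustrative and does not reproduce the paper's precise formulas: the particular dot-product kernel $k(x,y)=\exp(\langle x,y\rangle)$, the ridge value, the dimension $d=20$, the sample-size grid, and the number of trials are all assumptions made here, and how sharply the peaks appear depends on those choices.

```python
# Minimal illustrative sketch (not the paper's exact setup): kernel ridge
# regression with a dot-product kernel on data drawn uniformly from the
# sphere, with isotropic random labels. Kernel, ridge, d, and sample sizes
# are assumptions chosen for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sphere(m, d):
    """Draw m points uniformly from the unit sphere S^{d-1}."""
    x = rng.standard_normal((m, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def dot_product_kernel(X, Y):
    """An assumed dot-product kernel k(x, y) = exp(<x, y>)."""
    return np.exp(X @ Y.T)

def excess_test_risk(m, d, ridge=1e-3, m_test=500, trials=5):
    """Average E[f(x)^2] of the KRR predictor trained on pure-noise labels.

    With labels independent of the inputs, the Bayes-optimal predictor is 0,
    so E[f(x)^2] is the excess test risk (test error minus the noise level).
    """
    risks = []
    for _ in range(trials):
        X, X_test = sphere(m, d), sphere(m_test, d)
        y = rng.standard_normal(m)                      # isotropic random labels
        K = dot_product_kernel(X, X)
        alpha = np.linalg.solve(K + ridge * np.eye(m), y)
        f_test = dot_product_kernel(X_test, X) @ alpha  # KRR predictions
        risks.append(np.mean(f_test ** 2))
    return float(np.mean(risks))

d = 20
# Scan sample sizes around the predicted peak locations m ~ d^r / r!
for m in [d // 2, d, 2 * d, d**2 // 4, d**2 // 2, d**2]:
    print(f"m = {m:4d}   excess test risk ~ {excess_test_risk(m, d):.3f}")
```

For this choice of $d$, peaks would be expected near $m \approx d = 20$ and $m \approx d^2/2 = 200$; how visible they are depends on how small the ridge is relative to the kernel's eigenvalues at those degrees.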
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
- Unbiased Kinetic Langevin Monte Carlo with Inexact Gradients [0.8749675983608172]
We present an unbiased method for posterior means based on kinetic Langevin dynamics.
Our proposed estimator is unbiased, attains finite variance, and satisfies a central limit theorem.
Our results demonstrate that in large-scale applications, the unbiased algorithm we present can be 2-3 orders of magnitude more efficient than the "gold-standard" randomized Hamiltonian Monte Carlo.
arXiv Detail & Related papers (2023-11-08T21:19:52Z)
- Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models [69.22568644711113]
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
arXiv Detail & Related papers (2023-06-05T21:08:34Z)
- Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime [22.58196673815647]
We show that kernel ridge regression (KRR) exhibits a multi-phased pattern that crucially depends on the scaling relationship between the sample size $n$ and the underlying dimension $d$.
We show that the learning curves of KRR can have a delicate "double descent" behavior due to specific bias-variance trade-offs at different scaling regimes.
arXiv Detail & Related papers (2022-05-13T17:50:54Z)
- $p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z)
- Max-Linear Regression by Convex Programming [5.366354612549172]
We formulate and analyze a scalable convex program given by anchored regression (AR) as the estimator for the max-linear regression problem.
Our result shows that the number of noise-free observations sufficient for exact recovery scales as $k^4 p$ up to a logarithmic factor.
arXiv Detail & Related papers (2021-03-12T00:55:54Z)
- A Random Matrix Analysis of Random Fourier Features: Beyond the Gaussian Kernel, a Precise Phase Transition, and the Corresponding Double Descent [85.77233010209368]
This article characterizes the exact asymptotics of random Fourier feature (RFF) regression in the realistic setting where the number of data samples $n$, their dimension $p$, and the number of random features $N$ are all large and comparable.
This analysis also provides accurate estimates of training and test regression errors for large $n,p,N$. (A toy numerical sketch of the associated double descent appears after this list.)
arXiv Detail & Related papers (2020-06-09T02:05:40Z)
- A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional theory for boosting on separable data.
Under a class of statistical models, we provide an exact analysis of the generalization error of boosting.
We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)
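The double-descent behavior highlighted in the random Fourier features entry above (and referenced from it) can also be seen in a toy experiment. The sketch below is my own construction under assumed choices, namely a Gaussian-kernel RFF map, a noisy linear teacher, and a near-ridgeless penalty; it is not the cited paper's random-matrix analysis. The test error typically spikes when the number of features $N$ is close to the number of samples $n$.

```python
# Toy illustration only (assumed setup, not the cited paper's analysis):
# random Fourier feature (RFF) ridge regression on a noisy linear teacher.
# With a very small ridge, the test error tends to peak when the number of
# features N is comparable to the number of samples n (double descent).
import numpy as np

rng = np.random.default_rng(1)

n, p, n_test = 300, 20, 1000
ridge = 1e-4                                   # near-ridgeless so the peak is visible
X = rng.standard_normal((n, p))
X_test = rng.standard_normal((n_test, p))
w_star = rng.standard_normal(p) / np.sqrt(p)   # teacher weights
y = X @ w_star + 0.5 * rng.standard_normal(n)  # noisy training labels
y_test = X_test @ w_star                       # noiseless test targets

def rff(A, W, b):
    """Random Fourier features approximating a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(A @ W + b)

for N in [50, 150, 250, 300, 350, 600, 1200]:
    W = rng.standard_normal((p, N))            # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=N)  # random phases
    Z, Z_test = rff(X, W, b), rff(X_test, W, b)
    theta = np.linalg.solve(Z.T @ Z + ridge * np.eye(N), Z.T @ y)
    mse = np.mean((Z_test @ theta - y_test) ** 2)
    print(f"N = {N:5d}   test MSE ~ {mse:.3f}")
```

The exact location and height of the peak depend on the noise level, the ridge, and the feature distribution; the cited paper derives such errors precisely in the regime where $n$, $p$, and $N$ are all large and comparable.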