Overparameterized random feature regression with nearly orthogonal data
- URL: http://arxiv.org/abs/2211.06077v3
- Date: Sun, 13 Aug 2023 06:23:46 GMT
- Title: Overparameterized random feature regression with nearly orthogonal data
- Authors: Zhichao Wang and Yizhe Zhu
- Abstract summary: We study the non-asymptotic behaviors of the random feature ridge regression (RFRR) given by a two-layer neural network.
Our results hold for a wide variety of activation functions and nearly orthogonal deterministic input data sets.
- Score: 21.97381518762387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the properties of random feature ridge regression (RFRR) given
by a two-layer neural network with random Gaussian initialization. We study the
non-asymptotic behaviors of the RFRR with nearly orthogonal deterministic
unit-length input data vectors in the overparameterized regime, where the width
of the first layer is much larger than the sample size. Our analysis shows
high-probability non-asymptotic concentration results for the training errors,
cross-validations, and generalization errors of RFRR centered around their
respective values for a kernel ridge regression (KRR). This KRR is derived from
an expected kernel generated by a nonlinear random feature map. We then
approximate the performance of the KRR by a polynomial kernel matrix obtained
from the Hermite polynomial expansion of the activation function, whose degree
only depends on the orthogonality among different data points. This polynomial
kernel determines the asymptotic behavior of the RFRR and the KRR. Our results
hold for a wide variety of activation functions and input data sets that
exhibit nearly orthogonal properties. Based on these approximations, we obtain
a lower bound for the generalization error of the RFRR for a nonlinear
student-teacher model.
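To make the setup concrete, here is a minimal numpy sketch (illustrative only, not the authors' code): it builds RFRR from a two-layer network whose first-layer weights are random Gaussian and compares it with KRR using the expected kernel $K(x_i,x_j)=\mathbb{E}_{w\sim\mathcal{N}(0,I_d)}[\sigma(w^\top x_i)\,\sigma(w^\top x_j)]$; the activation ($\sigma=\mathrm{ReLU}$), the toy target, and the sizes are assumptions chosen for illustration. For unit-length inputs this expected kernel has a closed form (the degree-one arc-cosine kernel), and expanding $\sigma$ in normalized Hermite polynomials, $\sigma(z)=\sum_{k\ge 0} c_k h_k(z)$, gives $K(x_i,x_j)=\sum_{k\ge 0} c_k^2\langle x_i,x_j\rangle^k$, which is why near-orthogonality ($\langle x_i,x_j\rangle\approx 0$ for $i\neq j$) lets a low-degree polynomial kernel capture the off-diagonal entries.
```python
# Minimal illustrative sketch (not the authors' code): RFRR from a two-layer
# network with random Gaussian first-layer weights vs. KRR with the expected
# kernel, for ReLU activation and nearly orthogonal unit-length inputs.
import numpy as np

rng = np.random.default_rng(0)
n, d, N, lam = 200, 400, 20000, 1e-2  # samples, dimension, width (N >> n), ridge

# Independent random unit vectors in high dimension are nearly orthogonal.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.tanh(X @ rng.standard_normal(d)) + 0.1 * rng.standard_normal(n)  # toy nonlinear target

# RFRR: random features sigma(X W) with W ~ N(0, I_d), ridge regression on top.
W = rng.standard_normal((d, N))
Phi = np.maximum(X @ W, 0.0) / np.sqrt(N)   # ReLU features with 1/sqrt(N) scaling
K_rf = Phi @ Phi.T                          # empirical random-feature kernel
yhat_rf = K_rf @ np.linalg.solve(K_rf + lam * np.eye(n), y)

# KRR with the expected kernel E_w[relu(w.x_i) relu(w.x_j)]: for unit-length
# inputs this is the degree-1 arc-cosine kernel (Cho & Saul, 2009).
G = np.clip(X @ X.T, -1.0, 1.0)
theta = np.arccos(G)
K = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)
yhat_krr = K @ np.linalg.solve(K + lam * np.eye(n), y)

# With width much larger than sample size, the RFRR training predictions
# track their KRR counterparts.
print("max |RFRR - KRR| on training points:", np.max(np.abs(yhat_rf - yhat_krr)))
```
When the width $N$ is much larger than $n$, the empirical random-feature kernel concentrates around the expected kernel, so the two sets of training predictions nearly coincide; increasing $N$ in the sketch shrinks the printed gap.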
Related papers
- Universality of kernel random matrices and kernel regression in the quadratic regime [18.51014786894174]
In this work, we extend the study of kernel regression to the quadratic regime.
We establish an operator norm approximation bound for the difference between the original kernel random matrix and a quadratic kernel random matrix.
We characterize the precise training and generalization errors for KRR in the quadratic regime when $n/d^2$ converges to a nonzero constant.
arXiv Detail & Related papers (2024-08-02T07:29:49Z) - Dimension-free deterministic equivalents and scaling laws for random feature regression [11.607594737176973]
We show that the test error is well approximated by a closed-form expression that only depends on the feature map eigenvalues.
Notably, our approximation guarantee is non-asymptotic, multiplicative, and independent of the feature map dimension.
arXiv Detail & Related papers (2024-05-24T16:43:26Z) - Characterizing Overfitting in Kernel Ridgeless Regression Through the Eigenspectrum [6.749750044497731]
We prove the phenomena of tempered overfitting and catastrophic overfitting under the sub-Gaussian design assumption.
We also identify that the independence of the features plays an important role in guaranteeing tempered overfitting.
arXiv Detail & Related papers (2024-02-02T10:36:53Z) - A Theoretical Analysis of the Test Error of Finite-Rank Kernel Ridge
Regression [23.156642467474995]
Finite-rank kernels naturally appear in several machine learning problems, e.g., when fine-tuning a pre-trained deep neural network's last layer to adapt it to a novel task.
We address this gap by deriving sharp non-asymptotic upper and lower bounds for the test error of any finite-rank KRR.
Our bounds are tighter than previously derived bounds on finite-rank KRR, and unlike comparable results, they remain valid for any choice of the regularization parameter.
arXiv Detail & Related papers (2023-10-02T08:52:29Z) - Simplex Random Features [53.97976744884616]
We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels.
We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels.
We show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
arXiv Detail & Related papers (2023-01-31T18:53:39Z) - Sharp Asymptotics of Kernel Ridge Regression Beyond the Linear Regime [22.58196673815647]
Kernel ridge regression (KRR) exhibits a multi-phased pattern that crucially depends on the scaling relationship between the sample size $n$ and the underlying dimension $d$.
We show that the learning curves of KRR can have a delicate "double descent" behavior due to specific bias-variance trade-offs at different scaling regimes.
arXiv Detail & Related papers (2022-05-13T17:50:54Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Hybrid Random Features [60.116392415715275]
We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs).
HRFs automatically adapt the quality of kernel estimation to provide the most accurate approximation in the defined regions of interest.
arXiv Detail & Related papers (2021-10-08T20:22:59Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and
Non-Asymptotic Concentration [115.1954841020189]
We study the asymptotic and non-asymptotic properties of linear stochastic approximation procedures with Polyak-Ruppert averaging.
We prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity.
arXiv Detail & Related papers (2020-04-09T17:54:18Z) - Improved guarantees and a multiple-descent curve for Column Subset
Selection and the Nyström method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z)
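The column-subset/Nyström entry above can be illustrated with the standard Nyström approximation (a generic sketch, not the paper's improved guarantees): approximate an RBF kernel matrix from a random subset of its columns and observe how the approximation error changes as the RBF parameter is varied.
```python
# Minimal sketch of the standard Nystrom approximation (not the paper's method):
# approximate an RBF kernel matrix from a random column subset and check how
# the error varies with the RBF bandwidth parameter.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 500, 20, 50          # data size, dimension, number of landmark columns
X = rng.standard_normal((n, d))

def rbf_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

for gamma in [0.01, 0.1, 1.0]:
    K = rbf_kernel(X, X, gamma)
    idx = rng.choice(n, size=m, replace=False)   # column subset selection
    C = K[:, idx]                                # n x m block of sampled columns
    W = K[np.ix_(idx, idx)]                      # m x m block on the subset
    K_nys = C @ np.linalg.pinv(W) @ C.T          # Nystrom approximation
    err = np.linalg.norm(K - K_nys) / np.linalg.norm(K)
    print(f"gamma={gamma}: relative Frobenius error = {err:.3f}")
```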