Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron
- URL: http://arxiv.org/abs/2303.02255v2
- Date: Mon, 26 Jun 2023 16:39:41 GMT
- Title: Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron
- Authors: Jingfeng Wu and Difan Zou and Zixiang Chen and Vladimir Braverman and
Quanquan Gu and Sham M. Kakade
- Abstract summary: We analyze a Perceptron-type algorithm called GLM-tron and provide its dimension-free risk upper bounds for high-dimensional ReLU regression.
Our results suggest that GLM-tron might be preferable to SGD for high-dimensional ReLU regression.
- Score: 121.10338065441417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers the problem of learning a single ReLU neuron with
squared loss (a.k.a., ReLU regression) in the overparameterized regime, where
the input dimension can exceed the number of samples. We analyze a
Perceptron-type algorithm called GLM-tron (Kakade et al., 2011) and provide its
dimension-free risk upper bounds for high-dimensional ReLU regression in both
well-specified and misspecified settings. Our risk bounds recover several
existing results as special cases. Moreover, in the well-specified setting, we
provide an instance-wise matching risk lower bound for GLM-tron. Our upper and
lower risk bounds provide a sharp characterization of the high-dimensional ReLU
regression problems that can be learned via GLM-tron. On the other hand, we
provide some negative results for stochastic gradient descent (SGD) for ReLU
regression with symmetric Bernoulli data: if the model is well-specified, the
excess risk of SGD is provably no better than that of GLM-tron, up to
constant factors, for each problem instance; and in the noiseless case,
GLM-tron can achieve a small risk while SGD unavoidably suffers from a constant
risk in expectation. These results together suggest that GLM-tron might be
preferable to SGD for high-dimensional ReLU regression.
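The GLM-tron update analyzed above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact procedure: the function name, full-batch form, step size, and zero initialization are illustrative assumptions. The defining feature of GLM-tron (Kakade et al., 2011) is that, unlike SGD on the squared loss, its update omits the activation derivative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def glmtron(X, y, n_iters=300, lr=1.0):
    """Full-batch GLM-tron sketch for ReLU regression.

    Perceptron-type update: w <- w + lr * mean((y - relu(Xw)) * x).
    Note the update omits the ReLU derivative sigma'(w.x), which is
    what distinguishes it from SGD on the squared loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residual = y - relu(X @ w)         # (n,) residuals
        w = w + lr * (X.T @ residual) / n  # no sigma' factor here
    return w
```

On well-specified noiseless data (y = relu(x . w*)), this iteration drives the empirical squared risk toward zero, consistent with the noiseless-case result summarized above.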
Related papers
- Precise analysis of ridge interpolators under heavy correlations -- a Random Duality Theory view [0.0]
We show that Random Duality Theory (RDT) can be utilized to obtain precise closed-form characterizations of all estimators obtained by optimizing the quantities of interest.
arXiv Detail & Related papers (2024-06-13T14:56:52Z)
- Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
- Global Convergence of Sub-gradient Method for Robust Matrix Recovery: Small Initialization, Noisy Measurements, and Over-parameterization [4.7464518249313805]
The sub-gradient method (SubGM) is used to recover a low-rank matrix from a limited number of measurements.
We show that SubGM converges to the true solution, even under arbitrarily large and arbitrarily dense noise values.
arXiv Detail & Related papers (2022-02-17T17:50:04Z)
- Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z)
- The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z)
- A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
This paper provides a general sample complexity analysis of the "vanilla" PG method.
arXiv Detail & Related papers (2021-07-23T19:38:17Z)
- Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a regression approach for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploiting the order of the primal solution despite its unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z)
- Globally-convergent Iteratively Reweighted Least Squares for Robust Regression Problems [15.823258699608994]
We provide the first global model recovery results for IRLS (iteratively reweighted least squares) applied to robust regression problems.
We propose augmentations to the basic IRLS routine that not only offer guaranteed global recovery, but in practice also outperform state-of-the-art algorithms for robust regression.
arXiv Detail & Related papers (2020-06-25T07:16:13Z)
- Interpolating Predictors in High-Dimensional Factor Regression [2.1055643409860743]
This work studies finite-sample properties of the risk of the minimum-norm interpolating predictor in high-dimensional regression models.
We show that the min-norm interpolating predictor can have similar risk to predictors based on principal components regression and ridge regression, and can improve over LASSO based predictors, in the high-dimensional regime.
arXiv Detail & Related papers (2020-02-06T22:08:36Z)
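The basic IRLS routine summarized in the robust-regression entry above can be sketched as follows. This is a minimal NumPy illustration under an assumed $L_1$ objective, not the paper's augmented algorithm: the function name, the residual clipping constant, and the least-squares initialization are illustrative assumptions.

```python
import numpy as np

def irls_l1(X, y, n_iters=50, delta=1e-8):
    """IRLS sketch for robust (L1) regression: min_w ||Xw - y||_1.

    Each iteration solves a weighted least-squares problem whose
    weights 1/max(|residual|, delta) downweight samples with large
    residuals, so gross outliers lose influence."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary LS initialization
    for _ in range(n_iters):
        r = np.abs(y - X @ w)
        s = 1.0 / np.maximum(r, delta)        # clip to avoid divide-by-zero
        Xs = X * s[:, None]                   # row-weighted design matrix
        w = np.linalg.solve(X.T @ Xs, Xs.T @ y)  # weighted normal equations
    return w
```

With clean inliers and a small fraction of gross outliers, this reweighting typically recovers the underlying coefficients where ordinary least squares does not, which is the failure mode the guaranteed-recovery augmentations above are designed to address in general.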
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.