On the Optimal Weighted $\ell_2$ Regularization in Overparameterized
Linear Regression
- URL: http://arxiv.org/abs/2006.05800v4
- Date: Tue, 3 Nov 2020 02:20:13 GMT
- Title: On the Optimal Weighted $\ell_2$ Regularization in Overparameterized
Linear Regression
- Authors: Denny Wu and Ji Xu
- Abstract summary: We consider the linear model $\mathbf{y} = \mathbf{X} \mathbf{\beta}_\star + \mathbf{\epsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the overparameterized regime $p>n$.
We provide an exact characterization of the prediction risk $\mathbb{E}(y-\mathbf{x}^T\hat{\mathbf{\beta}}_\lambda)^2$ in the proportional asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$.
- Score: 23.467801864841526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the linear model $\mathbf{y} = \mathbf{X} \mathbf{\beta}_\star +
\mathbf{\epsilon}$ with $\mathbf{X}\in \mathbb{R}^{n\times p}$ in the
overparameterized regime $p>n$. We estimate $\mathbf{\beta}_\star$ via
generalized (weighted) ridge regression: $\hat{\mathbf{\beta}}_\lambda =
\left(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{\Sigma}_w\right)^\dagger
\mathbf{X}^T\mathbf{y}$, where $\mathbf{\Sigma}_w$ is the weighting matrix.
Under a random design setting with general data covariance $\mathbf{\Sigma}_x$
and anisotropic prior on the true coefficients
$\mathbb{E}\mathbf{\beta}_\star\mathbf{\beta}_\star^T = \mathbf{\Sigma}_\beta$,
we provide an exact characterization of the prediction risk
$\mathbb{E}(y-\mathbf{x}^T\hat{\mathbf{\beta}}_\lambda)^2$ in the proportional
asymptotic limit $p/n\rightarrow \gamma \in (1,\infty)$. Our general setup
leads to a number of interesting findings. We outline precise conditions that
decide the sign of the optimal setting $\lambda_{\rm opt}$ for the ridge
parameter $\lambda$ and confirm the implicit $\ell_2$ regularization effect of
overparameterization, which theoretically justifies the surprising empirical
observation that $\lambda_{\rm opt}$ can be negative in the overparameterized
regime. We also characterize the double descent phenomenon for principal
component regression (PCR) when both $\mathbf{X}$ and $\mathbf{\beta}_\star$
are anisotropic. Finally, we determine the optimal weighting matrix
$\mathbf{\Sigma}_w$ for both the ridgeless ($\lambda\to 0$) and optimally
regularized ($\lambda = \lambda_{\rm opt}$) case, and demonstrate the advantage
of the weighted objective over standard ridge regression and PCR.
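As a concrete illustration of the estimator defined above, the following is a minimal NumPy sketch of generalized (weighted) ridge regression $\hat{\mathbf{\beta}}_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{\Sigma}_w)^\dagger \mathbf{X}^T\mathbf{y}$ in an overparameterized setting. The dimensions, covariance choices, noise level, and value of $\lambda$ are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative overparameterized setting (p > n); these sizes are assumptions.
n, p, lam = 100, 300, 0.5

# Anisotropic data covariance Sigma_x and prior covariance Sigma_beta (diagonal here for simplicity).
sigma_x = np.diag(np.linspace(0.5, 2.0, p))
sigma_beta = np.diag(np.linspace(2.0, 0.5, p)) / p

X = rng.standard_normal((n, p)) @ np.sqrt(sigma_x)        # rows x_i with covariance Sigma_x
beta_star = np.sqrt(sigma_beta) @ rng.standard_normal(p)  # E[beta beta^T] = Sigma_beta
y = X @ beta_star + 0.1 * rng.standard_normal(n)

def weighted_ridge(X, y, lam, sigma_w):
    """Generalized ridge estimator (X^T X + lam * Sigma_w)^+ X^T y."""
    return np.linalg.pinv(X.T @ X + lam * sigma_w) @ X.T @ y

# Standard ridge corresponds to Sigma_w = I; other weighting matrices plug in here.
beta_hat = weighted_ridge(X, y, lam, np.eye(p))

# Monte Carlo estimate of the prediction risk E (y - x^T beta_hat)^2 on fresh draws
# from the same (assumed) design.
X_test = rng.standard_normal((2000, p)) @ np.sqrt(sigma_x)
y_test = X_test @ beta_star + 0.1 * rng.standard_normal(2000)
risk = np.mean((y_test - X_test @ beta_hat) ** 2)
print(f"estimated prediction risk: {risk:.4f}")
```

With $\mathbf{\Sigma}_w = \mathbf{I}$ this reduces to standard ridge regression, and letting $\lambda \to 0$ in the pseudoinverse form recovers the minimum-norm interpolator, i.e., the ridgeless case discussed above.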
Related papers
- In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting [21.002519159190538]
We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients.
We obtain a global $\mathbf{V}$ in $\mathbb{R}^{d\times r}$ common to all clients and a local $\mathbf{U}^i$ in $\mathbb{R}^{n_i\times r}$.
arXiv Detail & Related papers (2024-09-13T12:28:42Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components that set the transformer architecture apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Piecewise Linearity of Min-Norm Solution Map of a Nonconvexly Regularized Convex Sparse Model [8.586951231230596]
We study the piecewise constant sparsity pattern of $\mathbf{x}_\star(\mathbf{y},\lambda)$ in each linear zone.
We iteratively compute the closed-form expression of $\mathbf{x}_\star(\mathbf{y},\lambda)$ in each linear zone.
arXiv Detail & Related papers (2023-11-30T10:39:47Z) - Optimal Estimator for Linear Regression with Shuffled Labels [17.99906229036223]
This paper considers the task of linear regression with shuffled labels.
The model is $\mathbf{Y} = \mathbf{\Pi}\mathbf{X}\mathbf{B} + \mathbf{W}$, with $\mathbf{Y}\in\mathbb{R}^{n\times m}$, $\mathbf{\Pi}\in\mathbb{R}^{n\times n}$, $\mathbf{X}\in\mathbb{R}^{n\times p}$, $\mathbf{B}\in\mathbb{R}^{p\times m}$, and $\mathbf{W}\in\mathbb{R}^{n\times m}$ denoting the observations, unknown permutation, design, signal, and noise, respectively.
arXiv Detail & Related papers (2023-10-02T16:44:47Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Statistically Optimal Robust Mean and Covariance Estimation for
Anisotropic Gaussians [3.5788754401889014]
In the strong $\varepsilon$-contamination model we assume that the adversary replaced an $\varepsilon$ fraction of vectors in the original Gaussian sample with arbitrary other vectors.
We construct an estimator $\widehat{\Sigma}$ of the covariance matrix $\Sigma$ that satisfies an error bound holding with probability at least $1-\delta$.
arXiv Detail & Related papers (2023-01-21T23:28:55Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\mathsf{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\mathrm{tr}(A R(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A\in\mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we establish a bound on $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$.
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - On the computational and statistical complexity of over-parameterized
matrix sensing [30.785670369640872]
We consider solving the low-rank matrix sensing problem with the Factorized Gradient Descent (FGD) method.
By decomposing the factorized matrix $\mathbf{F}$ into separate column spaces, we show that $\|\mathbf{F}_t\mathbf{F}_t^\top - \mathbf{X}^*\|_F^2$ converges to a statistical error.
arXiv Detail & Related papers (2021-01-27T04:23:49Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)