Near Optimal Heteroscedastic Regression with Symbiotic Learning
- URL: http://arxiv.org/abs/2306.14288v2
- Date: Sat, 1 Jul 2023 16:36:17 GMT
- Title: Near Optimal Heteroscedastic Regression with Symbiotic Learning
- Authors: Dheeraj Baby and Aniket Das and Dheeraj Nagaraj and Praneeth
Netrapalli
- Abstract summary: We consider the problem of heteroscedastic linear regression.
We can estimate $\mathbf{w}^{*}$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2\cdot \left(\frac{1}{n} + \left(\frac{d}{n}\right)^2\right)\right)$ and prove a matching lower bound.
- Score: 29.16456701187538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of heteroscedastic linear regression, where, given
$n$ samples $(\mathbf{x}_i, y_i)$ from $y_i = \langle \mathbf{w}^{*},
\mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i
\rangle$ with $\mathbf{x}_i \sim N(0,\mathbf{I})$, $\epsilon_i \sim N(0,1)$, we
aim to estimate $\mathbf{w}^{*}$. Beyond classical applications of such models
in statistics, econometrics, time series analysis, etc., this model is also particularly
relevant in machine learning when data is collected from multiple sources of
varying but a priori unknown quality. Our work shows that we can estimate
$\mathbf{w}^{*}$ in squared norm up to an error of
$\tilde{O}\left(\|\mathbf{f}^{*}\|^2 \cdot \left(\frac{1}{n} +
\left(\frac{d}{n}\right)^2\right)\right)$ and prove a matching lower bound
(up to log factors). This represents a substantial improvement upon the previous
best known upper bound of $\tilde{O}\left(\|\mathbf{f}^{*}\|^2\cdot
\frac{d}{n}\right)$. Our algorithm is an alternating minimization procedure
with two key subroutines: (1) an adaptation of the classical weighted least
squares heuristic to estimate $\mathbf{w}^{*}$, for which we provide the first
non-asymptotic guarantee, and (2) a nonconvex pseudogradient descent procedure,
inspired by phase retrieval, for estimating $\mathbf{f}^{*}$. As corollaries, we
obtain fast non-asymptotic rates for two important problems, linear regression
with multiplicative noise and phase retrieval with multiplicative noise, both
of which are of independent interest. Beyond this, the proof of our lower
bound, which involves a novel adaptation of LeCam's method for handling
infinite mutual information quantities (thereby preventing a direct application
of standard techniques like Fano's method), could also be of broader interest
for establishing lower bounds for other heteroscedastic or heavy-tailed
statistical problems.
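
To make the model and the alternating scheme concrete, the following is a minimal NumPy sketch of the setup above: data drawn from $y_i = \langle \mathbf{w}^{*}, \mathbf{x}_i \rangle + \epsilon_i \cdot \langle \mathbf{f}^{*}, \mathbf{x}_i \rangle$, a weighted least squares step for $\mathbf{w}$, and a phase-retrieval-style pseudogradient step for $\mathbf{f}$. It follows the abstract only at a high level; the initialization, step-size scaling, variance floor, and iteration count are illustrative assumptions, not the paper's exact subroutines.

import numpy as np

# Minimal sketch (assumptions: step sizes, variance floor, initialization,
# and iteration count are illustrative, not the paper's exact subroutines).
rng = np.random.default_rng(0)
n, d = 5000, 20

# Hypothetical ground truth for the demo.
w_star = rng.normal(size=d)
f_star = 0.5 * rng.normal(size=d)

# Data from the heteroscedastic model y_i = <w*, x_i> + eps_i * <f*, x_i>.
X = rng.normal(size=(n, d))          # x_i ~ N(0, I)
eps = rng.normal(size=n)             # eps_i ~ N(0, 1)
y = X @ w_star + eps * (X @ f_star)

def weighted_least_squares(X, y, f_hat, floor=1e-3):
    # Weight each sample by the inverse of its estimated noise variance <f_hat, x_i>^2.
    var = np.maximum((X @ f_hat) ** 2, floor)
    W = 1.0 / var
    A = X.T @ (W[:, None] * X)
    b = X.T @ (W * y)
    return np.linalg.solve(A, b)

def pseudogradient_step(X, r, f_hat, lr=0.1):
    # One descent step on the phase-retrieval-style objective
    # (1/4n) * sum_i (<f_hat, x_i>^2 - r_i^2)^2, where r_i = y_i - <w_hat, x_i>.
    s = X @ f_hat
    g = ((s ** 2 - r ** 2) * s) @ X / len(r)
    step = lr / max(np.mean(s ** 2), 1e-6)   # crude step-size scaling (assumption)
    return f_hat - step * g

# Alternate between the two subroutines, starting from ordinary least squares.
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
f_hat = np.full(d, np.std(y - X @ w_hat) / np.sqrt(d))   # crude scale initialization

for _ in range(100):
    r = y - X @ w_hat
    f_hat = pseudogradient_step(X, r, f_hat)
    w_hat = weighted_least_squares(X, y, f_hat)

print("squared error of w_hat:", np.sum((w_hat - w_star) ** 2))

Since the pseudogradient objective depends on $\mathbf{f}$ only through $\langle \mathbf{f}, \mathbf{x}_i \rangle^2$, $\mathbf{f}$ is recoverable only up to sign, which does not affect the inverse-variance weights used to re-estimate $\mathbf{w}$.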
Related papers
- In-depth Analysis of Low-rank Matrix Factorisation in a Federated Setting [21.002519159190538]
We analyze a distributed algorithm to compute a low-rank matrix factorization on $N$ clients.
We obtain a global $\mathbf{V}$ in $\mathbb{R}^{d \times r}$ common to all clients and a local $\mathbf{U}^{i}$ in $\mathbb{R}^{n_i \times r}$.
arXiv Detail & Related papers (2024-09-13T12:28:42Z) - Sample-Efficient Linear Regression with Self-Selection Bias [7.605563562103568]
We consider the problem of linear regression with self-selection bias in the unknown-index setting.
We provide a novel and near-optimally sample-efficient (in terms of $k$) algorithm to recover $\mathbf{w}_1, \ldots, \mathbf{w}_k$.
Our algorithm succeeds under significantly relaxed noise assumptions, and therefore also succeeds in the related setting of max-linear regression.
arXiv Detail & Related papers (2024-02-22T02:20:24Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^{*}$ rather than for all $\mathbf{x}^{*}$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Convergence of Alternating Gradient Descent for Matrix Factorization [5.439020425819001]
We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective.
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the smoothness constant $C$ in the iteration complexity $T$ can be taken to be an absolute constant.
arXiv Detail & Related papers (2023-05-11T16:07:47Z) - Misspecified Phase Retrieval with Generative Priors [15.134280834597865]
We estimate an $n$-dimensional signal $\mathbf{x}$ from $m$ i.i.d. realizations of a single index model.
We show that both steps enjoy a statistical rate of order $\sqrt{(k\log L)\cdot (\log m)/m}$ under suitable conditions.
arXiv Detail & Related papers (2022-10-11T16:04:11Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix
Factorization [49.090785356633695]
We study the asymmetric low-rank factorization problem, minimizing jointly over factors $\mathbf{U} \in \mathbb{R}^{m \times d}$ and $\mathbf{V} \in \mathbb{R}^{n \times d}$.
arXiv Detail & Related papers (2021-06-27T17:25:24Z) - Optimal Mean Estimation without a Variance [103.26777953032537]
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist.
We design an estimator which attains the smallest possible confidence interval as a function of $n$, $d$, and $\delta$.
arXiv Detail & Related papers (2020-11-24T22:39:21Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.