Statistical Learning under Heterogeneous Distribution Shift
- URL: http://arxiv.org/abs/2302.13934v4
- Date: Fri, 27 Oct 2023 16:47:13 GMT
- Title: Statistical Learning under Heterogeneous Distribution Shift
- Authors: Max Simchowitz, Anurag Ajay, Pulkit Agrawal, Akshay Krishnamurthy
- Abstract summary: Ground-truth predictor is additive, $\mathbb{E}[\mathbf{z} \mid \mathbf{x},\mathbf{y}] = f_\star(\mathbf{x}) + g_\star(\mathbf{y})$.
- Score: 71.8393170225794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the prediction of a target $\mathbf{z}$ from a pair of
random variables $(\mathbf{x},\mathbf{y})$, where the ground-truth predictor is
additive $\mathbb{E}[\mathbf{z} \mid \mathbf{x},\mathbf{y}] =
f_\star(\mathbf{x}) +g_{\star}(\mathbf{y})$. We study the performance of
empirical risk minimization (ERM) over functions $f+g$, $f \in F$ and $g \in
G$, fit on a given training distribution, but evaluated on a test distribution
which exhibits covariate shift. We show that, when the class $F$ is "simpler"
than $G$ (measured, e.g., in terms of its metric entropy), our predictor is
more resilient to heterogeneous covariate shifts in which the shift in
$\mathbf{x}$ is much greater than that in $\mathbf{y}$. Our analysis proceeds
by demonstrating that ERM behaves qualitatively similarly to orthogonal machine
learning: the rate at which ERM recovers the $f$-component of the predictor has
only a lower-order dependence on the complexity of the class $G$, adjusted for
partial non-identifiability introduced by the additive structure. These
results rely on a novel H\"older style inequality for the Dudley integral which
may be of independent interest. Moreover, we corroborate our theoretical
findings with experiments demonstrating improved resilience to shifts in
"simpler" features across numerous domains.
Related papers
- Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss [33.18537822803389]
We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, $\mathscr{F}$ is a weakly sub-Gaussian class.
Our result holds whether the problem is realizable or not, and we refer to this as a "near mixing-free" rate, since direct dependence on mixing is relegated to an additive higher-order term.
arXiv Detail & Related papers (2024-02-08T18:57:42Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Universality laws for Gaussian mixtures in generalized linear models [22.154969876570238]
We investigate the joint statistics of the family of generalized linear estimators $(\Theta_1, \dots, \Theta_M)$.
This allows us to prove the universality of different quantities of interest, such as the training and generalization errors.
We discuss the applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty.
arXiv Detail & Related papers (2023-02-17T15:16:06Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Universality of empirical risk minimization [12.764655736673749]
Consider supervised learning from i.i.d. samples where $\boldsymbol{x}_i \in \mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels.
We study universality of empirical risk minimization over a class of functions that are parameterized by $\mathsf{k}$.
arXiv Detail & Related papers (2022-02-17T18:53:45Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance $\Sigma$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$. (A minimal gradient-descent sketch of this single-neuron setting appears after this list.)
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
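The two single-neuron entries above ("Learning a Single Neuron with Adversarial Label Noise via Gradient Descent" and "Agnostic Learning of a Single Neuron with Gradient Descent") share the basic setting of fitting $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ with gradient descent on the square loss. The sketch below illustrates only that basic setting for a ReLU activation, with assumed Gaussian covariates and a planted teacher plus label noise; it is a toy instance for reference, not the agnostic or adversarial-noise analysis of either paper, and all hyperparameters are illustrative.

```python
# Illustrative sketch only: single-neuron fitting by gradient descent on the
# square loss (planted ReLU teacher, Gaussian inputs); not either paper's
# agnostic-noise analysis. All hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, n, steps, lr = 10, 5000, 500, 0.1
relu = lambda t: np.maximum(t, 0.0)

# Planted neuron x -> relu(w_true . x) with small additive label noise.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = relu(X @ w_true) + 0.1 * rng.normal(size=n)

# Gradient descent on the empirical square loss (1/n) * sum_i (relu(x_i . w) - y_i)^2.
w = 0.1 * rng.normal(size=d)  # small random init; the ReLU (sub)gradient vanishes at w = 0
for _ in range(steps):
    margins = X @ w
    grad = (2.0 / n) * X.T @ ((relu(margins) - y) * (margins > 0))
    w -= lr * grad

print(f"parameter error ||w - w_true||^2: {np.sum((w - w_true) ** 2):.4f}")
```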