Towards a statistical theory of data selection under weak supervision
- URL: http://arxiv.org/abs/2309.14563v2
- Date: Wed, 4 Oct 2023 04:39:34 GMT
- Title: Towards a statistical theory of data selection under weak supervision
- Authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon
- Abstract summary: Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning.
We assume we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, and access to a `surrogate model' that can predict labels $y_i$ better than random guessing.
- Score: 7.540077751816086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a sample of size $N$, it is often useful to select a subsample of
smaller size $n<N$ to be used for statistical estimation or learning. Such a
data selection step is useful to reduce the requirements of data labeling and
the computational complexity of learning. We assume we are given $N$ unlabeled
samples $\{{\boldsymbol x}_i\}_{i\le N}$, and access to a
`surrogate model' that can predict labels $y_i$ better than random guessing.
Our goal is to select a subset of the samples, to be denoted by $\{{\boldsymbol
x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and we
use them to train a model via regularized empirical risk minimization.
By using a mixture of numerical experiments on real and synthetic data, and
mathematical derivations under low- and high-dimensional asymptotics, we show
that: $(i)$~Data selection can be very effective, in particular beating
training on the full sample in some cases; $(ii)$~Certain popular choices in
data selection methods (e.g. unbiased reweighted subsampling, or influence
function-based subsampling) can be substantially suboptimal.
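As a concrete illustration of the pipeline described in the abstract (surrogate-guided subset selection followed by regularized empirical risk minimization on the acquired labels), here is a minimal sketch on synthetic data. The uncertainty-based selection rule and all names in it are illustrative assumptions, not the paper's prescribed scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool of N unlabeled points x_i (labels exist but are only "acquired" for the subset).
N, d, n = 5000, 20, 500
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (X @ w_true + 0.5 * rng.normal(size=N) > 0).astype(int)

# Surrogate model: predicts labels better than random guessing (here, a noisy
# version of the true score; in practice a small model trained on side data).
surrogate_score = X @ (w_true + rng.normal(scale=1.0, size=d))
surrogate_proba = 1.0 / (1.0 + np.exp(-surrogate_score))   # surrogate estimate of P(y=1 | x)

# One possible selection rule (illustrative): keep the n points on which the
# surrogate is least confident, i.e. the hardest / most informative examples.
uncertainty = -np.abs(surrogate_proba - 0.5)
G = np.argsort(uncertainty)[-n:]          # indices of the selected subset, |G| = n

# Acquire labels only for the selected subset and fit a regularized ERM
# (L2-regularized logistic regression) on it.
model_subset = LogisticRegression(C=1.0).fit(X[G], y[G])

# Baseline for comparison: training on the full labeled sample.
model_full = LogisticRegression(C=1.0).fit(X, y)

X_test = rng.normal(size=(10000, d))
y_test = (X_test @ w_true + 0.5 * rng.normal(size=10000) > 0).astype(int)
print("subset accuracy:", model_subset.score(X_test, y_test))
print("full   accuracy:", model_full.score(X_test, y_test))
```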
Related papers
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning.
We propose a new learning paradigm that integrates both paired and unpaired data.
Our approach also connects intriguingly with inverse entropic optimal transport (OT).
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
- Better Locally Private Sparse Estimation Given Multiple Samples Per User [2.9562742331218725]
We investigate user-level locally differentially private sparse linear regression.
We show that with $n$ users each contributing $m$ samples, the linear dependence on the dimension $d$ can be eliminated.
We propose a framework that first selects candidate variables and then conducts estimation in the narrowed low-dimensional space.
arXiv Detail & Related papers (2024-08-08T08:47:20Z)
- Agnostic Active Learning of Single Index Models with Linear Sample Complexity [27.065175036001246]
We study active learning methods for single index models of the form $F(\mathbf x) = f(\langle \mathbf w, \mathbf x\rangle)$.
arXiv Detail & Related papers (2024-05-15T13:11:28Z)
- Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond [28.651041302245538]
We present a new data selection approach based on $k$-means clustering and sampling sensitivity.
We show how it can be applied to linear regression, leading to a new sampling strategy that, surprisingly, matches the performance of leverage score sampling.
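A rough sketch of what a $k$-means-based sensitivity-sampling step might look like is given below. The sensitivity formula (distance to the assigned center plus a uniform per-cluster term) is a common coreset-style choice and is an assumption, not necessarily the paper's exact definition.

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(X, n_select, k=10, seed=0):
    """Select n_select rows of X by k-means-based sensitivity sampling (sketch)."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1) ** 2
    cluster_size = np.bincount(km.labels_, minlength=k)[km.labels_]

    # Coreset-style sensitivity proxy: points far from their center, or in small
    # clusters, are more "sensitive" and get a larger sampling probability.
    sens = dist / max(dist.sum(), 1e-12) + 1.0 / (k * cluster_size)
    prob = sens / sens.sum()
    idx = rng.choice(len(X), size=n_select, replace=False, p=prob)
    # Approximate importance weights (ignoring the without-replacement correction).
    weights = 1.0 / (n_select * prob[idx])
    return idx, weights

X = np.random.default_rng(1).normal(size=(2000, 32))
idx, w = sensitivity_sample(X, n_select=200)
```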
arXiv Detail & Related papers (2024-02-27T09:03:43Z)
- Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression [4.396860522241307]
We show that an efficient learning algorithm for sparse linear regression can be used to solve sparse PCA problems with a negative spike.
We complement our reduction with low-degree and statistical query lower bounds for the sparse problems from which we reduce.
arXiv Detail & Related papers (2024-02-21T19:55:01Z)
- Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i\rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of $1.3\%$ on 38 evaluation sets for the noisy dataset DataComp and $2.5\%$ on VTAB for the high-quality dataset CC12M.
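The score $\langle \Sigma_{\text{test}}, \Sigma_i\rangle$ is a trace inner product between a target-distribution second-moment matrix and a per-sample matrix. A minimal sketch of how such a score could be computed from embeddings follows; the choice of $\Sigma_i$ as a symmetrized image-text outer product is an assumption about the setup, not necessarily the paper's exact definition.

```python
import numpy as np

def variance_alignment_scores(img_emb, txt_emb, test_emb):
    """Sketch of a VAS-style score: <Sigma_test, Sigma_i> as a trace inner product.

    img_emb, txt_emb : (N, d) per-sample image / text embeddings (assumed L2-normalized)
    test_emb         : (M, d) embeddings drawn from the target / test distribution
    """
    # Second-moment matrix of the target distribution.
    sigma_test = test_emb.T @ test_emb / len(test_emb)          # (d, d), symmetric
    # With Sigma_i = (u v^T + v u^T) / 2 for u = img_i, v = txt_i and symmetric
    # Sigma_test, the trace inner product reduces to u^T Sigma_test v, computed
    # here without materializing each Sigma_i:
    scores = np.einsum("nd,de,ne->n", img_emb, sigma_test, txt_emb)
    return scores

rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(1000, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
ref = rng.normal(size=(500, 64));  ref /= np.linalg.norm(ref, axis=1, keepdims=True)
vas = variance_alignment_scores(img, txt, ref)
top = np.argsort(vas)[-200:]   # keep the 200 highest-scoring samples
```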
arXiv Detail & Related papers (2024-02-03T06:29:04Z)
- Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features [119.22672589020394]
We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features.
Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data.
arXiv Detail & Related papers (2023-02-10T18:58:03Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol\mu \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol\mu)$.
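A coarse sketch of the kind of adaptive allocation such an algorithm might use is shown below: each new sample goes to the coordinate whose current uncertainty, weighted by the local sensitivity $|\partial f/\partial \mu_i|$, contributes most to the error in $f$. The actual algorithm and its stopping rule are more refined; treat this purely as an illustration, with all names being placeholders.

```python
import numpy as np

def adaptive_estimate(f, grad_f, noisy_oracle, n, budget, sigma=1.0, seed=0):
    """Sketch: estimate f(mu) when each coordinate mu_i is observed via a noisy oracle."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(n)
    counts = np.zeros(n, dtype=int)
    # Initialize with one sample per coordinate.
    for i in range(n):
        sums[i] += noisy_oracle(i); counts[i] += 1
    for _ in range(budget - n):
        mu_hat = sums / counts
        width = sigma / np.sqrt(counts)              # per-coordinate uncertainty
        importance = np.abs(grad_f(mu_hat)) * width  # estimated contribution to the error in f
        i = int(np.argmax(importance))
        sums[i] += noisy_oracle(i); counts[i] += 1
    return f(sums / counts)

# Example: f(mu) = max_i mu_i, with Gaussian noise on each coordinate query.
mu = np.array([0.1, 0.5, 0.9, 0.3])
rng = np.random.default_rng(1)
oracle = lambda i: mu[i] + 0.2 * rng.normal()
grad_max = lambda m: (m == m.max()).astype(float)    # (sub)gradient of the max
print(adaptive_estimate(np.max, grad_max, oracle, n=4, budget=400))
```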
arXiv Detail & Related papers (2022-03-18T18:50:52Z)
- Mediated Uncoupled Learning: Learning Functions without Direct Input-output Correspondences [80.95776331769899]
We consider the task of predicting $Y$ from $X$ when no paired data linking them is available, only a dataset $S_X$ pairing $X$ with a mediating variable $U$ and a dataset $S_Y$ pairing $U$ with $Y$.
A naive approach is to predict $U$ from $X$ using $S_X$ and then $Y$ from $U$ using $S_Y$.
We propose a new method that avoids predicting $U$ but directly learns $Y = f(X)$ by training $f(X)$ with $S_X$ to predict $h(U)$.
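One simple instantiation of that idea is sketched below: fit $h$ on $S_Y$, then fit $f$ on $S_X$ against the targets $h(u_i)$, so that no $(X, Y)$ pair is ever used. The ridge regressors and the synthetic data model are assumptions for illustration; the paper's actual objective couples $f$ and $h$ more carefully.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Ground truth: X -> U -> Y, but we never observe (X, Y) pairs.
n_sx, n_sy, d = 2000, 2000, 5
A = rng.normal(size=(d, d)); b = rng.normal(size=d)

X_sx = rng.normal(size=(n_sx, d))
U_sx = X_sx @ A + 0.1 * rng.normal(size=(n_sx, d))       # S_X = {(x_i, u_i)}
U_sy = rng.normal(size=(n_sy, d))
Y_sy = U_sy @ b + 0.1 * rng.normal(size=n_sy)            # S_Y = {(u_j, y_j)}

# Step 1 (assumed instantiation): fit h on S_Y to predict Y from U.
h = Ridge(alpha=1.0).fit(U_sy, Y_sy)

# Step 2: fit f on S_X to predict h(U) from X -- no (X, Y) pairs ever used.
f = Ridge(alpha=1.0).fit(X_sx, h.predict(U_sx))

# Evaluate f as a predictor of Y from X on fresh data.
X_test = rng.normal(size=(1000, d))
Y_test = (X_test @ A) @ b
print("MSE of f(X) vs Y:", np.mean((f.predict(X_test) - Y_test) ** 2))
```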
arXiv Detail & Related papers (2021-07-16T22:13:29Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
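A minimal sketch of the setting: gradient descent on the empirical square loss of a single ReLU neuron $x \mapsto \mathrm{relu}(\langle w, x\rangle)$. The learning rate, sample size, and data model below are placeholder assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)

X = rng.normal(size=(n, d))
y = np.maximum(X @ w_star, 0.0) + 0.1 * rng.normal(size=n)   # agnostic: labels are noisy

# Gradient descent on the empirical square loss of a single ReLU neuron.
w = 0.1 * rng.normal(size=d)   # small random init (w = 0 has zero subgradient)
lr = 0.1
for _ in range(500):
    pred = np.maximum(X @ w, 0.0)
    grad = X.T @ ((pred - y) * (X @ w > 0)) / n   # (sub)gradient of 0.5 * mean squared error
    w -= lr * grad

# Compare against the noiseless teacher on fresh inputs.
X_test = rng.normal(size=(5000, d))
err = np.mean((np.maximum(X_test @ w, 0) - np.maximum(X_test @ w_star, 0)) ** 2)
print("test squared error vs teacher:", err)
```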
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.