Towards a statistical theory of data selection under weak supervision
- URL: http://arxiv.org/abs/2309.14563v2
- Date: Wed, 4 Oct 2023 04:39:34 GMT
- Title: Towards a statistical theory of data selection under weak supervision
- Authors: Germain Kolossov, Andrea Montanari, Pulkit Tandon
- Abstract summary: Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning.
We assume we are given $N$ unlabeled samples $\{{\boldsymbol x}_i\}_{i\le N}$, and access to a `surrogate model' that can predict labels $y_i$ better than random guessing.
- Score: 7.540077751816086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a sample of size $N$, it is often useful to select a subsample of
smaller size $n<N$ to be used for statistical estimation or learning. Such a
data selection step is useful to reduce the requirements of data labeling and
the computational complexity of learning. We assume we are given $N$ unlabeled
samples $\{{\boldsymbol x}_i\}_{i\le N}$, and access to a
`surrogate model' that can predict labels $y_i$ better than random guessing.
Our goal is to select a subset of the samples, to be denoted by $\{{\boldsymbol
x}_i\}_{i\in G}$, of size $|G|=n<N$. We then acquire labels for this set and we
use them to train a model via regularized empirical risk minimization.
By using a mixture of numerical experiments on real and synthetic data, and
mathematical derivations under low- and high-dimensional asymptotics, we show
that: $(i)$~Data selection can be very effective, in particular beating
training on the full sample in some cases; $(ii)$~Certain popular choices in
data selection methods (e.g. unbiased reweighted subsampling, or influence
function-based subsampling) can be substantially suboptimal.
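As a concrete illustration of the pipeline described in the abstract (surrogate-guided subset selection followed by regularized empirical risk minimization on the acquired labels), here is a minimal sketch on synthetic data. The uncertainty-based selection rule and all names in it are illustrative assumptions, not the paper's prescribed scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool of N unlabeled points x_i (labels exist but are only "acquired" for the subset).
N, d, n = 5000, 20, 500
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = (X @ w_true + 0.5 * rng.normal(size=N) > 0).astype(int)

# Surrogate model: predicts labels better than random guessing (here, a noisy
# version of the true score; in practice a small model trained on side data).
surrogate_score = X @ (w_true + rng.normal(scale=1.0, size=d))
surrogate_proba = 1.0 / (1.0 + np.exp(-surrogate_score))   # surrogate estimate of P(y=1 | x)

# One possible selection rule (illustrative): keep the n points on which the
# surrogate is least confident, i.e. the hardest / most informative examples.
uncertainty = -np.abs(surrogate_proba - 0.5)
G = np.argsort(uncertainty)[-n:]          # indices of the selected subset, |G| = n

# Acquire labels only for the selected subset and fit a regularized ERM
# (L2-regularized logistic regression) on it.
model_subset = LogisticRegression(C=1.0).fit(X[G], y[G])

# Baseline for comparison: training on the full labeled sample.
model_full = LogisticRegression(C=1.0).fit(X, y)

X_test = rng.normal(size=(10000, d))
y_test = (X_test @ w_true + 0.5 * rng.normal(size=10000) > 0).astype(int)
print("subset accuracy:", model_subset.score(X_test, y_test))
print("full   accuracy:", model_full.score(X_test, y_test))
```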
Related papers
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning.
We propose a new learning paradigm that integrates both paired and unpaired data.
Our approach also connects intriguingly with inverse entropic optimal transport (OT).
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
- Better Locally Private Sparse Estimation Given Multiple Samples Per User [2.9562742331218725]
We investigate user-level locally differentially private sparse linear regression.
We show that with $n$ users each contributing $m$ samples, the linear dependence on the dimension $d$ can be eliminated.
We propose a framework that first selects candidate variables and then conducts estimation in the narrowed low-dimensional space.
arXiv Detail & Related papers (2024-08-08T08:47:20Z)
- Agnostic Active Learning of Single Index Models with Linear Sample Complexity [27.065175036001246]
We study active learning methods for single index models of the form $F(\mathbf x) = f(\langle \mathbf w, \mathbf x\rangle)$.
arXiv Detail & Related papers (2024-05-15T13:11:28Z)
- Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond [28.651041302245538]
We present a new data selection approach based on $k$-means clustering and sampling sensitivity.
We show how it can be applied to linear regression, leading to a new sampling strategy that, surprisingly, matches the performance of leverage score sampling.
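A rough sketch of what a $k$-means-based sensitivity-sampling step might look like is given below. The sensitivity formula (distance to the assigned center plus a uniform per-cluster term) is a common coreset-style choice and is an assumption, not necessarily the paper's exact definition.

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(X, n_select, k=10, seed=0):
    """Select n_select rows of X by k-means-based sensitivity sampling (sketch)."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1) ** 2
    cluster_size = np.bincount(km.labels_, minlength=k)[km.labels_]

    # Coreset-style sensitivity proxy: points far from their center, or in small
    # clusters, are more "sensitive" and get a larger sampling probability.
    sens = dist / max(dist.sum(), 1e-12) + 1.0 / (k * cluster_size)
    prob = sens / sens.sum()
    idx = rng.choice(len(X), size=n_select, replace=False, p=prob)
    # Approximate importance weights (ignoring the without-replacement correction).
    weights = 1.0 / (n_select * prob[idx])
    return idx, weights

X = np.random.default_rng(1).normal(size=(2000, 32))
idx, w = sensitivity_sample(X, n_select=200)
```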
arXiv Detail & Related papers (2024-02-27T09:03:43Z)
- Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression [4.396860522241307]
We show that an efficient learning algorithm for sparse linear regression can be used to solve sparse PCA problems with a negative spike.
We complement our reduction with low-degree and statistical query lower bounds for the sparse problems from which we reduce.
arXiv Detail & Related papers (2024-02-21T19:55:01Z)
- Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i\rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of $1.3\%$ on 38 evaluation sets for the noisy dataset DataComp and $2.5\%$ on VTAB for the high-quality dataset CC12M.
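The score $\langle \Sigma_{\text{test}}, \Sigma_i\rangle$ is a trace inner product between a target-distribution second-moment matrix and a per-sample matrix. A minimal sketch of how such a score could be computed from embeddings follows; the choice of $\Sigma_i$ as a symmetrized image-text outer product is an assumption about the setup, not necessarily the paper's exact definition.

```python
import numpy as np

def variance_alignment_scores(img_emb, txt_emb, test_emb):
    """Sketch of a VAS-style score: <Sigma_test, Sigma_i> as a trace inner product.

    img_emb, txt_emb : (N, d) per-sample image / text embeddings (assumed L2-normalized)
    test_emb         : (M, d) embeddings drawn from the target / test distribution
    """
    # Second-moment matrix of the target distribution.
    sigma_test = test_emb.T @ test_emb / len(test_emb)          # (d, d), symmetric
    # With Sigma_i = (u v^T + v u^T) / 2 for u = img_i, v = txt_i and symmetric
    # Sigma_test, the trace inner product reduces to u^T Sigma_test v, computed
    # here without materializing each Sigma_i:
    scores = np.einsum("nd,de,ne->n", img_emb, sigma_test, txt_emb)
    return scores

rng = np.random.default_rng(0)
img = rng.normal(size=(1000, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(1000, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
ref = rng.normal(size=(500, 64));  ref /= np.linalg.norm(ref, axis=1, keepdims=True)
vas = variance_alignment_scores(img, txt, ref)
top = np.argsort(vas)[-200:]   # keep the 200 highest-scoring samples
```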
arXiv Detail & Related papers (2024-02-03T06:29:04Z)
- Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features [119.22672589020394]
We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features.
Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data.
arXiv Detail & Related papers (2023-02-10T18:58:03Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol\mu \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol\mu)$.
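A coarse sketch of the kind of adaptive allocation such an algorithm might use is shown below: each new sample goes to the coordinate whose current uncertainty, weighted by the local sensitivity $|\partial f/\partial \mu_i|$, contributes most to the error in $f$. The actual algorithm and its stopping rule are more refined; treat this purely as an illustration, with all names being placeholders.

```python
import numpy as np

def adaptive_estimate(f, grad_f, noisy_oracle, n, budget, sigma=1.0, seed=0):
    """Sketch: estimate f(mu) when each coordinate mu_i is observed via a noisy oracle."""
    rng = np.random.default_rng(seed)
    sums = np.zeros(n)
    counts = np.zeros(n, dtype=int)
    # Initialize with one sample per coordinate.
    for i in range(n):
        sums[i] += noisy_oracle(i); counts[i] += 1
    for _ in range(budget - n):
        mu_hat = sums / counts
        width = sigma / np.sqrt(counts)              # per-coordinate uncertainty
        importance = np.abs(grad_f(mu_hat)) * width  # estimated contribution to the error in f
        i = int(np.argmax(importance))
        sums[i] += noisy_oracle(i); counts[i] += 1
    return f(sums / counts)

# Example: f(mu) = max_i mu_i, with Gaussian noise on each coordinate query.
mu = np.array([0.1, 0.5, 0.9, 0.3])
rng = np.random.default_rng(1)
oracle = lambda i: mu[i] + 0.2 * rng.normal()
grad_max = lambda m: (m == m.max()).astype(float)    # (sub)gradient of the max
print(adaptive_estimate(np.max, grad_max, oracle, n=4, budget=400))
```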
arXiv Detail & Related papers (2022-03-18T18:50:52Z)
- Mediated Uncoupled Learning: Learning Functions without Direct Input-output Correspondences [80.95776331769899]
We consider the task of predicting $Y$ from $X$ when no paired data linking them is available, only a dataset $S_X$ pairing $X$ with a mediating variable $U$ and a dataset $S_Y$ pairing $U$ with $Y$.
A naive approach is to predict $U$ from $X$ using $S_X$ and then $Y$ from $U$ using $S_Y$.
We propose a new method that avoids predicting $U$ but directly learns $Y = f(X)$ by training $f(X)$ with $S_X$ to predict $h(U)$.
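One simple instantiation of that idea is sketched below: fit $h$ on $S_Y$, then fit $f$ on $S_X$ against the targets $h(u_i)$, so that no $(X, Y)$ pair is ever used. The ridge regressors and the synthetic data model are assumptions for illustration; the paper's actual objective couples $f$ and $h$ more carefully.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Ground truth: X -> U -> Y, but we never observe (X, Y) pairs.
n_sx, n_sy, d = 2000, 2000, 5
A = rng.normal(size=(d, d)); b = rng.normal(size=d)

X_sx = rng.normal(size=(n_sx, d))
U_sx = X_sx @ A + 0.1 * rng.normal(size=(n_sx, d))       # S_X = {(x_i, u_i)}
U_sy = rng.normal(size=(n_sy, d))
Y_sy = U_sy @ b + 0.1 * rng.normal(size=n_sy)            # S_Y = {(u_j, y_j)}

# Step 1 (assumed instantiation): fit h on S_Y to predict Y from U.
h = Ridge(alpha=1.0).fit(U_sy, Y_sy)

# Step 2: fit f on S_X to predict h(U) from X -- no (X, Y) pairs ever used.
f = Ridge(alpha=1.0).fit(X_sx, h.predict(U_sx))

# Evaluate f as a predictor of Y from X on fresh data.
X_test = rng.normal(size=(1000, d))
Y_test = (X_test @ A) @ b
print("MSE of f(X) vs Y:", np.mean((f.predict(X_test) - Y_test) ** 2))
```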
arXiv Detail & Related papers (2021-07-16T22:13:29Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
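A minimal sketch of the setting: gradient descent on the empirical square loss of a single ReLU neuron $x \mapsto \mathrm{relu}(\langle w, x\rangle)$. The learning rate, sample size, and data model below are placeholder assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)

X = rng.normal(size=(n, d))
y = np.maximum(X @ w_star, 0.0) + 0.1 * rng.normal(size=n)   # agnostic: labels are noisy

# Gradient descent on the empirical square loss of a single ReLU neuron.
w = 0.1 * rng.normal(size=d)   # small random init (w = 0 has zero subgradient)
lr = 0.1
for _ in range(500):
    pred = np.maximum(X @ w, 0.0)
    grad = X.T @ ((pred - y) * (X @ w > 0)) / n   # (sub)gradient of 0.5 * mean squared error
    w -= lr * grad

# Compare against the noiseless teacher on fresh inputs.
X_test = rng.normal(size=(5000, d))
err = np.mean((np.maximum(X_test @ w, 0) - np.maximum(X_test @ w_star, 0)) ** 2)
print("test squared error vs teacher:", err)
```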
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.