Data-Efficient Learning via Clustering-Based Sensitivity Sampling:
Foundation Models and Beyond
- URL: http://arxiv.org/abs/2402.17327v1
- Date: Tue, 27 Feb 2024 09:03:43 GMT
- Title: Data-Efficient Learning via Clustering-Based Sensitivity Sampling:
Foundation Models and Beyond
- Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome,
Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder
- Abstract summary: We present a new data selection approach based on $k$-means clustering and sensitivity sampling.
We show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling.
- Score: 28.651041302245538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the data selection problem, whose aim is to select a small
representative subset of data that can be used to efficiently train a machine
learning model. We present a new data selection approach based on $k$-means
clustering and sensitivity sampling. Assuming access to an embedding
representation of the data with respect to which the model loss is H\"older
continuous, our approach provably allows selecting a set of ``typical'' $k +
1/\varepsilon^2$ elements whose average loss corresponds to the average loss of
the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an
additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means
cost for the input embeddings and $\lambda$ is the H\"older constant.
We furthermore demonstrate the performance and scalability of our approach on
fine-tuning foundation models and show that it outperforms state-of-the-art
methods. We also show how it can be applied to linear regression, leading to a
new sampling strategy that surprisingly matches the performance of leverage
score sampling, while being conceptually simpler and more scalable.
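To make the selection recipe above concrete, here is a minimal Python sketch of the general idea: run $k$-means on the embeddings, score each point by its share of the clustering cost plus a uniform term, sample proportionally to the scores, and attach inverse-probability weights. It mirrors the structure of the guarantee rather than the authors' exact procedure, and all function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, k=10, sample_size=1000, seed=0):
    """Illustrative clustering-based sensitivity sampling, not the paper's
    exact algorithm: score each point by its share of the k-means cost plus
    a uniform term, sample proportionally, and return inverse-probability
    weights for the selected points."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]

    # Cluster the embeddings; each point's squared distance to its center
    # is its contribution to the k-means cost Phi_k.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    cost = dists ** 2

    # Sensitivity-style score: cost share plus a uniform 1/n term.
    scores = cost / (cost.sum() + 1e-12) + 1.0 / n
    probs = scores / scores.sum()

    # Sample proportionally to the scores; the inverse-probability weights
    # make weighted averages over the sample unbiased for full-data averages.
    idx = rng.choice(n, size=sample_size, replace=True, p=probs)
    weights = 1.0 / (sample_size * probs[idx])
    return idx, weights

# Hypothetical usage, assuming `embed` and `per_example_loss` helpers exist:
#   losses = per_example_loss(model, data)
#   idx, w = sensitivity_sample(embed(data), k=64, sample_size=2048)
#   approx_avg_loss = np.sum(w * losses[idx]) / len(losses)
```

The inverse-probability weights are what let the weighted average loss of the small sample stand in for the average loss of the full dataset, which is the quantity the $(1\pm\varepsilon)$ guarantee controls.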
Related papers
- Turnstile $\ell_p$ leverage score sampling with applications [56.403488578703865]
We develop a novel algorithm for sampling rows $a_i$ of a matrix $A \in \mathbb{R}^{n \times d}$, proportional to their $\ell_p$ norm, when $A$ is presented in a turnstile data stream.
Our algorithm not only returns the set of sampled row indexes, it also returns slightly perturbed rows $\tilde{a}_i \approx a_i$, and approximates their sampling probabilities up to $\varepsilon$ relative error.
For logistic regression, our framework yields the first algorithm that achieves a ...
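The target distribution of that sampler is easy to state offline; the paper's contribution is realizing it approximately in a single turnstile-stream pass with small space. A hedged offline sketch of the distribution itself, with illustrative names:

```python
import numpy as np

def lp_norm_row_sample(A, p=1.0, m=100, seed=0):
    """Offline sketch only: sample m row indices of A with probability
    proportional to the l_p norm of each row, returning the indices and
    their sampling probabilities. The streaming algorithm referenced above
    approximates this distribution without storing A explicitly."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, ord=p, axis=1)   # ||a_i||_p for each row
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=probs)
    return idx, probs[idx]
```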
arXiv Detail & Related papers (2024-06-01T07:33:41Z)
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
- Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of 1.3% on 38 evaluation sets for the noisy dataset DataComp and 2.5% on VTAB for the high-quality dataset CC12M.
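The VAS expression above is an inner product between covariance-like matrices; if $\Sigma_i$ is taken to be the outer product of candidate $i$'s embedding, it reduces to a quadratic form, as in this rough sketch (the embedding inputs and names are assumptions, not the paper's API):

```python
import numpy as np

def variance_alignment_scores(test_embeddings, candidate_embeddings):
    """Illustrative VAS computation: score each candidate i by
    <Sigma_test, Sigma_i>, with Sigma_test the second-moment matrix of
    target-domain embeddings and Sigma_i approximated by the outer product
    of candidate i's embedding."""
    # Sigma_test: (d, d) second-moment matrix of the target embeddings.
    sigma_test = test_embeddings.T @ test_embeddings / len(test_embeddings)
    # <Sigma_test, x_i x_i^T> equals the quadratic form x_i^T Sigma_test x_i.
    return np.einsum("id,de,ie->i", candidate_embeddings, sigma_test,
                     candidate_embeddings)
```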
arXiv Detail & Related papers (2024-02-03T06:29:04Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Improved Active Learning via Dependent Leverage Score Sampling [8.400581768343804]
We show how to obtain improved active learning methods in the agnostic (adversarial noise) setting.
We propose an easily implemented method based on the pivotal sampling algorithm.
In comparison to independent sampling, our method reduces the number of samples needed to reach a given target accuracy by up to 50%.
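The pivotal sampling algorithm that this builds on can be sketched in its generic textbook form; the paper's improvements come from the ordering and leverage-score-based inclusion probabilities fed into it, which the sketch below does not capture.

```python
import numpy as np

def pivotal_sample(pi, seed=0):
    """Generic sequential pivotal sampling: turn inclusion probabilities pi
    (summing to an integer) into a 0/1 sample while preserving each unit's
    marginal inclusion probability. A textbook version, not the dependent
    leverage-score variant from the referenced paper."""
    rng = np.random.default_rng(seed)
    p = np.asarray(pi, dtype=float).copy()
    active = [i for i in range(len(p)) if 0.0 < p[i] < 1.0]
    while len(active) >= 2:
        i, j = active[0], active[1]
        s = p[i] + p[j]
        if s < 1.0:
            # One unit drops to 0, the other absorbs the combined mass.
            if rng.random() < p[j] / s:
                p[i], p[j] = 0.0, s
            else:
                p[i], p[j] = s, 0.0
        else:
            # One unit is fixed at 1, the other keeps the remainder.
            if rng.random() < (1.0 - p[j]) / (2.0 - s):
                p[i], p[j] = 1.0, s - 1.0
            else:
                p[i], p[j] = s - 1.0, 1.0
        active = [t for t in active if 0.0 < p[t] < 1.0]
    return np.flatnonzero(p > 0.5)  # indices whose final probability is 1

# e.g. pivotal_sample(np.full(10, 0.3)) selects exactly 3 of the 10 units.
```

The pairwise steps introduce negative correlation between the paired units, which is what dependent sampling exploits relative to independent sampling.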
arXiv Detail & Related papers (2023-10-08T01:51:30Z)
- Towards a statistical theory of data selection under weak supervision [7.540077751816086]
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n < N$ to be used for statistical estimation or learning.
We assume to be given $N$ unlabeled samples $\{\boldsymbol{x}_i\}_{i \le N}$, and to be given access to a `surrogate model' that can predict labels $y_i$ better than random guessing.
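One simple instance of the kind of surrogate-based selection rule studied in that setting: keep points where the surrogate is least confident, optionally reweighting by inverse selection probability. This is a generic sketch under those assumptions, not the paper's specific scheme, and all names are illustrative.

```python
import numpy as np

def surrogate_margin_selection(surrogate_probs, n_keep, reweight=True, seed=0):
    """Select a subsample using a surrogate model's predicted probability of
    the positive class: low-margin (uncertain) points get higher selection
    probability. Returns selected indices and optional inverse-probability
    weights. Illustrative only."""
    rng = np.random.default_rng(seed)
    margin = np.abs(surrogate_probs - 0.5)            # small margin = uncertain
    scores = 1.0 - margin                             # favor uncertain points
    probs = np.minimum(1.0, n_keep * scores / scores.sum())
    keep = rng.random(len(probs)) < probs             # independent Bernoulli draws
    idx = np.flatnonzero(keep)
    weights = 1.0 / probs[idx] if reweight else np.ones(len(idx))
    return idx, weights
```

Whether and how to reweight the retained points is itself part of what a statistical theory of data selection has to settle.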
arXiv Detail & Related papers (2023-09-25T22:23:27Z)
- A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data [0.0]
We introduce an innovative algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of this model.
The inferential and predictive efficacy of this approach is further illustrated through its application to student scoring and protein expression data.
arXiv Detail & Related papers (2023-02-06T23:34:51Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained on the full, unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
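A linear datamodel of the sort described can be sketched as a regression from binary training-subset masks to a chosen model output (e.g. the margin on one test example); ridge regression stands in below for the sparse regression typically used, and all names are illustrative.

```python
import numpy as np

def fit_linear_datamodel(masks, outputs, l2=1e-3):
    """Fit a linear datamodel: predict a scalar model output from the 0/1
    indicator of which training points were included in each training run.
    masks has shape (num_trained_models, num_train_points); outputs has
    shape (num_trained_models,)."""
    X = np.hstack([masks.astype(float), np.ones((masks.shape[0], 1))])  # add bias
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + l2 I)^{-1} X^T y.
    theta = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ outputs)
    return theta[:-1], theta[-1]   # per-training-point weights, bias term

# Each learned weight estimates how much including that training point moves
# the chosen output, which is how datamodels attribute predictions to data.
```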
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
- Optimal Sampling Gaps for Adaptive Submodular Maximization [28.24164217929491]
We study the performance loss caused by probability sampling in the context of adaptive submodular maximization.
We show that the property of policywise submodularity can be found in a wide range of real-world applications.
arXiv Detail & Related papers (2021-04-05T03:21:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.