Data-Efficient Learning via Clustering-Based Sensitivity Sampling:
Foundation Models and Beyond
- URL: http://arxiv.org/abs/2402.17327v1
- Date: Tue, 27 Feb 2024 09:03:43 GMT
- Title: Data-Efficient Learning via Clustering-Based Sensitivity Sampling:
Foundation Models and Beyond
- Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome,
Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder
- Abstract summary: We present a new data selection approach based on $k$-means clustering and sensitivity sampling.
We show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling.
- Score: 28.651041302245538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the data selection problem, whose aim is to select a small
representative subset of data that can be used to efficiently train a machine
learning model. We present a new data selection approach based on $k$-means
clustering and sensitivity sampling. Assuming access to an embedding
representation of the data with respect to which the model loss is H\"older
continuous, our approach provably allows selecting a set of ``typical'' $k +
1/\varepsilon^2$ elements whose average loss corresponds to the average loss of
the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an
additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means
cost for the input embeddings and $\lambda$ is the H\"older constant.
We furthermore demonstrate the performance and scalability of our approach on
fine-tuning foundation models and show that it outperforms state-of-the-art
methods. We also show how it can be applied to linear regression, leading to a
new sampling strategy that surprisingly matches the performance of leverage
score sampling, while being conceptually simpler and more scalable.
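To make the selection recipe above concrete, here is a minimal Python sketch of the general idea: run $k$-means on the embeddings, score each point by its share of the clustering cost plus a uniform term, sample proportionally to the scores, and attach inverse-probability weights. It mirrors the structure of the guarantee rather than the authors' exact procedure, and all function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def sensitivity_sample(embeddings, k=10, sample_size=1000, seed=0):
    """Illustrative clustering-based sensitivity sampling, not the paper's
    exact algorithm: score each point by its share of the k-means cost plus
    a uniform term, sample proportionally, and return inverse-probability
    weights for the selected points."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]

    # Cluster the embeddings; each point's squared distance to its center
    # is its contribution to the k-means cost Phi_k.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    cost = dists ** 2

    # Sensitivity-style score: cost share plus a uniform 1/n term.
    scores = cost / (cost.sum() + 1e-12) + 1.0 / n
    probs = scores / scores.sum()

    # Sample proportionally to the scores; the inverse-probability weights
    # make weighted averages over the sample unbiased for full-data averages.
    idx = rng.choice(n, size=sample_size, replace=True, p=probs)
    weights = 1.0 / (sample_size * probs[idx])
    return idx, weights

# Hypothetical usage, assuming `embed` and `per_example_loss` helpers exist:
#   losses = per_example_loss(model, data)
#   idx, w = sensitivity_sample(embed(data), k=64, sample_size=2048)
#   approx_avg_loss = np.sum(w * losses[idx]) / len(losses)
```

The inverse-probability weights are what let the weighted average loss of the small sample stand in for the average loss of the full dataset, which is the quantity the $(1\pm\varepsilon)$ guarantee controls.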
Related papers
- Turnstile $\ell_p$ leverage score sampling with applications [56.403488578703865]
We develop a novel algorithm for sampling rows $a_i$ of a matrix $A \in \mathbb{R}^{n \times d}$, proportional to their $\ell_p$ norm, when $A$ is presented in a turnstile data stream.
Our algorithm not only returns the set of sampled row indexes, it also returns slightly perturbed rows $\tilde{a}_i \approx a_i$, and approximates their sampling probabilities up to $\varepsilon$ relative error.
For logistic regression, our framework yields the first algorithm that achieves a ...
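The target distribution of that sampler is easy to state offline; the paper's contribution is realizing it approximately in a single turnstile-stream pass with small space. A hedged offline sketch of the distribution itself, with illustrative names:

```python
import numpy as np

def lp_norm_row_sample(A, p=1.0, m=100, seed=0):
    """Offline sketch only: sample m row indices of A with probability
    proportional to the l_p norm of each row, returning the indices and
    their sampling probabilities. The streaming algorithm referenced above
    approximates this distribution without storing A explicitly."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, ord=p, axis=1)   # ||a_i||_p for each row
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=probs)
    return idx, probs[idx]
```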
arXiv Detail & Related papers (2024-06-01T07:33:41Z)
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
- Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of 1.3% on 38 evaluation sets for the noisy dataset DataComp and 2.5% on VTAB for the high-quality dataset CC12M.
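The VAS expression above is an inner product between covariance-like matrices; if $\Sigma_i$ is taken to be the outer product of candidate $i$'s embedding, it reduces to a quadratic form, as in this rough sketch (the embedding inputs and names are assumptions, not the paper's API):

```python
import numpy as np

def variance_alignment_scores(test_embeddings, candidate_embeddings):
    """Illustrative VAS computation: score each candidate i by
    <Sigma_test, Sigma_i>, with Sigma_test the second-moment matrix of
    target-domain embeddings and Sigma_i approximated by the outer product
    of candidate i's embedding."""
    # Sigma_test: (d, d) second-moment matrix of the target embeddings.
    sigma_test = test_embeddings.T @ test_embeddings / len(test_embeddings)
    # <Sigma_test, x_i x_i^T> equals the quadratic form x_i^T Sigma_test x_i.
    return np.einsum("id,de,ie->i", candidate_embeddings, sigma_test,
                     candidate_embeddings)
```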
arXiv Detail & Related papers (2024-02-03T06:29:04Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Improved Active Learning via Dependent Leverage Score Sampling [8.400581768343804]
We show how to obtain improved active learning methods in the agnostic (adversarial noise) setting.
We propose an easily implemented method based on the pivotal sampling algorithm.
In comparison to independent sampling, our method reduces the number of samples needed to reach a given target accuracy by up to 50%.
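The pivotal sampling algorithm that this builds on can be sketched in its generic textbook form; the paper's improvements come from the ordering and leverage-score-based inclusion probabilities fed into it, which the sketch below does not capture.

```python
import numpy as np

def pivotal_sample(pi, seed=0):
    """Generic sequential pivotal sampling: turn inclusion probabilities pi
    (summing to an integer) into a 0/1 sample while preserving each unit's
    marginal inclusion probability. A textbook version, not the dependent
    leverage-score variant from the referenced paper."""
    rng = np.random.default_rng(seed)
    p = np.asarray(pi, dtype=float).copy()
    active = [i for i in range(len(p)) if 0.0 < p[i] < 1.0]
    while len(active) >= 2:
        i, j = active[0], active[1]
        s = p[i] + p[j]
        if s < 1.0:
            # One unit drops to 0, the other absorbs the combined mass.
            if rng.random() < p[j] / s:
                p[i], p[j] = 0.0, s
            else:
                p[i], p[j] = s, 0.0
        else:
            # One unit is fixed at 1, the other keeps the remainder.
            if rng.random() < (1.0 - p[j]) / (2.0 - s):
                p[i], p[j] = 1.0, s - 1.0
            else:
                p[i], p[j] = s - 1.0, 1.0
        active = [t for t in active if 0.0 < p[t] < 1.0]
    return np.flatnonzero(p > 0.5)  # indices whose final probability is 1

# e.g. pivotal_sample(np.full(10, 0.3)) selects exactly 3 of the 10 units.
```

The pairwise steps introduce negative correlation between the paired units, which is what dependent sampling exploits relative to independent sampling.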
arXiv Detail & Related papers (2023-10-08T01:51:30Z)
- Towards a statistical theory of data selection under weak supervision [7.540077751816086]
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n < N$ to be used for statistical estimation or learning.
We assume to be given $N$ unlabeled samples $\{\boldsymbol{x}_i\}_{i \le N}$, and to be given access to a `surrogate model' that can predict labels $y_i$ better than random guessing.
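One simple instance of the kind of surrogate-based selection rule studied in that setting: keep points where the surrogate is least confident, optionally reweighting by inverse selection probability. This is a generic sketch under those assumptions, not the paper's specific scheme, and all names are illustrative.

```python
import numpy as np

def surrogate_margin_selection(surrogate_probs, n_keep, reweight=True, seed=0):
    """Select a subsample using a surrogate model's predicted probability of
    the positive class: low-margin (uncertain) points get higher selection
    probability. Returns selected indices and optional inverse-probability
    weights. Illustrative only."""
    rng = np.random.default_rng(seed)
    margin = np.abs(surrogate_probs - 0.5)            # small margin = uncertain
    scores = 1.0 - margin                             # favor uncertain points
    probs = np.minimum(1.0, n_keep * scores / scores.sum())
    keep = rng.random(len(probs)) < probs             # independent Bernoulli draws
    idx = np.flatnonzero(keep)
    weights = 1.0 / probs[idx] if reweight else np.ones(len(idx))
    return idx, weights
```

Whether and how to reweight the retained points is itself part of what a statistical theory of data selection has to settle.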
arXiv Detail & Related papers (2023-09-25T22:23:27Z)
- A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data [0.0]
We introduce an innovative algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of this model.
The inferential and predictive efficacy of this approach is further illustrated through its application to student scoring and protein expression data.
arXiv Detail & Related papers (2023-02-06T23:34:51Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained on the full, unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% across four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z)
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
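A linear datamodel of the sort described can be sketched as a regression from binary training-subset masks to a chosen model output (e.g. the margin on one test example); ridge regression stands in below for the sparse regression typically used, and all names are illustrative.

```python
import numpy as np

def fit_linear_datamodel(masks, outputs, l2=1e-3):
    """Fit a linear datamodel: predict a scalar model output from the 0/1
    indicator of which training points were included in each training run.
    masks has shape (num_trained_models, num_train_points); outputs has
    shape (num_trained_models,)."""
    X = np.hstack([masks.astype(float), np.ones((masks.shape[0], 1))])  # add bias
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + l2 I)^{-1} X^T y.
    theta = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ outputs)
    return theta[:-1], theta[-1]   # per-training-point weights, bias term

# Each learned weight estimates how much including that training point moves
# the chosen output, which is how datamodels attribute predictions to data.
```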
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
- Optimal Sampling Gaps for Adaptive Submodular Maximization [28.24164217929491]
We study the performance loss caused by probability sampling in the context of adaptive submodular maximization.
We show that the property of policywise submodularity can be found in a wide range of real-world applications.
arXiv Detail & Related papers (2021-04-05T03:21:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.