SelectFormer: Private and Practical Data Selection for Transformers
- URL: http://arxiv.org/abs/2310.02373v4
- Date: Sat, 01 Mar 2025 23:44:44 GMT
- Title: SelectFormer: Private and Practical Data Selection for Transformers
- Authors: Xu Ouyang, Felix Xiaozhu Lin, Yangfeng Ji,
- Abstract summary: This paper makes it practical for the purpose of data selection to be done over Multi-Party Computation (MPC)<n>Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while only seeing around 0.20% accuracy degradation from training with the selected data.
- Score: 17.828547661524688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Critical to a free data market is $\textit{private data selection}$, i.e. the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process shall evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are three: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimension MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while only seeing around 0.20% accuracy degradation from training with the selected data.
Related papers
- PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [28.442470930703337]
PRISM is a training-free approach for efficient multimodal data selection.
It uses Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs.
It reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods.
arXiv Detail & Related papers (2025-02-17T18:43:41Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - Compute-Constrained Data Selection [77.06528009072967]
We find that many powerful data selection methods are almost never compute-optimal.
For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
arXiv Detail & Related papers (2024-10-21T17:11:21Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Data Selection via Optimal Control for Language Models [134.67665351539725]
This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage.
We introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions.
The benefits of PDS extend to 400B models trained on 10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws.
arXiv Detail & Related papers (2024-10-09T17:06:57Z) - CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning [0.0]
We propose CHG Shapley, which approximates the utility of each data subset on model accuracy during a single model training.
We employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data.
arXiv Detail & Related papers (2024-06-17T16:48:31Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs)
We find that Ask-LLM and Density sampling are the best methods in their respective categories.
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Cost-Effective Retraining of Machine Learning Models [2.9461360639852914]
It is important to retrain a machine learning (ML) model in order to maintain its performance as the data changes over time.
This creates a trade-off between retraining too frequently, which leads to unnecessary computing costs, and not retraining often enough.
We propose ML systems that make automated and cost-effective decisions about when to retrain an ML model.
arXiv Detail & Related papers (2023-10-06T13:02:29Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model
Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3times - 10 times$ faster and tune hyperparameters $20times - 75 times$ faster than full-dataset training or tuning without performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Estimating Task Completion Times for Network Rollouts using Statistical
Models within Partitioning-based Regression Methods [0.01841601464419306]
This paper proposes a data and Machine Learning-based forecasting solution for the Telecommunications network-rollout planning problem.
Using historical data of milestone completion times, a model needs to incorporate domain knowledge, handle noise and yet be interpretable to project managers.
This paper proposes partition-based regression models that incorporate data-driven statistical models within each partition, as a solution to the problem.
arXiv Detail & Related papers (2022-11-20T04:28:12Z) - A Marketplace for Trading AI Models based on Blockchain and Incentives
for IoT Data [24.847898465750667]
An emerging paradigm in Machine Learning (ML) is a federated approach where the learning model is delivered to a group of heterogeneous agents partially, allowing agents to train the model locally with their own data.
The problem of valuation of models, as well as the questions of incentives for collaborative training and trading of data/models, have received limited treatment in the literature.
In this paper, a new ecosystem of ML model trading over a trusted ML-based network is proposed. The buyer can acquire the model of interest from the ML market, and interested sellers spend local computations on their data to enhance that model's quality
arXiv Detail & Related papers (2021-12-06T08:52:42Z) - On Effective Scheduling of Model-based Reinforcement Learning [53.027698625496015]
We propose a framework named AutoMBPO to automatically schedule the real data ratio.
In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance.
arXiv Detail & Related papers (2021-11-16T15:24:59Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - A Scalable MIP-based Method for Learning Optimal Multivariate Decision
Trees [17.152864798265455]
We propose a novel MIP formulation, based on a 1-norm support vector machine model, to train a multivariate ODT for classification problems.
We provide cutting plane techniques that tighten the linear relaxation of the MIP formulation, in order to improve run times to reach optimality.
We demonstrate that our formulation outperforms its counterparts in the literature by an average of about 10% in terms of mean out-of-sample testing accuracy.
arXiv Detail & Related papers (2020-11-06T14:17:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.