Secure and Effective Data Appraisal for Machine Learning
- URL: http://arxiv.org/abs/2310.02373v3
- Date: Wed, 24 Jan 2024 22:02:53 GMT
- Title: Secure and Effective Data Appraisal for Machine Learning
- Authors: Xu Ouyang, Changhong Yang, Felix Xiaozhu Lin, Yangfeng Ji
- Abstract summary: This paper introduces an innovative approach that renders data selection practical.
The proposed method is assessed across an array of Transformer models and NLP/CV benchmarks.
In comparison to the direct MPC-based evaluation of the target model, our approach substantially reduces the time required, from thousands of hours to mere tens of hours, with only a nominal 0.20% dip in accuracy when training with the selected data.
- Score: 17.828547661524688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Essential for an unfettered data market is the ability to discreetly select
and evaluate training data before finalizing a transaction between the data
owner and model owner. To safeguard the privacy of both data and model, this
process involves scrutinizing the target model through Multi-Party Computation
(MPC). While prior research has posited that the MPC-based evaluation of
Transformer models is excessively resource-intensive, this paper introduces an
innovative approach that renders data selection practical. The contributions of
this study encompass three pivotal elements: (1) a groundbreaking pipeline for
confidential data selection using MPC, (2) replicating intricate
high-dimensional operations with simplified low-dimensional MLPs trained on a
limited subset of pertinent data, and (3) implementing MPC in a concurrent,
multi-phase manner. The proposed method is assessed across an array of
Transformer models and NLP/CV benchmarks. In comparison to the direct MPC-based
evaluation of the target model, our approach substantially reduces the time
required, from thousands of hours to mere tens of hours, with only a nominal
0.20% dip in accuracy when training with the selected data.
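The multi-phase idea in contribution (3), combined with the cheap proxy models of contribution (2), can be illustrated with a minimal plaintext sketch. It is not the authors' pipeline and performs no MPC: the function names, the linear proxy, and the use of prediction entropy as the appraisal score are all assumptions made for illustration. It only shows how a sequence of increasingly expensive scorers can narrow a candidate pool so that the costly scorer sees few points.

```python
import math
from typing import Callable, List, Sequence

# A scorer maps one data point (a feature vector) to an appraisal score.
Scorer = Callable[[Sequence[float]], float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def make_proxy_scorer(weights: Sequence[Sequence[float]]) -> Scorer:
    """Hypothetical proxy: a linear classifier (one weight row per class).

    The score is the entropy of its prediction, so uncertain points rank
    higher. Using entropy here is an assumption, not taken from the abstract.
    """
    def score(x: Sequence[float]) -> float:
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        return entropy(softmax(logits))
    return score


def multi_phase_select(
    candidates: Sequence[Sequence[float]],
    scorers: Sequence[Scorer],
    keep_counts: Sequence[int],
) -> List[int]:
    """Return indices of the points that survive every phase.

    Phase i scores only the survivors of phase i-1 with scorers[i] and
    keeps the keep_counts[i] highest-scoring ones.
    """
    if len(scorers) != len(keep_counts):
        raise ValueError("need exactly one keep count per scorer")
    survivors = list(range(len(candidates)))
    for scorer, keep in zip(scorers, keep_counts):
        ranked = sorted(survivors, key=lambda i: scorer(candidates[i]), reverse=True)
        survivors = ranked[:keep]
    return survivors


# Toy usage: a cheap 2-class proxy filters first, a costlier 3-class proxy decides.
candidates = [[0.1, 0.0], [2.0, -2.0], [0.0, 0.3], [1.0, 1.0]]
cheap = make_proxy_scorer([[1.0, 0.0], [0.0, 1.0]])
costly = make_proxy_scorer([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]])
selected = multi_phase_select(candidates, [cheap, costly], [3, 2])
print(selected)
```

In a secure setting each scorer call would run under MPC, which is why keeping the early phases cheap matters: the expensive evaluation is applied only to the points that earlier phases let through.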
Related papers
- CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning [0.0]
We propose CHG Shapley, which approximates the utility of each data subset on model accuracy during a single model training.
We employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data.
arXiv Detail & Related papers (2024-06-17T16:48:31Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$ to $10\times$ faster and tune hyperparameters $20\times$ to $75\times$ faster than full-dataset training or tuning without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z)
- An adaptive human-in-the-loop approach to emission detection of Additive Manufacturing processes and active learning with computer vision [76.72662577101988]
In-situ monitoring and process control in Additive Manufacturing (AM) allows the collection of large amounts of emission data.
This data can be used as input into 3D and 2D representations of the 3D-printed parts.
The aim of this paper is to propose an adaptive human-in-the-loop approach using Machine Learning techniques.
arXiv Detail & Related papers (2022-12-12T15:11:18Z)
- Estimating Task Completion Times for Network Rollouts using Statistical Models within Partitioning-based Regression Methods [0.01841601464419306]
This paper proposes a data and Machine Learning-based forecasting solution for the Telecommunications network-rollout planning problem.
Using historical data of milestone completion times, a model needs to incorporate domain knowledge, handle noise and yet be interpretable to project managers.
This paper proposes partition-based regression models that incorporate data-driven statistical models within each partition, as a solution to the problem.
arXiv Detail & Related papers (2022-11-20T04:28:12Z)
- A Marketplace for Trading AI Models based on Blockchain and Incentives for IoT Data [24.847898465750667]
An emerging paradigm in Machine Learning (ML) is a federated approach where the learning model is delivered to a group of heterogeneous agents partially, allowing agents to train the model locally with their own data.
The problem of valuation of models, as well as the questions of incentives for collaborative training and trading of data/models, have received limited treatment in the literature.
In this paper, a new ecosystem of ML model trading over a trusted ML-based network is proposed. The buyer can acquire the model of interest from the ML market, and interested sellers spend local computations on their data to enhance that model's quality.
arXiv Detail & Related papers (2021-12-06T08:52:42Z)
- On Effective Scheduling of Model-based Reinforcement Learning [53.027698625496015]
We propose a framework named AutoMBPO to automatically schedule the real data ratio.
In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance.
arXiv Detail & Related papers (2021-11-16T15:24:59Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees [17.152864798265455]
We propose a novel MIP formulation, based on a 1-norm support vector machine model, to train a multivariate ODT for classification problems.
We provide cutting plane techniques that tighten the linear relaxation of the MIP formulation, in order to improve run times to reach optimality.
We demonstrate that our formulation outperforms its counterparts in the literature by an average of about 10% in terms of mean out-of-sample testing accuracy.
arXiv Detail & Related papers (2020-11-06T14:17:41Z)