An Investigation of Smart Contract for Collaborative Machine Learning
Model Training
- URL: http://arxiv.org/abs/2209.05017v1
- Date: Mon, 12 Sep 2022 04:25:01 GMT
- Title: An Investigation of Smart Contract for Collaborative Machine Learning
Model Training
- Authors: Shengwen Ding, Chenhui Hu
- Abstract summary: Collaborative machine learning (CML) has penetrated various fields in the era of big data.
As the training of ML models requires a massive amount of good quality data, it is necessary to eliminate concerns about data privacy.
Based on blockchain, smart contracts enable automatic execution of data preserving and validation.
- Score: 3.5679973993372642
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine learning (ML) has penetrated various fields in the era of big data.
The advantage of collaborative machine learning (CML) over most conventional ML
lies in the joint effort of decentralized nodes or agents that results in
better model performance and generalization. As the training of ML models
requires a massive amount of good quality data, it is necessary to eliminate
concerns about data privacy and ensure high-quality data. To solve this
problem, we cast our eyes on the integration of CML and smart contracts. Based
on blockchain, smart contracts enable automatic execution of data preserving
and validation, as well as the continuity of CML model training. In our
simulation experiments, we define incentive mechanisms on the smart contract,
investigate the important factors such as the number of features in the dataset
(num_words), the size of the training data, the cost for the data holders to
submit data, etc., and conclude how these factors impact the performance
metrics of the model: the accuracy of the trained model, the gap between the
accuracies of the model before and after simulation, and the time to use up the
balance of bad agent. For instance, the increase of the value of num_words
leads to higher model accuracy and eliminates the negative influence of
malicious agents in a shorter time from our observation of the experiment
results. Statistical analyses show that with the help of smart contracts, the
influence of invalid data is efficiently diminished and model robustness is
maintained. We also discuss the gap in existing research and put forward
possible future directions for further works.
Related papers
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, Counterfactual Attentiveness Test (CAT)
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z) - Towards Better Modeling with Missing Data: A Contrastive Learning-based
Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
arXiv Detail & Related papers (2023-09-18T13:16:24Z) - Exploring the Vulnerabilities of Machine Learning and Quantum Machine
Learning to Adversarial Attacks using a Malware Dataset: A Comparative
Analysis [0.0]
Machine learning (ML) and quantum machine learning (QML) have shown remarkable potential in tackling complex problems.
Their susceptibility to adversarial attacks raises concerns when deploying these systems in security sensitive applications.
We present a comparative analysis of the vulnerability of ML and QNN models to adversarial attacks using a malware dataset.
arXiv Detail & Related papers (2023-05-31T06:31:42Z) - AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Measuring Causal Effects of Data Statistics on Language Model's
`Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models.
We provide a language for describing how training data influences predictions, through a causal framework.
Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z) - CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal
Relationships [8.679073301435265]
We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data.
We use these labels to perturb the data by deleting non-causal agents from the scene.
Under non-causal perturbations, we observe a $25$-$38%$ relative change in minADE as compared to the original.
arXiv Detail & Related papers (2022-07-07T21:28:23Z) - A Unified Framework for Task-Driven Data Quality Management [10.092524512413831]
High-quality data is critical to train performant Machine Learning (ML) models.
Existing Data Quality Management schemes cannot satisfactorily improve ML performance.
We propose a task-driven, model-agnostic DQM framework, DataSifter.
arXiv Detail & Related papers (2021-06-10T03:56:28Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $rho$-gap.
We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z) - Data and Model Dependencies of Membership Inference Attack [13.951470844348899]
We provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA.
Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use.
We propose using those data and model properties as regularizers to protect ML models against MIA.
arXiv Detail & Related papers (2020-02-17T09:35:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.