Related papers: An Investigation of Smart Contract for Collaborative Machine Learning Model Training

An Investigation of Smart Contract for Collaborative Machine Learning Model Training

URL: http://arxiv.org/abs/2209.05017v1
Date: Mon, 12 Sep 2022 04:25:01 GMT
Title: An Investigation of Smart Contract for Collaborative Machine Learning Model Training
Authors: Shengwen Ding, Chenhui Hu
Abstract summary: Collaborative machine learning (CML) has penetrated various fields in the era of big data. As the training of ML models requires a massive amount of good quality data, it is necessary to eliminate concerns about data privacy. Based on blockchain, smart contracts enable automatic execution of data preserving and validation.
Score: 3.5679973993372642
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Machine learning (ML) has penetrated various fields in the era of big data. The advantage of collaborative machine learning (CML) over most conventional ML lies in the joint effort of decentralized nodes or agents that results in better model performance and generalization. As the training of ML models requires a massive amount of good quality data, it is necessary to eliminate concerns about data privacy and ensure high-quality data. To solve this problem, we cast our eyes on the integration of CML and smart contracts. Based on blockchain, smart contracts enable automatic execution of data preserving and validation, as well as the continuity of CML model training. In our simulation experiments, we define incentive mechanisms on the smart contract, investigate the important factors such as the number of features in the dataset (num_words), the size of the training data, the cost for the data holders to submit data, etc., and conclude how these factors impact the performance metrics of the model: the accuracy of the trained model, the gap between the accuracies of the model before and after simulation, and the time to use up the balance of bad agent. For instance, the increase of the value of num_words leads to higher model accuracy and eliminates the negative influence of malicious agents in a shorter time from our observation of the experiment results. Statistical analyses show that with the help of smart contracts, the influence of invalid data is efficiently diminished and model robustness is maintained. We also discuss the gap in existing research and put forward possible future directions for further works.

Related papers

Privacy-Preserving Methods for Bug Severity Prediction [0.0]
We investigate method-level bug severity prediction using source code metrics and Large Language Models.<n>We compare the performance of models trained using centralized learning, federated learning, and synthetic data generation.<n>Our finding highlights the potential of privacy-preserving approaches to enable effective bug severity prediction in industrial context.
arXiv Detail & Related papers (2025-06-28T04:40:51Z)
Model State Arithmetic for Machine Unlearning [43.773053236733425]
We propose a new algorithm, MSA, for estimating and undoing the influence of datapoints.<n>Our experimental results demonstrate that MSA consistently outperforms existing machine unlearning algorithms.
arXiv Detail & Related papers (2025-06-26T02:16:16Z)
EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding [50.29046178980637]
EpiCoDe is a method that boosts model performance in data-scarcity scenarios without extra training.<n>We show that EpiCoDe consistently outperforms existing methods with significant and robust improvement.
arXiv Detail & Related papers (2025-06-04T02:11:54Z)
Aligning Language Models with Observational Data: Opportunities and Risks from a Causal Perspective [0.0]
We study the challenges and opportunities of fine-tuning large language models using observational data.<n>We show that while observational outcomes can provide valuable supervision, directly fine-tuning models on such data can lead them to learn spurious correlations.<n>We propose DeconfoundLM, a method that explicitly removes the effect of known confounders from reward signals.
arXiv Detail & Related papers (2025-05-30T18:44:09Z)
PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare [2.1046377530356764]
Fine-tuning of Large Language Models (LLMs) has become the default practice for improving model performance on a given task. PatientDx is a framework of model merging that allows the design of effective LLMs for health-predictive tasks without requiring fine-tuning nor adaptation on patient data.
arXiv Detail & Related papers (2025-04-24T08:21:04Z)
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement. We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z)
MetaTrading: An Immersion-Aware Model Trading Framework for Vehicular Metaverse Services [94.61039892220037]
We propose an immersion-aware model trading framework that facilitates data provision for services while ensuring privacy through federated learning (FL) We design an incentive mechanism to incentivize metaverse users (MUs) to contribute high-value models under resource constraints. We develop a fully distributed dynamic reward algorithm based on deep reinforcement learning, without accessing any private information about MUs and other MSPs.
arXiv Detail & Related papers (2024-10-25T16:20:46Z)
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms. We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law. Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, Counterfactual Attentiveness Test (CAT) CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction. We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling. Current approaches are categorized into feature imputation and label prediction. This study proposes a Contrastive Learning framework to model observed data with missing values.
arXiv Detail & Related papers (2023-09-18T13:16:24Z)
Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16 [0.29998889086656577]
We show that relatively minor modifications on a benchmark dataset cause significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)
Exploring the Vulnerabilities of Machine Learning and Quantum Machine Learning to Adversarial Attacks using a Malware Dataset: A Comparative Analysis [0.0]
Machine learning (ML) and quantum machine learning (QML) have shown remarkable potential in tackling complex problems. Their susceptibility to adversarial attacks raises concerns when deploying these systems in security sensitive applications. We present a comparative analysis of the vulnerability of ML and QNN models to adversarial attacks using a malware dataset.
arXiv Detail & Related papers (2023-05-31T06:31:42Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population. Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions [59.284907093349425]
Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. We provide a language for describing how training data influences predictions, through a causal framework. Our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone.
arXiv Detail & Related papers (2022-07-28T17:36:24Z)
CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships [8.679073301435265]
We construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. We use these labels to perturb the data by deleting non-causal agents from the scene. Under non-causal perturbations, we observe a $25$-$38%$ relative change in minADE as compared to the original.
arXiv Detail & Related papers (2022-07-07T21:28:23Z)
A Unified Framework for Task-Driven Data Quality Management [10.092524512413831]
High-quality data is critical to train performant Machine Learning (ML) models. Existing Data Quality Management schemes cannot satisfactorily improve ML performance. We propose a task-driven, model-agnostic DQM framework, DataSifter.
arXiv Detail & Related papers (2021-06-10T03:56:28Z)
How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance. We formulate a quality measure for the data set, which we refer to as $rho$-gap. We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
Data and Model Dependencies of Membership Inference Attack [13.951470844348899]
We provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA. Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use. We propose using those data and model properties as regularizers to protect ML models against MIA.
arXiv Detail & Related papers (2020-02-17T09:35:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.