Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards
- URL: http://arxiv.org/abs/2112.09327v1
- Date: Fri, 17 Dec 2021 05:15:30 GMT
- Title: Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards
- Authors: Sebastian Shenghong Tay and Xinyi Xu and Chuan Sheng Foo and Bryan Kian Hsiang Low
- Abstract summary: This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data.
Distributing synthetic data as rewards offers task- and model-agnostic benefits for downstream learning tasks.
- Score: 26.850070556844628
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data to a pool for training a generative model (e.g., a GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate with their contributions. Distributing synthetic data as rewards (instead of trained models or money) offers task- and model-agnostic benefits for downstream learning tasks and is less likely to violate data privacy regulations. To realize the framework, we first propose a data valuation function using maximum mean discrepancy (MMD) that values data based on its quantity and its quality in terms of its closeness to the true data distribution, and we provide theoretical results guiding the kernel choice in our MMD-based data valuation function. We then formulate the reward scheme as a linear optimization problem that, when solved, guarantees certain incentives such as fairness in the CGM framework. We devise a weighted sampling algorithm for generating the synthetic data distributed to each party as its reward, such that the value of its data and the synthetic data combined matches the reward value assigned by the reward scheme. We empirically show using simulated and real-world datasets that the parties' synthetic data rewards are commensurate with their contributions.
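The MMD-based valuation described in the abstract can be sketched with a standard empirical MMD estimate: a party's data is more valuable when it lies closer (in MMD) to a reference sample standing in for the true distribution. The kernel, bandwidth, and reference sample below are illustrative assumptions, not the paper's actual choices.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    diff = x[:, None, :] - y[None, :, :]          # pairwise differences, shape (n, m, d)
    sq_dists = np.sum(diff ** 2, axis=-1)         # squared Euclidean distances, shape (n, m)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between samples x and y."""
    kxx = rbf_kernel(x, x, bandwidth).mean()
    kyy = rbf_kernel(y, y, bandwidth).mean()
    kxy = rbf_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2 * kxy

# A contribution well aligned with the reference distribution scores a
# smaller MMD (hence a higher value) than a shifted contribution.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(500, 2))   # proxy for the true distribution
party_close = rng.normal(0.0, 1.0, size=(200, 2)) # well-aligned contribution
party_far = rng.normal(3.0, 1.0, size=(200, 2))   # distribution-shifted contribution
```

The paper's theoretical results concern how the kernel should be chosen; the fixed-bandwidth RBF kernel here is only a common default for such a sketch.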
Related papers
- Mechanisms for Data Sharing in Collaborative Causal Inference (Extended Version) [2.709511652792003]
This paper devises an evaluation scheme to measure the value of each party's data contribution to the common learning task.
It can be leveraged to reward agents fairly, according to the quality of their data, or to maximize all agents' data contributions.
arXiv Detail & Related papers (2024-07-04T14:32:32Z)
- IMFL-AIGC: Incentive Mechanism Design for Federated Learning Empowered by Artificial Intelligence Generated Content [15.620004060097155]
Federated learning (FL) has emerged as a promising paradigm that enables clients to collaboratively train a shared global model without uploading their local data.
We propose a data quality-aware incentive mechanism to encourage clients' participation.
Our proposed mechanism exhibits the highest training accuracy and reduces the server's cost by up to 53.34% on real-world datasets.
arXiv Detail & Related papers (2024-06-12T07:47:22Z) - Incentives in Private Collaborative Machine Learning [56.84263918489519]
Collaborative machine learning involves training models on data from multiple parties.
We introduce differential privacy (DP) as an incentive.
We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets.
arXiv Detail & Related papers (2024-04-02T06:28:22Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serve as an alternative for training machine learning models. However, ensuring that synthetic data mirror the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement [42.45888600367566]
Directed generation aims to generate samples with desired properties as measured by a reward function.
We consider the common learning scenario where the data set consists of unlabeled data along with a smaller set of data with noisy reward labels.
arXiv Detail & Related papers (2023-07-13T20:20:40Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Mechanisms that Incentivize Data Sharing in Federated Learning [90.74337749137432]
We show how a naive scheme leads to catastrophic levels of free-riding where the benefits of data sharing are completely eroded.
We then introduce accuracy-shaping-based mechanisms to maximize the amount of data generated by each agent.
arXiv Detail & Related papers (2022-07-10T22:36:52Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense a dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Collaborative Machine Learning with Incentive-Aware Model Rewards [32.43927226170119]
Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models by training on the aggregated data from many parties.
These parties are only willing to share their data when given enough incentives, such as a guaranteed fair reward based on their contributions.
This paper proposes to value a party's reward based on Shapley value and information gain on model parameters given its data.
arXiv Detail & Related papers (2020-10-24T06:20:55Z)
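The last entry values a party's reward via the Shapley value. As a minimal sketch, the Shapley value averages each party's marginal contribution over all coalition join orders; the coalition value function below (diminishing returns in total data quantity) is a hypothetical stand-in, not that paper's actual information-gain-based valuation.

```python
from itertools import combinations
from math import factorial

def shapley_values(parties, value_fn):
    """Exact Shapley values: each party's weighted average marginal
    contribution to the coalition value over all coalitions."""
    n = len(parties)
    values = {p: 0.0 for p in parties}
    for p in parties:
        others = [q for q in parties if q != p]
        for r in range(n):
            for coalition in combinations(others, r):
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                marginal = value_fn(set(coalition) | {p}) - value_fn(set(coalition))
                values[p] += weight * marginal
    return values

# Hypothetical coalition value: square root of total contributed data size.
data_sizes = {"A": 100, "B": 100, "C": 400}
value_fn = lambda coalition: sum(data_sizes[p] for p in coalition) ** 0.5
rewards = shapley_values(list(data_sizes), value_fn)
```

Identical contributors (A and B) receive identical rewards, the larger contributor (C) receives more, and the rewards sum to the grand-coalition value, which is why the Shapley value is a natural basis for fair reward schemes.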
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.