Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models
- URL: http://arxiv.org/abs/2506.05379v1
- Date: Sun, 01 Jun 2025 22:17:18 GMT
- Title: Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models
- Authors: Seyed Moein Ayyoubzadeh, Kourosh Shahnazari, Mohammadali Keshtparvar, MohammadAmin Fazli
- Abstract summary: Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. We introduce a mechanism-design framework for truthful, trust-minimized data sharing. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. Existing data procurement methods often rely on unverifiable trust or ignore heterogeneous provider costs. We introduce a mechanism-design framework for truthful, trust-minimized data sharing that ensures dominant-strategy incentive compatibility (DSIC), individual rationality, and weak budget balance, while rewarding data based on both quality and learning utility. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance. Based on this, we propose the Quality-Weighted Marginal-Incentive Auction (Q-MIA), which ranks providers using a virtual cost metric and uses Myerson-style payments to ensure DSIC and budget feasibility. To support settings with limited liquidity or long-term incentives, we introduce the Marginal Utility Token (MUT), which allocates future rights based on marginal contributions. We unify these in Mixed-MIA, a hybrid mechanism balancing upfront payments and deferred rewards. All mechanisms support verifiable, privacy-preserving implementation. Theoretically and empirically, they outperform volume-based and trust-based baselines, eliciting higher-quality data under budget constraints while remaining robust to misreporting and collusion. This establishes a principled foundation for sustainable and fair data markets for future LLMs.
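The Q-MIA construction described in the abstract (virtual-cost ranking with Myerson-style threshold payments) can be sketched minimally as follows. The `Bid` type, the greedy budget cutoff, and the pay-reported-cost fallback for the last-ranked provider are simplifying assumptions for illustration, not the paper's exact mechanism.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    provider: str
    cost: float     # privately known, reported cost c_i
    quality: float  # verified quality score q_i > 0

def q_mia(bids, budget):
    """Sketch of a quality-weighted reverse auction with threshold
    (Myerson-style) payments: rank providers by virtual cost c_i / q_i,
    admit them greedily while the budget lasts, and pay each winner the
    critical value at which it would lose its slot, so a winner's
    payment never depends on its own reported cost (DSIC)."""
    ranked = sorted(bids, key=lambda b: b.cost / b.quality)
    payments = {}
    for k, b in enumerate(ranked):
        if k + 1 < len(ranked):
            runner_up = ranked[k + 1]
            # threshold: highest cost b could have reported and still
            # beaten the next-ranked provider's virtual cost
            pay = b.quality * runner_up.cost / runner_up.quality
        else:
            pay = b.cost  # no runner-up: simplification, pay reported cost
        if sum(payments.values()) + pay > budget:
            break  # weak budget balance: stop before overspending
        payments[b.provider] = pay
    return payments
```

Because each winner's payment is computed from the runner-up's bid, misreporting a higher cost cannot raise the payment; it can only cost the provider its slot.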
Related papers
- Incentivizing Inclusive Contributions in Model Sharing Markets [47.66231950174746]
This paper proposes inclusive and incentivized personalized federated learning (iPFL).
iPFL incentivizes data holders with diverse purposes to collaboratively train personalized models without revealing raw data.
Empirical studies on eleven AI tasks demonstrate that iPFL consistently achieves the highest economic utility.
arXiv Detail & Related papers (2025-05-05T08:45:26Z)
- Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI [6.671649946926508]
Federated Learning (FL) enables machine learning while preserving data privacy but struggles to balance privacy preservation (PP) and fairness.
DP enhances privacy but can disproportionately impact underrepresented groups, while HE and SMC mitigate fairness concerns at the cost of computational overhead.
Our findings highlight context-dependent trade-offs and offer guidelines for designing FL systems that uphold responsible AI principles, ensuring fairness, privacy, and equitable real-world applications.
arXiv Detail & Related papers (2025-03-20T15:31:01Z)
- MetaTrading: An Immersion-Aware Model Trading Framework for Vehicular Metaverse Services [94.61039892220037]
We propose an immersion-aware model trading framework that facilitates data provision for services while ensuring privacy through federated learning (FL).
We design an incentive mechanism to incentivize metaverse users (MUs) to contribute high-value models under resource constraints.
We develop a fully distributed dynamic reward algorithm based on deep reinforcement learning, without accessing any private information about MUs and other MSPs.
arXiv Detail & Related papers (2024-10-25T16:20:46Z)
- IMFL-AIGC: Incentive Mechanism Design for Federated Learning Empowered by Artificial Intelligence Generated Content [15.620004060097155]
Federated learning (FL) has emerged as a promising paradigm that enables clients to collaboratively train a shared global model without uploading their local data.
We propose a data quality-aware incentive mechanism to encourage clients' participation.
Our proposed mechanism exhibits the highest training accuracy and reduces the server's cost by up to 53.34% on real-world datasets.
arXiv Detail & Related papers (2024-06-12T07:47:22Z)
- Optimal Pricing for Data-Augmented AutoML Marketplaces [34.293214013879464]
We propose a pragmatic data-augmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms.
Unlike standard AutoML solutions, our design automatically augments buyer-submitted training data with valuable external datasets.
Our key innovation is a pricing mechanism grounded in instrumental value: the marginal improvement in model quality.
arXiv Detail & Related papers (2023-10-27T01:49:13Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensures fidelity to the source data, and assesses utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Mechanisms that Incentivize Data Sharing in Federated Learning [90.74337749137432]
We show how a naive scheme leads to catastrophic levels of free-riding where the benefits of data sharing are completely eroded.
We then introduce accuracy shaping based mechanisms to maximize the amount of data generated by each agent.
arXiv Detail & Related papers (2022-07-10T22:36:52Z)
- Spending Privacy Budget Fairly and Wisely [7.975975942400017]
Differentially private (DP) synthetic data generation is a practical method for improving access to data.
One issue inherent to DP is that the "privacy budget" is generally "spent" evenly across features in the data set.
We develop ensemble methods that distribute the privacy budget "wisely" to maximize predictive accuracy of models trained on DP data.
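The uneven-allocation idea above can be sketched minimally; the proportional rule and the feature-importance scores are illustrative assumptions, not the paper's ensemble method.

```python
def allocate_privacy_budget(importances, total_epsilon):
    """Split a total DP budget across features in proportion to a
    (hypothetical) per-feature importance score, instead of the usual
    even split. By sequential composition, the per-feature epsilons
    sum to the total budget."""
    total = sum(importances.values())
    return {feat: total_epsilon * w / total
            for feat, w in importances.items()}
```

A feature judged three times as predictive would receive three quarters of the budget and therefore less noise, at the cost of noisier answers on the remaining features.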
arXiv Detail & Related papers (2022-04-27T13:13:56Z)
- Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data.
We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money).
arXiv Detail & Related papers (2021-07-19T06:00:34Z)
- Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in input data, limited ability to explain decisions, and bias in training data are among the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.