Related papers: Do Data Valuations Make Good Data Prices?

URL: http://arxiv.org/abs/2504.05563v2
Date: Fri, 26 Sep 2025 16:21:43 GMT
Title: Do Data Valuations Make Good Data Prices?
Authors: Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy
Abstract summary: We revisit data valuations from a market-design perspective. We show that popular valuation methods, such as Leave-One-Out and Data Shapley, make for poor payments. We adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves.
Score: 10.526444017990302
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models increasingly rely on external data sources, compensating data contributors has become a central concern. But how should these payments be devised? We revisit data valuations from a $\textit{market-design perspective}$ where payments serve to compensate data owners for the $\textit{private}$ heterogeneous costs they incur for collecting and sharing data. We show that popular valuation methods, such as Leave-One-Out and Data Shapley, make for poor payments. They fail to ensure truthful reporting of the costs, leading to $\textit{inefficient market}$ outcomes. To address this, we adapt well-established payment rules from mechanism design, namely Myerson and Vickrey-Clarke-Groves (VCG), to the data market setting. We show that Myerson payment is the minimal truthful mechanism, optimal from the buyer's perspective. Additionally, we identify a condition under which both data buyers and sellers are utility-satisfied, and the market achieves efficiency. Our findings highlight the importance of incorporating incentive compatibility into data valuation design, paving the way for more robust and efficient data markets. Our data market framework is readily applicable to real-world scenarios. We illustrate this with simulations of contributor compensation in an LLM-based retrieval-augmented generation (RAG) marketplace tasked with challenging medical question answering.
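
Illustrative sketch (not from the paper): the abstract refers to truthful payment rules of the Myerson type. As a rough intuition only, the toy procurement auction below pays each selected seller a threshold, i.e. the highest cost they could have reported while still being selected. The setting (a buyer who picks the k cheapest sellers, subject to a reserve price) and the names run_auction, seller_utility, k, and reserve are assumptions made for this example; the paper's data market model is likely richer. The sketch also assumes distinct, positive reported costs.

```python
from typing import List, Sequence, Tuple


def run_auction(reports: Sequence[float], k: int, reserve: float) -> Tuple[List[bool], List[float]]:
    """Toy reverse auction: select up to k cheapest sellers, pay threshold prices.

    Hypothetical setting for illustration. Assumes k >= 1 and distinct reports.
    """
    selected, payments = [], []
    for i, report in enumerate(reports):
        # Reports of everyone except seller i, cheapest first.
        others = sorted(r for j, r in enumerate(reports) if j != i)
        # Threshold: the highest report at which seller i would still be chosen.
        # It depends only on the other sellers' reports and the reserve price.
        threshold = min(reserve, others[k - 1]) if len(others) >= k else reserve
        wins = report <= threshold
        selected.append(wins)
        payments.append(threshold if wins else 0.0)
    return selected, payments


def seller_utility(true_cost: float, report: float, others: Sequence[float],
                   k: int, reserve: float) -> float:
    """Utility of one seller (placed at index 0) for a given report."""
    selected, payments = run_auction([report, *others], k, reserve)
    return payments[0] - true_cost if selected[0] else 0.0


if __name__ == "__main__":
    costs = [3.0, 1.0, 4.0, 2.0]
    print(run_auction(costs, k=2, reserve=10.0))
    # A seller with true cost 2.0 facing rivals [3.0, 1.0, 4.0]:
    for report in (0.5, 2.0, 2.9, 5.0):
        print(report, seller_utility(2.0, report, [3.0, 1.0, 4.0], k=2, reserve=10.0))
```

Because a winner's payment does not depend on their own report, misreporting a cost can only change whether the seller is selected, never the price received. In this toy setting that is why truthful reporting is never worse than any misreport.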

Related papers

Designing DSIC Mechanisms for Data Sharing in the Era of Large Language Models [0.0]
Training large language models (LLMs) requires vast amounts of high-quality data from institutions that face legal, privacy, and strategic constraints. We introduce a mechanism-design framework for truthful, trust-minimized data sharing. We formalize a model where providers privately know their data cost and quality, and value arises solely from the data's contribution to model performance.
arXiv Detail & Related papers (2025-06-01T22:17:18Z)
DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
Data Pricing for Graph Neural Networks without Pre-purchased Inspection [15.556650640576311]
Model marketplaces leverage model trading mechanisms to properly incentivize data owners to contribute their data. We propose a novel mechanism, named Structural Importance based Model Trading (SIMT), that assesses data importance and compensates data owners accordingly. SIMT consistently outperforms vanilla baselines by up to 40% in both MacroF1 and MicroF1.
arXiv Detail & Related papers (2025-02-12T10:42:04Z)
An Instrumental Value for Data Production and its Application to Data Pricing [107.98697414652479]
This paper develops an approach for capturing the instrumental value of data production processes. We show how they connect to classic notions of information design and signals in information economics.
arXiv Detail & Related papers (2024-12-24T03:53:57Z)
Wasserstein Markets for Differentially-Private Data [1.4266656344673316]
Data markets provide a means to enable wider access as well as determine the appropriate privacy-utility trade-off. Existing data market frameworks either require a trusted third party to perform expensive valuations or are unable to capture the nature of data value. This paper proposes a valuation mechanism based on the Wasserstein distance for differentially-private data, and corresponding procurement mechanisms.
arXiv Detail & Related papers (2024-12-03T17:40:26Z)
Pricing Strategies for Different Accuracy Models from the Same Dataset Based on Generalized Hotelling's Law [9.353146025394372]
We consider a scenario where a seller possesses a dataset $D$ and trains it into models of varying accuracies for sale in the market. The dataset can be reused to train models with different accuracies, and the training cost is independent of the sales volume.
arXiv Detail & Related papers (2024-04-08T08:02:18Z)
DAVED: Data Acquisition via Experimental Design for Data Markets [25.300193837833426]
We propose a federated approach to the data acquisition problem that is inspired by linear experimental design. Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
arXiv Detail & Related papers (2024-03-20T18:05:52Z)
A Bargaining-based Approach for Feature Trading in Vertical Federated Learning [54.51890573369637]
We propose a bargaining-based feature trading approach in Vertical Federated Learning (VFL) to encourage economically efficient transactions. Our model incorporates performance gain-based pricing, taking into account the revenue-based optimization objectives of both parties.
arXiv Detail & Related papers (2024-02-23T10:21:07Z)
Privacy-Aware Data Acquisition under Data Similarity in Regression Markets [29.64195175524365]
We show that data similarity and privacy preferences are integral to market design. We numerically evaluate how data similarity affects market participation and traded data value.
arXiv Detail & Related papers (2023-12-05T09:39:04Z)
Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets. We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
Optimal Pricing for Data-Augmented AutoML Marketplaces [34.293214013879464]
We propose a pragmatic data-augmented AutoML market that seamlessly integrates with existing cloud-based AutoML platforms. Unlike standard AutoML solutions, our design automatically augments buyer-submitted training data with valuable external datasets. Our key innovation is a pricing mechanism grounded in the instrumental value, that is, the marginal model quality improvement.
arXiv Detail & Related papers (2023-10-27T01:49:13Z)
Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm [14.206050847214652]
We introduce a new algorithm to solve budget allocation and revenue allocation problems simultaneously in linear time. The new algorithm employs an adaptive sampling process that selects data from those providers who are contributing the most to the model. We provide theoretical guarantees for the algorithm that show the budget is used efficiently and the properties of revenue allocation are similar to Shapley's.
arXiv Detail & Related papers (2023-06-05T02:28:19Z)
Mechanisms that Incentivize Data Sharing in Federated Learning [90.74337749137432]
We show how a naive scheme leads to catastrophic levels of free-riding where the benefits of data sharing are completely eroded. We then introduce accuracy shaping based mechanisms to maximize the amount of data generated by each agent.
arXiv Detail & Related papers (2022-07-10T22:36:52Z)
VFed-SSD: Towards Practical Vertical Federated Advertising [53.08038962443853]
We propose a semi-supervised split distillation framework VFed-SSD to alleviate the two limitations. Specifically, we develop a self-supervised task MatchedPair Detection (MPD) to exploit the vertically partitioned unlabeled data. Our framework provides an efficient federation-enhanced solution for real-time display advertising with minimal deploying cost and significant performance lift.
arXiv Detail & Related papers (2022-05-31T17:45:30Z)
Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data. We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money).
arXiv Detail & Related papers (2021-07-19T06:00:34Z)
A Principled Approach to Data Valuation for Federated Learning [73.19984041333599]
Federated learning (FL) is a popular technique to train machine learning (ML) models on decentralized data sources. The Shapley value (SV) defines a unique payoff scheme that satisfies many desiderata for a data value notion. This paper proposes a variant of the SV amenable to FL, which we call the federated Shapley value.
arXiv Detail & Related papers (2020-09-14T04:37:54Z)
