How to Sell High-Dimensional Data Optimally
- URL: http://arxiv.org/abs/2510.15214v1
- Date: Fri, 17 Oct 2025 00:49:03 GMT
- Title: How to Sell High-Dimensional Data Optimally
- Authors: Andrew Li, R. Ravi, Karan Singh, Zihong Yi, Weizhong Zhang
- Abstract summary: We consider an information pricing problem that involves a decision-making buyer and a monopolistic seller. Since the buyer gains greater utility through better decisions resulting from more accurate assessments of the state, the seller can therefore promise the supplemental information at a price. We propose an algorithm which, given only sampling access to the state space, provably generates a near-optimal menu with a number of samples independent of the state space.
- Score: 31.69704731506027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the problem of selling large, proprietary data, we consider an information pricing problem proposed by Bergemann et al. that involves a decision-making buyer and a monopolistic seller. The seller has access to the underlying state of the world that determines the utility of the various actions the buyer may take. Since the buyer gains greater utility through better decisions resulting from more accurate assessments of the state, the seller can therefore promise the buyer supplemental information at a price. To contend with the fact that the seller may not be perfectly informed about the buyer's private preferences (or utility), we frame the problem of designing a data product as one where the seller designs a revenue-maximizing menu of statistical experiments. Prior work by Cai et al. showed that an optimal menu can be found in time polynomial in the state space, whereas we observe that the state space is naturally exponential in the dimension of the data. We propose an algorithm which, given only sampling access to the state space, provably generates a near-optimal menu with a number of samples independent of the state space. We then analyze a special case of high-dimensional Gaussian data, showing that (a) it suffices to consider scalar Gaussian experiments, (b) the optimal menu of such experiments can be found efficiently via a semidefinite program, and (c) full surplus extraction occurs if and only if a natural separation condition holds on the set of potential preferences of the buyer.
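To make the buyer's side of the abstract concrete, here is a minimal sketch of how much a single statistical experiment is worth to one buyer type. It assumes a small finite state space and uses the standard value-of-information definition: expected payoff when best-responding to the signal, minus expected payoff when acting on the prior alone. The function name `value_of_experiment` and its arguments are hypothetical illustrations. This is not the paper's sampling algorithm or its semidefinite program.

```python
def value_of_experiment(prior, utility, experiment):
    """Buyer's expected gain from observing one signal of an experiment.

    prior[w]         : probability of state w
    utility[a][w]    : buyer's payoff from action a in state w
    experiment[w][s] : probability of signal s given state w
    (All names are illustrative assumptions, not taken from the paper.)
    """
    n_states = len(prior)
    n_signals = len(experiment[0])

    # Best expected payoff when the buyer acts on the prior alone.
    baseline = max(
        sum(prior[w] * u_a[w] for w in range(n_states)) for u_a in utility
    )

    # Expected payoff when the buyer picks the best action after each signal.
    informed = 0.0
    for s in range(n_signals):
        # Joint weights P(state = w, signal = s); no normalisation is needed
        # because the weights already include the probability of signal s.
        joint = [prior[w] * experiment[w][s] for w in range(n_states)]
        informed += max(
            sum(joint[w] * u_a[w] for w in range(n_states)) for u_a in utility
        )

    return informed - baseline


# Two states, two actions; the buyer earns 1 for matching the state.
prior = [0.5, 0.5]
utility = [[1.0, 0.0], [0.0, 1.0]]

print(value_of_experiment(prior, utility, [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
print(value_of_experiment(prior, utility, [[0.5, 0.5], [0.5, 0.5]]))  # 0.0
```

A fully revealing experiment is worth 0.5 to this buyer and an uninformative one is worth nothing. In the menu setting, this value would be an upper bound on what such a buyer type is willing to pay for the experiment.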
Related papers
- Calibrating an Imperfect Auxiliary Predictor for Unobserved No-Purchase Choice [1.5484595752241122]
Firms typically cannot observe key consumer actions: whether customers buy from a competitor, choose not to buy, or even fully consider the firm's offer. This missing outside-option information makes market-size and preference estimation difficult even in simple multinomial logit (MNL) models. We study a complementary setting in which a black-box auxiliary predictor provides outside-option probabilities, but is potentially biased or miscalibrated because it was trained in a different channel, period, or population. We develop calibration methods that turn such imperfect predictions into statistically valid no-purchase estimates using purchase-only data from the focal
arXiv Detail & Related papers (2026-02-12T03:00:36Z) - An Instrumental Value for Data Production and its Application to Data Pricing [107.98697414652479]
This paper develops an approach for capturing the instrumental value of data production processes. We show how they connect to classic notions of information design and signals in information economics.
arXiv Detail & Related papers (2024-12-24T03:53:57Z) - Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z) - Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z) - Data Market Design through Deep Learning [16.505791601397185]
We introduce the application of deep learning for the design of revenue-optimal data markets.
Our experiments demonstrate that this new deep learning framework can almost precisely replicate all known solutions from theory.
arXiv Detail & Related papers (2023-10-31T00:21:09Z) - Striking a Balance: An Optimal Mechanism Design for Heterogenous Differentially Private Data Acquisition for Logistic Regression [7.523820334642733]
We address the challenge of solving machine learning tasks using data from privacy-sensitive sellers.
Since the data is private, we design a data market that incentivizes sellers to provide their data in exchange for payments.
arXiv Detail & Related papers (2023-09-19T05:51:13Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z) - Fundamentals of Task-Agnostic Data Valuation [21.78555506720078]
We study valuing the data of a data owner/seller for a data seeker/buyer.
We focus on task-agnostic data valuation without any validation requirements.
arXiv Detail & Related papers (2022-08-25T22:07:07Z) - An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z) - Spatial Privacy Pricing: The Interplay between Privacy, Utility and Price in Geo-Marketplaces [14.466602643062142]
Users concerned about privacy may want to charge more for data that pinpoints their location accurately, but may charge less for data that is more vague.
A buyer would prefer to minimize data costs, but may have to spend more to get the necessary level of accuracy.
We call this interplay between privacy, utility, and price spatial privacy pricing.
arXiv Detail & Related papers (2020-08-25T06:28:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.