Addressing Budget Allocation and Revenue Allocation in Data Market
Environments Using an Adaptive Sampling Algorithm
- URL: http://arxiv.org/abs/2306.02543v1
- Date: Mon, 5 Jun 2023 02:28:19 GMT
- Authors: Boxin Zhao, Boxiang Lyu, Raul Castro Fernandez, Mladen Kolar
- Abstract summary: We introduce a new algorithm to solve budget allocation and revenue allocation problems simultaneously in linear time.
The new algorithm employs an adaptive sampling process that selects data from those providers who are contributing the most to the model.
We provide theoretical guarantees for the algorithm that show the budget is used efficiently and the properties of revenue allocation are similar to Shapley's.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality machine learning models are dependent on access to high-quality
training data. When the data are not already available, it is tedious and
costly to obtain them. Data markets help with identifying valuable training
data: model consumers pay to train a model, the market uses that budget to
identify data and train the model (the budget allocation problem), and finally
the market compensates data providers according to their data contribution
(revenue allocation problem). For example, a bank could pay the data market to
access data from other financial institutions to train a fraud detection model.
Compensating data contributors requires understanding data's contribution to
the model; recent efforts to solve this revenue allocation problem based on the
Shapley value are too inefficient to lead to practical data markets.
In this paper, we introduce a new algorithm to solve budget allocation and
revenue allocation problems simultaneously in linear time. The new algorithm
employs an adaptive sampling process that selects data from those providers who
are contributing the most to the model. Better data means that the algorithm
accesses those providers more often, and more frequent access corresponds to
higher compensation. Furthermore, the algorithm can be deployed in both
centralized and federated scenarios, boosting its applicability. We provide
theoretical guarantees for the algorithm that show the budget is used
efficiently and the properties of revenue allocation are similar to Shapley's.
Finally, we conduct an empirical evaluation to show the performance of the
algorithm in practical scenarios and when compared to other baselines. Overall,
we believe that the new algorithm paves the way for the implementation of
practical data markets.
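The adaptive sampling idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it swaps in a simple epsilon-greedy rule as a stand-in, and every name in it (`adaptive_sampling_market`, `providers`, `utility`, `epsilon`) is a hypothetical choice made for illustration. It assumes each access costs one unit of budget and that a caller-supplied `utility` function scores the data selected so far. Providers whose samples raise the score more are accessed more often, and each provider's revenue share is its fraction of the accesses.

```python
import random


def adaptive_sampling_market(providers, budget, utility, epsilon=0.1, seed=0):
    """Illustrative epsilon-greedy sketch (an assumption, not the paper's method).

    providers: dict mapping provider name -> list of data points
    budget:    number of unit-cost accesses the consumer has paid for
    utility:   hypothetical function scoring a list of selected data points
    """
    if budget <= 0:
        raise ValueError("budget must be positive")

    rng = random.Random(seed)
    names = list(providers)
    counts = {name: 0 for name in names}        # accesses per provider
    mean_gain = {name: 0.0 for name in names}   # running mean of observed gains
    selected = []                               # data acquired so far
    score = utility(selected)

    for _ in range(budget):
        untried = [n for n in names if counts[n] == 0]
        if untried:
            choice = untried[0]                 # try every provider once
        elif rng.random() < epsilon:
            choice = rng.choice(names)          # occasional exploration
        else:
            choice = max(names, key=lambda n: mean_gain[n])  # best so far

        selected.append(rng.choice(providers[choice]))
        new_score = utility(selected)
        gain, score = new_score - score, new_score

        # Incremental update of the provider's average marginal gain.
        counts[choice] += 1
        mean_gain[choice] += (gain - mean_gain[choice]) / counts[choice]

    # Revenue allocation: share of the budget proportional to access frequency.
    payments = {n: counts[n] / budget for n in names}
    return selected, payments


# Toy usage: one provider holds useful data, the other holds uninformative data.
providers = {"good": [1.0] * 10, "noisy": [0.0] * 10}
selected, payments = adaptive_sampling_market(providers, budget=50, utility=sum)
print(payments)
```

The loop makes a single pass over the budget with one utility evaluation per access, which is the sense in which budget allocation and revenue allocation are handled together in this sketch. A Shapley-based valuation would instead average marginal contributions over many provider subsets.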
Related papers
- Compute-Constrained Data Selection [77.06528009072967]
We formalize the problem of data selection with a cost-aware utility function, and model the problem as trading off initial-selection cost for training gain.
We run a comprehensive sweep of experiments across multiple tasks, varying compute budget by scaling finetuning tokens, model sizes, and data selection compute.
arXiv Detail & Related papers (2024-10-21T17:11:21Z)
- DAVED: Data Acquisition via Experimental Design for Data Markets [25.300193837833426]
We propose a federated approach to the data acquisition problem that is inspired by linear experimental design.
Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data.
The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
arXiv Detail & Related papers (2024-03-20T18:05:52Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- Preserving Fairness in AI under Domain Shift [15.820660013260584]
Existing algorithms for ensuring fairness in AI use a single-shot training strategy.
We develop an algorithm to adapt a fair model to remain fair under domain shift.
arXiv Detail & Related papers (2023-01-29T06:13:40Z)
- Data Budgeting for Machine Learning [17.524791147624086]
We study the data budgeting problem and formulate it as two sub-problems.
We propose a learning method to solve data budgeting problems.
Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot study dataset with as few as 50 data points.
arXiv Detail & Related papers (2022-10-03T14:53:17Z)
- Self-supervised similarity models based on well-logging data [1.0723143072368782]
We present an approach that provides universal data representations suitable for solutions to different problems for different oil fields.
Our approach relies on a self-supervised methodology for sequential logging data for intervals from wells.
We found that using the variational autoencoder leads to the most reliable and accurate models.
arXiv Detail & Related papers (2022-09-26T06:24:08Z)
- Augmented Bilinear Network for Incremental Multi-Stock Time-Series Classification [83.23129279407271]
We propose a method to efficiently retain the knowledge available in a neural network pre-trained on a set of securities.
In our method, the prior knowledge encoded in a pre-trained neural network is maintained by keeping existing connections fixed.
This knowledge is adjusted for the new securities by a set of augmented connections, which are optimized using the new data.
arXiv Detail & Related papers (2022-07-23T18:54:10Z)
- How Much More Data Do I Need? Estimating Requirements for Downstream Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z)
- Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data.
We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money).
arXiv Detail & Related papers (2021-07-19T06:00:34Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) methods are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)