Addressing Budget Allocation and Revenue Allocation in Data Market
Environments Using an Adaptive Sampling Algorithm
- URL: http://arxiv.org/abs/2306.02543v1
- Date: Mon, 5 Jun 2023 02:28:19 GMT
- Title: Addressing Budget Allocation and Revenue Allocation in Data Market
Environments Using an Adaptive Sampling Algorithm
- Authors: Boxin Zhao, Boxiang Lyu, Raul Castro Fernandez, Mladen Kolar
- Abstract summary: We introduce a new algorithm to solve budget allocation and revenue allocation problems simultaneously in linear time.
The new algorithm employs an adaptive sampling process that selects data from those providers who are contributing the most to the model.
We provide theoretical guarantees showing that the budget is used efficiently and that the revenue allocation has properties similar to those of the Shapley value.
- Score: 14.206050847214652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality machine learning models are dependent on access to high-quality
training data. When the data are not already available, it is tedious and
costly to obtain them. Data markets help with identifying valuable training
data: model consumers pay to train a model, the market uses that budget to
identify data and train the model (the budget allocation problem), and finally
the market compensates data providers according to their data contribution
(revenue allocation problem). For example, a bank could pay the data market to
access data from other financial institutions to train a fraud detection model.
Compensating data contributors requires understanding the data's contribution to
the model; recent efforts to solve this revenue allocation problem based on the
Shapley value are too computationally inefficient to support practical data markets.
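For context, Shapley-value-based allocation pays each provider its average marginal contribution over all coalitions of the other providers. For a set $N$ of $n$ providers and a utility function $v$ (for example, the accuracy of a model trained on a coalition's data), provider $i$ receives

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr).$$

Computing this exactly requires evaluating $v$ on up to $2^{n-1}$ coalitions, and each evaluation typically means retraining a model, which is what makes exact Shapley-based allocation impractical at market scale.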
In this paper, we introduce a new algorithm to solve budget allocation and
revenue allocation problems simultaneously in linear time. The new algorithm
employs an adaptive sampling process that selects data from those providers who
are contributing the most to the model. Better data means that the algorithm
accesses those providers more often, and more frequent access corresponds to
higher compensation. Furthermore, the algorithm can be deployed in both
centralized and federated scenarios, broadening its applicability. We provide
theoretical guarantees showing that the budget is used efficiently and that the
revenue allocation satisfies properties similar to those of the Shapley value.
Finally, we conduct an empirical evaluation demonstrating the algorithm's
performance in practical scenarios and in comparison with baseline methods. Overall,
we believe that the new algorithm paves the way for the implementation of
practical data markets.
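The abstract does not spell out the sampling rule, so the following Python sketch only illustrates the general idea under stated assumptions: a multiplicative-weights (bandit-style) loop in which a training budget is spent by sampling providers in proportion to their observed contribution, and each provider's revenue share is proportional to how often its data was selected. The function run_market, the parameter eta, and the simulated gain signal are hypothetical illustrations, not the authors' algorithm.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def run_market(num_providers=5, budget=1000, eta=0.1, seed=0):
    """Spend `budget` sampling rounds across providers; return
    (selection_counts, revenue_shares)."""
    rng = random.Random(seed)
    # Unknown usefulness of each provider's data. Simulated here; in a real
    # market the signal would be the measured improvement of the consumer's
    # model after training on the sampled data.
    true_quality = [rng.random() for _ in range(num_providers)]

    scores = [0.0] * num_providers  # cumulative observed contribution
    counts = [0] * num_providers    # how often each provider's data was used

    for _ in range(budget):
        probs = softmax([eta * s for s in scores])
        i = rng.choices(range(num_providers), weights=probs, k=1)[0]
        counts[i] += 1

        # Observed marginal gain from training on provider i's data, e.g. the
        # drop in validation loss after one update; here a noisy proxy.
        gain = max(0.0, true_quality[i] + rng.gauss(0.0, 0.1))
        scores[i] += gain  # providers that help more get sampled more later

    # Revenue allocation: each provider is paid in proportion to how often
    # its data was actually used within the consumer's budget.
    shares = [c / budget for c in counts]
    return counts, shares


if __name__ == "__main__":
    counts, shares = run_market()
    print("selection counts:", counts)
    print("revenue shares:  ", [round(s, 3) for s in shares])
```

In this toy run, providers with higher simulated quality typically end up with larger selection counts and hence larger revenue shares, mirroring the "better data means more frequent access, and more frequent access means higher compensation" mechanism described in the abstract.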
Related papers
- Targeted Learning for Data Fairness [52.59573714151884]
We expand fairness inference by evaluating fairness in the data generating process itself.
We derive estimators for demographic parity, equal opportunity, and conditional mutual information.
To validate our approach, we perform several simulations and apply our estimators to real data.
arXiv Detail & Related papers (2025-02-06T18:51:28Z)
- Data Acquisition for Improving Model Fairness using Reinforcement Learning [3.3916160303055563]
We focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness.
We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire.
We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even when acquiring only a few data points.
arXiv Detail & Related papers (2024-12-04T03:56:54Z)
- Compute-Constrained Data Selection [77.06528009072967]
We find that many powerful data selection methods are almost never compute-optimal.
For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
arXiv Detail & Related papers (2024-10-21T17:11:21Z)
- DAVED: Data Acquisition via Experimental Design for Data Markets [25.300193837833426]
We propose a federated approach to the data acquisition problem that is inspired by linear experimental design.
Our proposed data acquisition method achieves lower prediction error without requiring labeled validation data.
The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
arXiv Detail & Related papers (2024-03-20T18:05:52Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Preserving Fairness in AI under Domain Shift [15.820660013260584]
Existing algorithms for ensuring fairness in AI use a single-shot training strategy.
We develop an algorithm to adapt a fair model to remain fair under domain shift.
arXiv Detail & Related papers (2023-01-29T06:13:40Z)
- Data Budgeting for Machine Learning [17.524791147624086]
We study the data budgeting problem and formulate it as two sub-problems.
We propose a learning method to solve data budgeting problems.
Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot study dataset with as few as $50$ data points.
arXiv Detail & Related papers (2022-10-03T14:53:17Z)
- Self-supervised similarity models based on well-logging data [1.0723143072368782]
We present an approach that provides universal data representations suitable for solving different problems across different oil fields.
Our approach relies on self-supervised learning over sequential logging data for intervals from wells.
We find that using a variational autoencoder leads to the most reliable and accurate models.
arXiv Detail & Related papers (2022-09-26T06:24:08Z)
- Augmented Bilinear Network for Incremental Multi-Stock Time-Series Classification [83.23129279407271]
We propose a method to efficiently retain the knowledge available in a neural network pre-trained on a set of securities.
In our method, the prior knowledge encoded in a pre-trained neural network is maintained by keeping existing connections fixed.
This knowledge is adjusted for the new securities by a set of augmented connections, which are optimized using the new data.
arXiv Detail & Related papers (2022-07-23T18:54:10Z)
- Data Sharing Markets [95.13209326119153]
We study a setup where each agent can be both buyer and seller of data.
We consider two cases: bilateral data exchange (trading data with data) and unilateral data exchange (trading data with money).
arXiv Detail & Related papers (2021-07-19T06:00:34Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based active learning (AL) are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)