Data Acquisition via Experimental Design for Decentralized Data Markets
- URL: http://arxiv.org/abs/2403.13893v1
- Date: Wed, 20 Mar 2024 18:05:52 GMT
- Title: Data Acquisition via Experimental Design for Decentralized Data Markets
- Authors: Charles Lu, Baihe Huang, Sai Praneeth Karimireddy, Praneeth Vepakomma, Michael Jordan, Ramesh Raskar
- Abstract summary: Data markets provide a way to increase the supply of data, particularly in data-scarce domains such as healthcare.
A major challenge for a data buyer in such a market is selecting the most valuable data points from a data seller.
We propose a federated approach to the data selection problem that is inspired by linear experimental design.
- Score: 25.300193837833426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Acquiring high-quality training data is essential for current machine learning models. Data markets provide a way to increase the supply of data, particularly in data-scarce domains such as healthcare, by incentivizing potential data sellers to join the market. A major challenge for a data buyer in such a market is selecting the most valuable data points from a data seller. Unlike prior work in data valuation, which assumes centralized data access, we propose a federated approach to the data selection problem that is inspired by linear experimental design. Our proposed data selection method achieves lower prediction error without requiring labeled validation data and can be optimized in a fast and federated procedure. The key insight of our work is that a method that directly estimates the benefit of acquiring data for test set prediction is particularly compatible with a decentralized market setting.
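As a rough illustration of the idea the abstract describes, choosing seller data by its estimated benefit for the buyer's test-set predictions without using labels, the sketch below runs a generic greedy selection under a linear-model assumption. It is not the authors' method: it is centralized rather than federated, and the function names (`greedy_v_optimal`, `mean_predictive_variance`), the `ridge` regularizer, and the greedy rule are all assumptions made for this example. The score being reduced is the average of x^T (X_S^T X_S + ridge * I)^{-1} x over the buyer's unlabeled test points x, a standard V-optimality-style criterion from linear experimental design.

```python
import numpy as np


def mean_predictive_variance(X_sel, X_test, ridge=1e-3):
    """Average of t^T (X_sel^T X_sel + ridge*I)^{-1} t over the test rows t."""
    d = X_test.shape[1]
    A = X_sel.T @ X_sel + ridge * np.eye(d)
    A_inv = np.linalg.inv(A)
    return float(np.mean(np.einsum("ij,jk,ik->i", X_test, A_inv, X_test)))


def greedy_v_optimal(X_sell, X_test, k, ridge=1e-3):
    """Greedily pick k seller rows that most reduce test predictive variance.

    Hypothetical helper for illustration only; no labels are used.
    """
    d = X_sell.shape[1]
    A_inv = np.eye(d) / ridge  # inverse of the initial matrix ridge * I
    remaining = list(range(len(X_sell)))
    chosen = []
    for _ in range(k):
        cand = np.array(remaining)
        Xc = X_sell[cand]
        # Row i of Z is (A_inv x_i)^T, because A_inv is symmetric.
        Z = Xc @ A_inv
        denom = 1.0 + np.einsum("ij,ij->i", Z, Xc)
        # Sherman-Morrison: adding x lowers sum_t t^T A_inv t by
        # sum_t (t^T A_inv x)^2 / (1 + x^T A_inv x).
        gain = ((Z @ X_test.T) ** 2).sum(axis=1) / denom
        best = int(cand[np.argmax(gain)])
        x = X_sell[best]
        Ax = A_inv @ x
        # Rank-one update of the inverse after adding x x^T.
        A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_sell = rng.normal(size=(200, 5))   # seller features (synthetic)
    X_test = rng.normal(size=(20, 5))    # buyer's unlabeled test features
    picked = greedy_v_optimal(X_sell, X_test, k=10)
    print(picked, mean_predictive_variance(X_sell[picked], X_test))
```

The criterion depends only on feature vectors, which is why this family of methods needs no labeled validation data; how such a selection can be computed in a federated way is the subject of the paper itself and is not shown here.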
Related papers
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- Data Measurements for Decentralized Data Markets [18.99870296998749]
Decentralized data markets can provide more equitable forms of data acquisition for machine learning.
We propose and benchmark federated data measurements to allow a data buyer to find sellers with relevant and diverse datasets.
arXiv Detail & Related papers (2024-06-06T17:03:51Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm [14.206050847214652]
We introduce a new algorithm to solve budget allocation and revenue allocation problems simultaneously in linear time.
The new algorithm employs an adaptive sampling process that selects data from those providers who are contributing the most to the model.
We provide theoretical guarantees for the algorithm that show the budget is used efficiently and the properties of revenue allocation are similar to Shapley's.
arXiv Detail & Related papers (2023-06-05T02:28:19Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Fundamentals of Task-Agnostic Data Valuation [21.78555506720078]
We study valuing the data of a data owner/seller for a data seeker/buyer.
We focus on task-agnostic data valuation without any validation requirements.
arXiv Detail & Related papers (2022-08-25T22:07:07Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- OSOUM Framework for Trading Data Research [79.0383470835073]
We supply, to the best of our knowledge, the first open source simulation platform, Open SOUrce Market Simulator (OSOUM), to analyze trading markets and specifically data markets.
We describe and implement a specific data market model, consisting of two types of agents: sellers who own various datasets available for acquisition, and buyers searching for relevant and beneficial datasets for purchase.
Although commercial frameworks, intended for handling data markets, already exist, we provide a free and extensive end-to-end research tool for simulating possible behavior for both buyers and sellers participating in (data) markets.
arXiv Detail & Related papers (2021-02-18T09:20:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.