Data Distribution Valuation
- URL: http://arxiv.org/abs/2410.04386v1
- Date: Sun, 6 Oct 2024 07:56:53 GMT
- Title: Data Distribution Valuation
- Authors: Xinyi Xu, Shuaiqi Wang, Chuan-Sheng Foo, Bryan Kian Hsiang Low, Giulia Fanti
- Abstract summary: Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
- Score: 56.71023681599737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a discrete dataset. However, in many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled. For example, consider a buyer trying to evaluate whether to purchase data from different vendors. The buyer may observe (and compare) only a small preview sample from each vendor, to decide which vendor's data distribution is most useful to the buyer and purchase. The core question is how should we compare the values of data distributions from their samples? Under a Huber characterization of the data heterogeneity across vendors, we propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
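As a rough illustration of the idea in the abstract, the sketch below ranks vendors by the negative squared MMD between each vendor's preview sample and a reference sample. It is a minimal sketch and not the authors' implementation: the RBF kernel, the fixed bandwidth, the availability of a reference sample, and the names `rbf_kernel`, `mmd_squared`, `rank_vendors`, and `huber_sample` are all assumptions made for illustration. The Huber-style contamination is simulated by mixing a clean distribution with an outlier distribution at rate `eps`.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


def rank_vendors(reference, previews, bandwidth=1.0):
    """Order vendor names from most to least valuable (assumed value: -MMD^2)."""
    values = {
        name: -mmd_squared(sample, reference, bandwidth)
        for name, sample in previews.items()
    }
    return sorted(values, key=values.get, reverse=True)


rng = np.random.default_rng(0)


def huber_sample(eps, n=200):
    """Draw n points from a (1 - eps) clean / eps outlier mixture (hypothetical data)."""
    clean = rng.normal(0.0, 1.0, size=(n, 2))
    outliers = rng.normal(3.0, 1.0, size=(n, 2))
    is_outlier = rng.random(n) < eps
    return np.where(is_outlier[:, None], outliers, clean)


# Hypothetical reference sample and two vendor previews with different contamination.
reference = rng.normal(0.0, 1.0, size=(200, 2))
previews = {"vendor_a": huber_sample(0.05), "vendor_b": huber_sample(0.5)}
ranking = rank_vendors(reference, previews)
print(ranking)
```

In this toy setup the less contaminated vendor should rank first. In practice the kernel bandwidth would need tuning (for example with a median heuristic), and a clean reference sample may not be available.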
Related papers
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z) - Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI).
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z) - Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z) - Data Valuation by Leveraging Global and Local Statistical Information [25.911043100052588]
We show that both global and local value distributions hold significant potential for data valuation within the context of machine learning.
We propose a new data valuation method that estimates Shapley values by incorporating the explored distribution characteristics into an existing method, AME.
We also present a new path to address the dynamic data valuation problem by formulating an optimization problem that integrates information of both global and local value distributions.
arXiv Detail & Related papers (2024-05-23T08:58:08Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named neural dynamic data valuation (NDDV).
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value [17.340091573913316]
We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
arXiv Detail & Related papers (2023-04-16T08:03:58Z) - Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features [119.22672589020394]
We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features.
Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data.
arXiv Detail & Related papers (2023-02-10T18:58:03Z) - Fundamentals of Task-Agnostic Data Valuation [21.78555506720078]
We study valuing the data of a data owner/seller for a data seeker/buyer.
We focus on task-agnostic data valuation without any validation requirements.
arXiv Detail & Related papers (2022-08-25T22:07:07Z) - Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics [58.50754318846996]
In this paper, we show that the performances of metrics are sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
arXiv Detail & Related papers (2022-03-29T18:58:28Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)