Uncertainty Quantification of Data Shapley via Statistical Inference
- URL: http://arxiv.org/abs/2407.19373v1
- Date: Sun, 28 Jul 2024 02:54:27 GMT
- Title: Uncertainty Quantification of Data Shapley via Statistical Inference
- Authors: Mengmeng Wu, Zhihong Liu, Xiang Li, Ruoxi Jia, Xiangyu Chang,
- Abstract summary: The emergence of data markets underscores the growing importance of data valuation.
Within the machine learning landscape, Data Shapley stands out as a widely embraced method for data valuation.
This paper establishes the relationship between Data Shapley and infinite-order U-statistics.
- Score: 20.35973700939768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As data plays an increasingly pivotal role in decision-making, the emergence of data markets underscores the growing importance of data valuation. Within the machine learning landscape, Data Shapley stands out as a widely embraced method for data valuation. However, a limitation of Data Shapley is its assumption of a fixed dataset, contrasting with the dynamic nature of real-world applications where data constantly evolves and expands. This paper establishes the relationship between Data Shapley and infinite-order U-statistics and addresses this limitation by quantifying the uncertainty of Data Shapley with changes in data distribution from the perspective of U-statistics. We make statistical inferences on data valuation to obtain confidence intervals for the estimations. We construct two different algorithms to estimate this uncertainty and provide recommendations for their applicable situations. We also conduct a series of experiments on various datasets to verify asymptotic normality and propose a practical trading scenario enabled by this method.
Related papers
- Federated Prediction-Powered Inference from Decentralized Data [40.84399531998246]
Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability.
The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI.
arXiv Detail & Related papers (2024-09-03T09:14:18Z) - Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI)
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z) - Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z) - Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets [0.0]
We study the applicability of procedures based on combining rules to the analysis of DIPS datasets.
Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
arXiv Detail & Related papers (2024-05-08T02:33:35Z) - Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models.
We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies.
We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
arXiv Detail & Related papers (2024-05-02T16:50:47Z) - Variance reduced Shapley value estimation for trustworthy data valuation [16.03510965397185]
We propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short)
We theoretically show how to stratify, how many samples are taken at each stratum, and the sample complexity analysis of VRDS.
arXiv Detail & Related papers (2022-10-30T13:04:52Z) - Investigating Data Variance in Evaluations of Automatic Machine
Translation Metrics [58.50754318846996]
In this paper, we show that the performances of metrics are sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
arXiv Detail & Related papers (2022-03-29T18:58:28Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Evaluating Model Robustness and Stability to Dataset Shift [7.369475193451259]
We propose a framework for analyzing stability of machine learning models.
We use the original evaluation data to determine distributions under which the algorithm performs poorly.
We estimate the algorithm's performance on the "worst-case" distribution.
arXiv Detail & Related papers (2020-10-28T17:35:39Z) - On Disentangled Representations Learned From Correlated Data [59.41587388303554]
We bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data.
We show that systematically induced correlations in the dataset are being learned and reflected in the latent representations.
We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
arXiv Detail & Related papers (2020-06-14T12:47:34Z) - A Distributional Framework for Data Valuation [26.065217938868617]
We develop an algorithm for estimating values from data that comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms.
We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
arXiv Detail & Related papers (2020-02-27T18:51:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.