Variance reduced Shapley value estimation for trustworthy data valuation
- URL: http://arxiv.org/abs/2210.16835v5
- Date: Mon, 22 May 2023 08:34:46 GMT
- Title: Variance reduced Shapley value estimation for trustworthy data valuation
- Authors: Mengmeng Wu, Ruoxi Jia, Changle Lin, Wei Huang, Xiangyu Chang
- Abstract summary: We propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short)
We theoretically show how to stratify, how many samples to take in each stratum, and provide a sample complexity analysis of VRDS.
- Score: 16.03510965397185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation, especially quantifying data value in algorithmic prediction
and decision-making, is a fundamental problem in data trading scenarios. The
most widely used method is to define the data Shapley and approximate it by
means of the permutation sampling algorithm. To mitigate the large
estimation variance of permutation sampling, which hinders the development of
the data marketplace, we propose a more robust data valuation method using
stratified sampling, named variance reduced data Shapley (VRDS for short). We
theoretically show how to stratify, how many samples to take in each stratum,
and provide a sample complexity analysis of VRDS. Finally, the effectiveness of VRDS
is illustrated in different types of datasets and data removal applications.
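As a rough illustration of the stratified idea described in the abstract (a minimal sketch, not the authors' implementation), the code below stratifies marginal-contribution samples by coalition size and averages the per-stratum means. The utility function, point indices, and per-stratum sample count are placeholder assumptions.

```python
import random

def stratified_shapley(i, data, utility, samples_per_stratum=10, rng=None):
    """Estimate the Shapley value of data point `i` by stratified sampling:
    one stratum per coalition size k in {0, ..., n-1}; within each stratum,
    average marginal contributions over random size-k coalitions, then
    average the stratum means (each coalition size has equal Shapley weight)."""
    rng = rng or random.Random(0)
    n = len(data)
    others = [j for j in range(n) if j != i]
    stratum_means = []
    for k in range(n):  # stratum k: coalitions of size k drawn from N \ {i}
        contribs = []
        for _ in range(samples_per_stratum):
            S = rng.sample(others, k)
            contribs.append(utility(S + [i], data) - utility(S, data))
        stratum_means.append(sum(contribs) / len(contribs))
    return sum(stratum_means) / n

# Toy additive utility: a coalition's value is the sum of its points' values,
# so the exact Shapley value of each point is its own value.
def additive_utility(indices, data):
    return sum(data[j] for j in indices)

data = [3.0, 1.0, 2.0]
print(stratified_shapley(0, data, additive_utility))  # 3.0 for this additive utility
```

For the additive toy utility every marginal contribution of point 0 equals 3.0 exactly, so the estimator is deterministic here; for a real learning-based utility, the per-stratum sample sizes would be tuned as the paper analyzes.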
Related papers
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI)
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
arXiv Detail & Related papers (2024-06-06T17:37:39Z) - Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z) - Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z) - Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value [17.340091573913316]
We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU when there are $10^6$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
arXiv Detail & Related papers (2023-04-16T08:03:58Z) - Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
arXiv Detail & Related papers (2022-07-16T03:09:30Z) - Differentially Private Shapley Values for Data Evaluation [3.616258473002814]
Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small ($O(\log n)$) coalitions to achieve the results with guaranteed probabilistic accuracy.
arXiv Detail & Related papers (2022-06-01T14:14:24Z) - More Informed Random Sample Consensus [1.827510863075184]
We propose a method that samples data with a Lévy distribution together with a data sorting algorithm.
In the hypothesis sampling step, data are sorted with a proposed sorting algorithm based on the likelihood of each point belonging to the inlier set.
Then, hypotheses are sampled from the sorted data with a Lévy distribution.
arXiv Detail & Related papers (2020-11-18T06:43:50Z) - Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We derive the minimum-variance (i.e., efficient) OPE estimator for multiple loggers, valid for any instance.
arXiv Detail & Related papers (2020-10-21T13:43:48Z) - Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Relationship-aware Multivariate Sampling Strategy for Scientific Simulation Data [4.2855912967712815]
In this work, we propose a multivariate sampling strategy which preserves the original variable relationships.
Our proposed strategy utilizes principal component analysis to capture the variance of multivariate data and can be built on top of any existing state-of-the-art sampling algorithms for single variables.
arXiv Detail & Related papers (2020-08-31T00:52:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.