Differentially Private Shapley Values for Data Evaluation
- URL: http://arxiv.org/abs/2206.00511v1
- Date: Wed, 1 Jun 2022 14:14:24 GMT
- Title: Differentially Private Shapley Values for Data Evaluation
- Authors: Lauren Watson, Rayna Andreeva, Hao-Tsung Yang, Rik Sarkar
- Abstract summary: Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small ($O(\log n)$) coalitions to achieve the results with guaranteed probabilistic accuracy.
- Score: 3.616258473002814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Shapley value has been proposed as a solution to many applications in
machine learning, including for equitable valuation of data. Shapley values are
computationally expensive and involve the entire dataset. The query for a
point's Shapley value can also compromise the statistical privacy of other data
points. We observe that in machine learning problems such as empirical risk
minimization, and in many learning algorithms (such as those with uniform
stability), a diminishing returns property holds, where marginal benefit per
data point decreases rapidly with data sample size. Based on this property, we
propose a new stratified approximation method called the Layered Shapley
Algorithm. We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random
samples of data and small ($O(\log n)$) coalitions to achieve the results
with guaranteed probabilistic accuracy, and can be modified to incorporate
differential privacy. Experimental results show that the algorithm correctly
identifies high-value data points that improve validation accuracy, and that
the differentially private evaluations preserve approximate ranking of data.
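The stratified idea in the abstract can be sketched in code. The sketch below is a hypothetical illustration, not the paper's implementation: the function names, the per-layer sample count, and the Laplace-noise step are all assumptions, and the paper's actual layer weighting and privacy analysis are more involved.

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) via the inverse CDF (illustrative DP mechanism).
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def layered_shapley_estimate(point, data, utility, samples_per_layer=10,
                             max_coalition=None, dp_epsilon=None,
                             sensitivity=1.0, seed=0):
    """Hypothetical layered (stratified) Shapley estimate for one data point.

    For each coalition size k (a "layer"), average the marginal gain
    utility(S + [point]) - utility(S) over a few random coalitions S of
    size k, then average across layers. Coalition sizes are capped at
    O(log n), mirroring the abstract's small-coalition claim.
    """
    rng = random.Random(seed)
    others = [x for x in data if x != point]
    if max_coalition is None:
        max_coalition = min(len(others), max(1, int(math.log2(len(data)))))
    layer_means = []
    for k in range(max_coalition + 1):
        gains = [utility(S + [point]) - utility(S)
                 for S in (rng.sample(others, k)
                           for _ in range(samples_per_layer))]
        layer_means.append(sum(gains) / len(gains))
    estimate = sum(layer_means) / len(layer_means)
    if dp_epsilon is not None:
        # Add Laplace noise calibrated to an assumed utility sensitivity.
        estimate += laplace_noise(sensitivity / dp_epsilon, rng)
    return estimate
```

With an additive utility such as `sum`, every marginal gain equals the point's own value, so the estimate recovers it exactly; for model-based utilities (e.g. validation accuracy after retraining on the coalition), the estimate is only probabilistically accurate, as in the paper's guarantees.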
Related papers
- Efficient Data Shapley for Weighted Nearest Neighbor Algorithms [47.62605581521535]
WKNN-Shapley efficiently computes Data Shapley values for the weighted $K$-nearest-neighbor (WKNN) algorithm.
We show WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
arXiv Detail & Related papers (2024-01-20T03:34:18Z)
- Accelerated Shapley Value Approximation for Data Evaluation [3.707457963532597]
We show that Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
arXiv Detail & Related papers (2023-11-09T13:15:36Z)
- Shapley Value on Probabilistic Classifiers [6.163093930860032]
In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose Probabilistic Shapley (P-Shapley) value by constructing a probability-wise utility function.
arXiv Detail & Related papers (2023-06-12T15:09:13Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Variance reduced Shapley value estimation for trustworthy data valuation [16.03510965397185]
We propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short).
We theoretically show how to stratify, how many samples to take at each stratum, and the sample complexity of VRDS.
arXiv Detail & Related papers (2022-10-30T13:04:52Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Differentially Private Simple Linear Regression [2.614403183902121]
We study algorithms for simple linear regression that satisfy differential privacy.
We consider the design of differentially private algorithms for simple linear regression for small datasets.
We study the performance of a spectrum of algorithms we adapt to the setting.
arXiv Detail & Related papers (2020-07-10T04:28:43Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
The proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- A Distributional Framework for Data Valuation [26.065217938868617]
We develop an algorithm for estimating values from data that comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms.
We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
arXiv Detail & Related papers (2020-02-27T18:51:35Z)
- Towards Efficient Data Valuation Based on the Shapley Value [65.4167993220998]
We study the problem of data valuation by utilizing the Shapley value.
The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value.
We propose a repertoire of efficient algorithms for approximating the Shapley value.
arXiv Detail & Related papers (2019-02-27T00:22:43Z)
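For contrast with the stratified and structural approaches above, a common baseline in this literature is plain permutation sampling: average each point's marginal contribution over random orderings of the whole dataset. The sketch below is illustrative; the function name and defaults are assumptions, not taken from any of the listed papers.

```python
import random

def permutation_shapley(data, utility, num_perms=200, seed=0):
    """Monte Carlo Shapley estimates via random permutations.

    For each sampled permutation, scan the points in order and credit
    each one with the utility gain it causes when added to the growing
    coalition; averaging over permutations estimates the Shapley value.
    """
    rng = random.Random(seed)
    n = len(data)
    totals = [0.0] * n
    for _ in range(num_perms):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        prev = utility([])  # utility of the empty coalition
        for i in order:
            coalition.append(data[i])
            cur = utility(coalition)
            totals[i] += cur - prev
            prev = cur
    return [t / num_perms for t in totals]
```

Each permutation costs $n$ utility evaluations, i.e. $n$ model retrainings when the utility is validation accuracy; that cost on the full dataset is precisely what small-sample and small-coalition methods like the Layered Shapley Algorithm aim to avoid.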
This list is automatically generated from the titles and abstracts of the papers in this site.