Differentially Private Shapley Values for Data Evaluation
- URL: http://arxiv.org/abs/2206.00511v1
- Date: Wed, 1 Jun 2022 14:14:24 GMT
- Title: Differentially Private Shapley Values for Data Evaluation
- Authors: Lauren Watson, Rayna Andreeva, Hao-Tsung Yang, Rik Sarkar
- Abstract summary: Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small ($O(\log n)$) coalitions to achieve the results with guaranteed probabilistic accuracy.
- Score: 3.616258473002814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Shapley value has been proposed as a solution to many applications in
machine learning, including for equitable valuation of data. Shapley values are
computationally expensive and involve the entire dataset. The query for a
point's Shapley value can also compromise the statistical privacy of other data
points. We observe that in machine learning problems such as empirical risk
minimization, and in many learning algorithms (such as those with uniform
stability), a diminishing returns property holds, where marginal benefit per
data point decreases rapidly with data sample size. Based on this property, we
propose a new stratified approximation method called the Layered Shapley
Algorithm. We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random
samples of data and small ($O(\log n)$) coalitions to achieve the results
with guaranteed probabilistic accuracy, and can be modified to incorporate
differential privacy. Experimental results show that the algorithm correctly
identifies high-value data points that improve validation accuracy, and that
the differentially private evaluations preserve approximate ranking of data.
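The stratified idea in the abstract can be sketched in code. The sketch below is a hypothetical illustration, not the paper's implementation: the function names, the per-layer sample count, and the Laplace-noise step are all assumptions, and the paper's actual layer weighting and privacy analysis are more involved.

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) via the inverse CDF (illustrative DP mechanism).
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def layered_shapley_estimate(point, data, utility, samples_per_layer=10,
                             max_coalition=None, dp_epsilon=None,
                             sensitivity=1.0, seed=0):
    """Hypothetical layered (stratified) Shapley estimate for one data point.

    For each coalition size k (a "layer"), average the marginal gain
    utility(S + [point]) - utility(S) over a few random coalitions S of
    size k, then average across layers. Coalition sizes are capped at
    O(log n), mirroring the abstract's small-coalition claim.
    """
    rng = random.Random(seed)
    others = [x for x in data if x != point]
    if max_coalition is None:
        max_coalition = min(len(others), max(1, int(math.log2(len(data)))))
    layer_means = []
    for k in range(max_coalition + 1):
        gains = [utility(S + [point]) - utility(S)
                 for S in (rng.sample(others, k)
                           for _ in range(samples_per_layer))]
        layer_means.append(sum(gains) / len(gains))
    estimate = sum(layer_means) / len(layer_means)
    if dp_epsilon is not None:
        # Add Laplace noise calibrated to an assumed utility sensitivity.
        estimate += laplace_noise(sensitivity / dp_epsilon, rng)
    return estimate
```

With an additive utility such as `sum`, every marginal gain equals the point's own value, so the estimate recovers it exactly; for model-based utilities (e.g. validation accuracy after retraining on the coalition), the estimate is only probabilistically accurate, as in the paper's guarantees.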
Related papers
- Efficient Data Shapley for Weighted Nearest Neighbor Algorithms [47.62605581521535]
WKNN-Shapley efficiently computes Data Shapley values for the weighted $K$-nearest-neighbor (WKNN) algorithm.
We show WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
arXiv Detail & Related papers (2024-01-20T03:34:18Z)
- Accelerated Shapley Value Approximation for Data Evaluation [3.707457963532597]
We show that Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
arXiv Detail & Related papers (2023-11-09T13:15:36Z)
- Shapley Value on Probabilistic Classifiers [6.163093930860032]
In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose Probabilistic Shapley (P-Shapley) value by constructing a probability-wise utility function.
arXiv Detail & Related papers (2023-06-12T15:09:13Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Variance reduced Shapley value estimation for trustworthy data valuation [16.03510965397185]
We propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short).
We theoretically show how to stratify, how many samples to take at each stratum, and the sample complexity of VRDS.
arXiv Detail & Related papers (2022-10-30T13:04:52Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Differentially Private Simple Linear Regression [2.614403183902121]
We study algorithms for simple linear regression that satisfy differential privacy.
We consider the design of differentially private algorithms for simple linear regression for small datasets.
We study the performance of a spectrum of algorithms we adapt to the setting.
arXiv Detail & Related papers (2020-07-10T04:28:43Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
The proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- A Distributional Framework for Data Valuation [26.065217938868617]
We develop an algorithm for estimating values from data that comes with formal guarantees and runs two orders of magnitude faster than state-of-the-art algorithms.
We apply distributional Shapley to diverse data sets and demonstrate its utility in a data market setting.
arXiv Detail & Related papers (2020-02-27T18:51:35Z)
- Towards Efficient Data Valuation Based on the Shapley Value [65.4167993220998]
We study the problem of data valuation by utilizing the Shapley value.
The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value.
We propose a repertoire of efficient algorithms for approximating the Shapley value.
arXiv Detail & Related papers (2019-02-27T00:22:43Z)
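For contrast with the stratified and structural approaches above, a common baseline in this literature is plain permutation sampling: average each point's marginal contribution over random orderings of the whole dataset. The sketch below is illustrative; the function name and defaults are assumptions, not taken from any of the listed papers.

```python
import random

def permutation_shapley(data, utility, num_perms=200, seed=0):
    """Monte Carlo Shapley estimates via random permutations.

    For each sampled permutation, scan the points in order and credit
    each one with the utility gain it causes when added to the growing
    coalition; averaging over permutations estimates the Shapley value.
    """
    rng = random.Random(seed)
    n = len(data)
    totals = [0.0] * n
    for _ in range(num_perms):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        prev = utility([])  # utility of the empty coalition
        for i in order:
            coalition.append(data[i])
            cur = utility(coalition)
            totals[i] += cur - prev
            prev = cur
    return [t / num_perms for t in totals]
```

Each permutation costs $n$ utility evaluations, i.e. $n$ model retrainings when the utility is validation accuracy; that cost on the full dataset is precisely what small-sample and small-coalition methods like the Layered Shapley Algorithm aim to avoid.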
This list is automatically generated from the titles and abstracts of the papers in this site.