Accelerated Shapley Value Approximation for Data Evaluation
- URL: http://arxiv.org/abs/2311.05346v1
- Date: Thu, 9 Nov 2023 13:15:36 GMT
- Title: Accelerated Shapley Value Approximation for Data Evaluation
- Authors: Lauren Watson, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq
Elahi, Rik Sarkar
- Abstract summary: We show that Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
- Score: 3.707457963532597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data valuation has found various applications in machine learning, such as
data filtering, efficient learning and incentives for data sharing. The most
popular current approach to data valuation is the Shapley value. Despite its
popularity, the Shapley value is computationally expensive even to approximate,
as it requires repeatedly training models on different subsets of the data. In
this paper we show that the Shapley value of data
points can be approximated more efficiently by leveraging the structural
properties of machine learning problems. We derive convergence guarantees on
the accuracy of the approximate Shapley value for different learning settings
including Stochastic Gradient Descent with convex and non-convex loss
functions. Our analysis suggests that models trained on small subsets are in
fact more important in the context of data valuation. Based on this idea, we
describe $\delta$-Shapley, a strategy that uses only small subsets for the
approximation. Experiments show that this approach preserves the approximate
values and ranks of data points while achieving speedups of up to 9.9x. For
pre-trained networks, the approach is found to be even more efficient, yielding
accurate valuations from small subsets.
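To make the small-subset strategy concrete, the following is a minimal sketch of a Monte Carlo valuation restricted to small coalitions, assuming a generic utility(indices) function that trains a model on the selected data points and returns a validation score. The subset sizes and sample counts are illustrative placeholders, not the authors' tuned settings, and the exact $\delta$-Shapley estimator and its weights are given in the paper itself.

```python
import numpy as np

def small_subset_values(n_points, utility, subset_sizes=(2, 4, 8),
                        samples_per_size=25, seed=0):
    """Monte Carlo data values computed from small subsets only.

    utility(indices) is assumed to train a model on the data points with the
    given indices and return a validation score. Restricting coalitions to a
    few small sizes is the strategy described in the abstract; the sizes and
    sample counts here are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)

    for i in range(n_points):
        others = np.delete(np.arange(n_points), i)
        contribs = []
        for k in subset_sizes:
            for _ in range(samples_per_size):
                subset = rng.choice(others, size=k, replace=False)
                # Marginal contribution of point i to a small coalition of size k.
                contribs.append(utility(np.append(subset, i)) - utility(subset))
        values[i] = np.mean(contribs)

    return values
```

Each utility call here trains on at most a handful of points, so the per-coalition cost stays small; the point of the paper's convergence analysis is that dropping the larger coalition sizes loses little for valuation, which is where the reported speedups come from.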
Related papers
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z) - Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
We propose a straightforward and efficient Shapley estimator, SimSHAP, by eliminating redundant techniques.
In our analysis of existing approaches, we observe that estimators can be unified as a linear transformation of randomly summed values from feature subsets.
Our experiments validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
arXiv Detail & Related papers (2023-11-02T06:09:24Z) - An Efficient Shapley Value Computation for the Naive Bayes Classifier [0.0]
This article proposes an exact analytic expression of Shapley values in the case of the naive Bayes classifier.
Our Shapley proposal for the naive Bayes classifier is shown to provide informative results with low algorithmic complexity.
arXiv Detail & Related papers (2023-07-31T14:39:10Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Efficient Shapley Values Estimation by Amortization for Text
Classification [66.7725354593271]
We develop an amortized model that directly predicts each input feature's Shapley Value without additional model evaluations.
Experimental results on two text classification datasets demonstrate that our amortized model estimates Shapley Values accurately with up to 60 times speedup.
arXiv Detail & Related papers (2023-05-31T16:19:13Z) - CS-Shapley: Class-wise Shapley Values for Data Valuation in
Classification [24.44357623723746]
We propose CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions.
Our results suggest Shapley-based data valuation is transferable for application across different models.
arXiv Detail & Related papers (2022-11-13T03:32:33Z) - Differentially Private Shapley Values for Data Evaluation [3.616258473002814]
Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small ($O(\log n)$) coalitions to achieve the results with guaranteed probabilistic accuracy.
arXiv Detail & Related papers (2022-06-01T14:14:24Z) - A Multilinear Sampling Algorithm to Estimate Shapley Values [4.771833920251869]
A Multilinear Sampling Algorithm to Estimate Shapley Values [4.771833920251869]
We propose a new sampling method based on a multilinear extension technique as applied in game theory.
Our method is applicable to any machine learning model, in particular to multi-class classification and regression problems.
arXiv Detail & Related papers (2020-10-22T21:47:16Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z) - Efficient computation and analysis of distributional Shapley values [15.322542729755998]
We derive the first analytic expressions for DShapley for the canonical problems of linear regression, binary classification, and non-parametric density estimation.
Our formulas are directly interpretable and provide quantitative insights into how the value varies for different types of data.
arXiv Detail & Related papers (2020-07-02T19:51:54Z) - Towards Efficient Data Valuation Based on the Shapley Value [65.4167993220998]
We study the problem of data valuation by utilizing the Shapley value.
The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value.
We propose a repertoire of efficient algorithms for approximating the Shapley value.
arXiv Detail & Related papers (2019-02-27T00:22:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.