Shapley Value on Probabilistic Classifiers
- URL: http://arxiv.org/abs/2306.07171v1
- Date: Mon, 12 Jun 2023 15:09:13 GMT
- Title: Shapley Value on Probabilistic Classifiers
- Authors: Xiang Li and Haocheng Xia and Jinfei Liu
- Abstract summary: In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose the Probabilistic Shapley (P-Shapley) value, built on a probability-wise utility function.
- Score: 6.163093930860032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation has become an increasingly significant discipline in data
science due to the economic value of data. In the context of machine learning
(ML), data valuation methods aim to equitably measure the contribution of each
data point to the utility of an ML model. One prevalent method is Shapley
value, which helps identify data points that are beneficial or detrimental to
an ML model. However, traditional Shapley-based data valuation methods may not
effectively distinguish between beneficial and detrimental training data points
for probabilistic classifiers. In this paper, we propose Probabilistic Shapley
(P-Shapley) value by constructing a probability-wise utility function that
leverages the predicted class probabilities of probabilistic classifiers rather
than the binarized prediction results used by the traditional Shapley value. We also
offer several activation functions for confidence calibration to effectively
quantify the marginal contribution of each data point to the probabilistic
classifiers. Extensive experiments on four real-world datasets demonstrate the
effectiveness of our proposed P-Shapley value in evaluating the importance of
data for building a high-usability and trustworthy ML model.
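As a concrete illustration, the probability-wise utility can be contrasted with the traditional accuracy-based utility in a minimal sketch. The binary setting, the use of `tanh` as the calibration activation, and the toy numbers are assumptions for illustration, not the paper's exact construction:

```python
import math

def accuracy_utility(probs, labels):
    # Traditional Shapley utility: binarize predictions at 0.5, score accuracy.
    return sum(1.0 for p, y in zip(probs, labels)
               if (p >= 0.5) == (y == 1)) / len(labels)

def p_utility(probs, labels, activation=math.tanh):
    # Probability-wise utility (sketch): average activated predicted
    # probability of the true class. The activation (tanh here, an
    # assumption) stands in for the paper's confidence-calibration choices.
    true_probs = [p if y == 1 else 1.0 - p for p, y in zip(probs, labels)]
    return sum(activation(q) for q in true_probs) / len(labels)

probs, labels = [0.9, 0.6, 0.4], [1, 1, 0]
print(accuracy_utility(probs, labels))  # 1.0: all three binarized predictions correct
print(p_utility(probs, labels))         # lower, and sensitive to confidence shifts
```

Replacing the 0/1 accuracy with a calibrated probability lets the utility move smoothly when a data point nudges the classifier's confidence, which is what allows a Shapley-style valuation to separate beneficial from detrimental points more finely.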
Related papers
- A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs).
We derive novel metrics with high-probability guarantees concerning the output distribution of a model.
Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation [62.2436697657307]
Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data.
We propose a method called Stratified Prediction-Powered Inference (StratPPI).
We show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies.
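The basic (non-stratified) prediction-powered mean estimator combines model predictions on a large unlabeled set with a bias correction from the small labeled set; StratPPI, roughly, applies the same estimator per stratum and recombines with stratum weights. A minimal sketch with illustrative toy numbers:

```python
def ppi_mean(preds_unlabeled, preds_labeled, labels):
    # Prediction-powered estimate of a mean: model predictions on the
    # large unlabeled set, plus a bias-correcting "rectifier" term
    # estimated from the small labeled set.
    model_term = sum(preds_unlabeled) / len(preds_unlabeled)
    rectifier = sum(y - f for f, y in zip(preds_labeled, labels)) / len(labels)
    return model_term + rectifier

# Toy numbers: the model over-predicts by about 0.1 on the labeled data,
# so the rectifier pulls the naive estimate of 0.7 down toward 0.6.
print(ppi_mean([0.6, 0.7, 0.8, 0.7], [0.6, 0.8], [0.5, 0.7]))
```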
arXiv Detail & Related papers (2024-06-06T17:37:39Z)
- EcoVal: An Efficient Data Valuation Framework for Machine Learning [11.685518953430554]
Existing Shapley value based frameworks for data valuation in machine learning are computationally expensive.
We introduce EcoVal, an efficient data valuation framework that estimates the value of data for machine learning models in a fast and practical manner.
arXiv Detail & Related papers (2024-02-14T16:21:47Z)
- Accelerated Shapley Value Approximation for Data Evaluation [3.707457963532597]
We show that Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
arXiv Detail & Related papers (2023-11-09T13:15:36Z)
- Efficient Shapley Values Estimation by Amortization for Text Classification [66.7725354593271]
We develop an amortized model that directly predicts each input feature's Shapley Value without additional model evaluations.
Experimental results on two text classification datasets demonstrate that our amortized model estimates Shapley Values accurately with up to 60 times speedup.
arXiv Detail & Related papers (2023-05-31T16:19:13Z)
- CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification [24.44357623723746]
We propose CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions.
Our results suggest Shapley-based data valuation is transferable for application across different models.
arXiv Detail & Related papers (2022-11-13T03:32:33Z)
- Differentially Private Shapley Values for Data Evaluation [3.616258473002814]
Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(polylog(n))$) random samples of data and small ($O(log n)$) coalitions to achieve results with guaranteed probabilistic accuracy.
arXiv Detail & Related papers (2022-06-01T14:14:24Z)
- Data Banzhaf: A Robust Data Valuation Framework for Machine Learning [18.65808473565554]
This paper studies the robustness of data valuation to noisy model performance scores.
We introduce the concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value achieves the largest safety margin among all semivalues.
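For reference, the Banzhaf value averages a point's marginal contribution uniformly over all coalitions of the remaining points, unlike the Shapley value's size-dependent weights. A minimal exact-enumeration sketch, with an assumed additive toy utility standing in for model performance:

```python
import itertools

def banzhaf_values(n, utility):
    # Exact Banzhaf value: for each point i, average its marginal
    # contribution over all 2^(n-1) subsets of the remaining points.
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for subset in itertools.combinations(others, r):
                s = set(subset)
                total += utility(s | {i}) - utility(s)
        values.append(total / 2 ** (n - 1))
    return values

# Assumed additive toy utility: a coalition is worth the sum of its points.
worth = [3.0, 1.0, 2.0]
print(banzhaf_values(3, lambda s: sum(worth[j] for j in s)))  # [3.0, 1.0, 2.0]
```

In an additive game every marginal contribution equals the point's own worth, so Banzhaf and Shapley values coincide; they differ once the utility has interactions between points.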
arXiv Detail & Related papers (2022-05-30T23:44:09Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
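ATC can be sketched in two steps: fit a confidence threshold on labeled source data so that the fraction of examples above it matches source accuracy, then report the fraction of unlabeled target examples above that threshold. The scoring function and numbers below are illustrative assumptions:

```python
def atc_threshold(source_scores, source_correct):
    # Fit the threshold on held-out *source* data: choose t so that the
    # fraction of source examples scoring above t matches source accuracy.
    acc = sum(source_correct) / len(source_correct)
    scores = sorted(source_scores, reverse=True)
    k = round(acc * len(scores))          # how many examples should clear t
    return scores[k - 1] if k > 0 else float("inf")

def atc_estimate(target_scores, threshold):
    # Predicted target accuracy: fraction of unlabeled target examples
    # whose confidence score clears the source-fitted threshold.
    return sum(1 for s in target_scores if s >= threshold) / len(target_scores)

# Toy confidence scores (e.g. max softmax probability) and correctness flags.
t = atc_threshold([0.9, 0.8, 0.7, 0.6], [1, 1, 1, 0])
print(t)                                          # 0.7
print(atc_estimate([0.95, 0.65, 0.75, 0.5], t))   # 0.5
```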
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Towards Efficient Data Valuation Based on the Shapley Value [65.4167993220998]
We study the problem of data valuation by utilizing the Shapley value.
The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value.
We propose a repertoire of efficient algorithms for approximating the Shapley value.
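One standard family of such approximation algorithms is permutation sampling: average each point's marginal contribution over random orderings of the dataset. The sketch below is generic, not the paper's specific estimators, and the additive toy utility stands in for retraining and scoring a model on each coalition:

```python
import random

def shapley_mc(n, utility, num_perms=200, seed=0):
    # Permutation-sampling Shapley estimate: for each random ordering,
    # credit every point with its marginal contribution when it is added.
    rng = random.Random(seed)
    values = [0.0] * n
    for _ in range(num_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        coalition, prev = set(), utility(set())
        for i in perm:
            coalition.add(i)
            cur = utility(coalition)
            values[i] += cur - prev
            prev = cur
    return [v / num_perms for v in values]

# Assumed additive toy utility; real use would retrain a model per coalition.
worth = [3.0, 1.0, 2.0]
print(shapley_mc(3, lambda s: sum(worth[j] for j in s)))  # [3.0, 1.0, 2.0]
```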
arXiv Detail & Related papers (2019-02-27T00:22:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.