CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification
- URL: http://arxiv.org/abs/2211.06800v1
- Date: Sun, 13 Nov 2022 03:32:33 GMT
- Title: CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification
- Authors: Stephanie Schoch, Haifeng Xu, Yangfeng Ji
- Abstract summary: We propose CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions.
Our results suggest Shapley-based data valuation is transferable for application across different models.
- Score: 24.44357623723746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation, or the valuation of individual datum contributions, has seen
growing interest in machine learning due to its demonstrable efficacy for tasks
such as noisy label detection. In particular, due to the desirable axiomatic
properties, several Shapley value approximation methods have been proposed. In
these methods, the value function is typically defined as the predictive
accuracy over the entire development set. However, this limits the ability to
differentiate between training instances that are helpful or harmful to their
own classes. Intuitively, instances that harm their own classes may be noisy or
mislabeled and should receive a lower valuation than helpful instances. In this
work, we propose CS-Shapley, a Shapley value with a new value function that
discriminates between training instances' in-class and out-of-class
contributions. Our theoretical analysis shows the proposed value function is
(essentially) the unique function that satisfies two desirable properties for
evaluating data values in classification. Further, our experiments on two
benchmark evaluation tasks (data removal and noisy label detection) and four
classifiers demonstrate the effectiveness of CS-Shapley over existing methods.
Lastly, we evaluate the "transferability" of data values estimated from one
classifier to others, and our results suggest Shapley-based data valuation is
transferable for application across different models.
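
The baseline the abstract describes (a Shapley value whose value function is accuracy over the entire development set) can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 1-nearest-neighbour classifier, the toy data, the convention that an empty training set has value 0, and the helper names `predict_1nn`, `accuracy`, `in_and_out_of_class_accuracy` and `exact_shapley`. The second helper only splits accuracy into the in-class and out-of-class parts that CS-Shapley discriminates between; how the paper combines them into its value function is not stated in the abstract and is not reproduced here.

```python
from itertools import permutations


def predict_1nn(train, x):
    """Label of the training point nearest to x (1-D features, toy classifier)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, dev):
    """Baseline value function: accuracy over the entire development set."""
    if not train:
        return 0.0  # assumption: an empty training set has value 0
    hits = sum(predict_1nn(train, x) == y for x, y in dev)
    return hits / len(dev)


def in_and_out_of_class_accuracy(train, dev, cls):
    """Split correct dev predictions by whether the dev label equals cls.

    Both parts are divided by len(dev), so they add up to accuracy(train, dev).
    """
    if not train:
        return 0.0, 0.0
    in_hits = out_hits = 0
    for x, y in dev:
        if predict_1nn(train, x) == y:
            if y == cls:
                in_hits += 1
            else:
                out_hits += 1
    return in_hits / len(dev), out_hits / len(dev)


def exact_shapley(train, dev, value_fn=accuracy):
    """Exact Shapley values: average marginal contribution over all orderings.

    Feasible only for tiny training sets; real datasets need an approximation
    such as sampling a limited number of permutations.
    """
    n = len(train)
    totals = [0.0] * n
    count = 0
    for order in permutations(range(n)):
        subset, previous = [], value_fn([], dev)
        for i in order:
            subset.append(train[i])
            current = value_fn(subset, dev)
            totals[i] += current - previous  # marginal contribution of point i
            previous = current
        count += 1
    return [total / count for total in totals]


# Toy data as (feature, label); the last training point is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1), (0.5, 1)]
dev = [(0.2, 0), (0.8, 0), (9.2, 1), (9.8, 1)]
values = exact_shapley(train, dev)
```

On this toy data the mislabeled point receives the smallest value, and the values sum to the accuracy of the full training set minus that of the empty set, as the efficiency axiom requires. Because the mislabeled point still gets credit whenever it helps the development instances of its (wrong) labeled class, its value stays positive under the whole-set accuracy baseline.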
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Accelerated Shapley Value Approximation for Data Evaluation [3.707457963532597]
We show that the Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
arXiv Detail & Related papers (2023-11-09T13:15:36Z)
- Shapley Value on Probabilistic Classifiers [6.163093930860032]
In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose Probabilistic Shapley (P-Shapley) value by constructing a probability-wise utility function.
arXiv Detail & Related papers (2023-06-12T15:09:13Z)
- Efficient Shapley Values Estimation by Amortization for Text Classification [66.7725354593271]
We develop an amortized model that directly predicts each input feature's Shapley Value without additional model evaluations.
Experimental results on two text classification datasets demonstrate that our amortized model estimates Shapley Values accurately with up to 60 times speedup.
arXiv Detail & Related papers (2023-05-31T16:19:13Z)
- Data Banzhaf: A Robust Data Valuation Framework for Machine Learning [18.65808473565554]
This paper studies the robustness of data valuation to noisy model performance scores.
We introduce the concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value achieves the largest safety margin among all semivalues.
arXiv Detail & Related papers (2022-05-30T23:44:09Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
- Certified Robustness to Label-Flipping Attacks via Randomized Smoothing [105.91827623768724]
Machine learning algorithms are susceptible to data poisoning attacks.
We present a unifying view of randomized smoothing over arbitrary functions.
We propose a new strategy for building classifiers that are pointwise-certifiably robust to general data poisoning attacks.
arXiv Detail & Related papers (2020-02-07T21:28:30Z)
- Interpretable feature subset selection: A Shapley value based approach [1.511944009967492]
We introduce the notion of classification game, a cooperative game with features as players and hinge loss based characteristic function.
Our major contribution is ($\star$) to show that, for any dataset, the threshold 0 on the SVEA value identifies a feature subset whose joint interactions for label prediction are significant.
arXiv Detail & Related papers (2020-01-12T16:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.