Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
- URL: http://arxiv.org/abs/2205.15466v7
- Date: Mon, 18 Dec 2023 14:57:40 GMT
- Title: Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
- Authors: Jiachen T. Wang, Ruoxi Jia
- Abstract summary: This paper studies the robustness of data valuation to noisy model performance scores.
We introduce the concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value achieves the largest safety margin among all semivalues.
- Score: 18.65808473565554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation has wide use cases in machine learning, including improving
data quality and creating economic incentives for data sharing. This paper
studies the robustness of data valuation to noisy model performance scores.
Particularly, we find that the inherent randomness of the widely used
stochastic gradient descent can cause existing data value notions (e.g., the
Shapley value and the Leave-one-out error) to produce inconsistent data value
rankings across different runs. To address this challenge, we introduce the
concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value, a famous value notion that originated from
cooperative game theory literature, achieves the largest safety margin among
all semivalues (a class of value notions that satisfy crucial properties
entailed by ML applications and include the famous Shapley value and
Leave-one-out error). We propose an algorithm to efficiently estimate the
Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation
demonstrates that the Banzhaf value outperforms the existing semivalue-based
data value notions on several ML tasks such as learning with weighted samples
and noisy label detection. Overall, our study suggests that when the underlying
ML algorithm is stochastic, the Banzhaf value is a promising alternative to the
other semivalue-based data value schemes given its computational advantage and
ability to robustly differentiate data quality.
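The Maximum Sample Reuse principle mentioned in the abstract can be sketched in a few lines: sample subsets uniformly at random, evaluate the utility once per subset, and reuse every evaluation for all n points by comparing the average utility of subsets that contain point i against the average of those that do not. The additive toy utility below is our own illustration, not the paper's experimental setup:

```python
import random

def msr_banzhaf(n, utility, num_samples=2000, seed=0):
    """Maximum Sample Reuse (MSR) estimator for the Banzhaf value.

    Every sampled subset's utility is reused for all n points:
    phi_i ~ mean{U(S) : i in S} - mean{U(S) : i not in S}.
    """
    rng = random.Random(seed)
    in_sum, in_cnt = [0.0] * n, [0] * n
    out_sum, out_cnt = [0.0] * n, [0] * n
    for _ in range(num_samples):
        # Each point joins independently with prob 1/2, so the
        # subset is uniform over all 2^n subsets of {0, ..., n-1}.
        member = {i for i in range(n) if rng.random() < 0.5}
        u = utility(member)
        for i in range(n):
            if i in member:
                in_sum[i] += u
                in_cnt[i] += 1
            else:
                out_sum[i] += u
                out_cnt[i] += 1
    return [
        in_sum[i] / max(in_cnt[i], 1) - out_sum[i] / max(out_cnt[i], 1)
        for i in range(n)
    ]

# Toy additive utility: a subset's value is the sum of its points'
# "qualities", so each point's Banzhaf value equals its quality.
qualities = [1.0, 0.5, 0.0, -0.5]
values = msr_banzhaf(len(qualities), lambda s: sum(qualities[i] for i in s))
```

With an additive utility the exact Banzhaf value of point i is its quality, so the estimates should recover the quality ranking; for a real ML utility, `utility` would train a model on the subset and return a performance score.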
Related papers
- Is Data Valuation Learnable and Interpretable? [3.9325957466009203]
Current data valuation methods ignore the interpretability of the output values.
This study aims to answer an important question: is data valuation learnable and interpretable?
arXiv Detail & Related papers (2024-06-03T08:13:47Z)
- EcoVal: An Efficient Data Valuation Framework for Machine Learning [11.685518953430554]
Existing Shapley value based frameworks for data valuation in machine learning are computationally expensive.
We introduce an efficient data valuation framework EcoVal, to estimate the value of data for machine learning models in a fast and practical manner.
arXiv Detail & Related papers (2024-02-14T16:21:47Z)
- Shapley Value on Probabilistic Classifiers [6.163093930860032]
In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose Probabilistic Shapley (P-Shapley) value by constructing a probability-wise utility function.
arXiv Detail & Related papers (2023-06-12T15:09:13Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score (UIPS) estimator for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
- CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification [24.44357623723746]
We propose CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions.
Our results suggest Shapley-based data valuation is transferable for application across different models.
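The class-wise idea can be illustrated with a toy utility that rewards a coalition's accuracy on the target class while discounting coalitions that hurt the other classes. This is only a hypothetical sketch of a class-conditioned value function; the function name, arguments, and the exponential discounting are our assumptions, not the paper's exact construction:

```python
import math

def class_wise_utility(in_class_acc, out_class_acc):
    """Hypothetical class-conditioned utility (illustration only).

    Both arguments are accuracies in [0, 1] of a model trained on a
    coalition S: accuracy on the target class and on all other classes.
    In-class accuracy is scaled by exp(out_class_acc - 1), so coalitions
    that degrade the remaining classes receive a lower value.
    """
    return in_class_acc * math.exp(out_class_acc - 1.0)
```

Plugging such a utility into a Shapley estimator yields per-class data values that separate in-class from out-of-class contributions.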
arXiv Detail & Related papers (2022-11-13T03:32:33Z)
- Differentially Private Shapley Values for Data Evaluation [3.616258473002814]
Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of the data and small ($O(\log n)$) coalitions, achieving results with guaranteed probabilistic accuracy.
arXiv Detail & Related papers (2022-06-01T14:14:24Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
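The thresholding step can be sketched as follows: pick a threshold on labeled-source confidences so that the fraction of source points above it matches source accuracy, then report the fraction of unlabeled target points above the same threshold. The quantile-based threshold fit and the toy confidence values are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def atc_predict(source_conf, source_correct, target_conf):
    """Sketch of Average Thresholded Confidence (ATC).

    Choose threshold t so that the share of source confidences above t
    equals source accuracy, then predict target accuracy as the share
    of target confidences above t.
    """
    source_acc = np.mean(source_correct)
    # The (1 - accuracy) quantile leaves an `accuracy` fraction above t.
    t = np.quantile(source_conf, 1.0 - source_acc)
    return float(np.mean(target_conf > t))

# Toy example (made-up confidences): 4 of 6 source points are correct,
# so the threshold lands between the correct and incorrect groups.
source_conf = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
source_correct = np.array([1, 1, 1, 1, 0, 0])
pred = atc_predict(source_conf, source_correct, np.array([0.95, 0.7, 0.1]))
```

Here two of the three target confidences clear the learned threshold, so the predicted target accuracy is 2/3.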
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Towards Efficient Data Valuation Based on the Shapley Value [65.4167993220998]
We study the problem of data valuation by utilizing the Shapley value.
The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value.
We propose a repertoire of efficient algorithms for approximating the Shapley value.
arXiv Detail & Related papers (2019-02-27T00:22:43Z)
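The standard Monte Carlo approach that this line of efficient-approximation work builds on is permutation sampling: average each point's marginal contribution over random orderings. A minimal sketch, with an additive toy utility of our own choosing:

```python
import random

def permutation_shapley(n, utility, num_perms=500, seed=0):
    """Permutation-sampling Monte Carlo estimator of the Shapley value.

    For each random ordering of the n points, record every point's
    marginal contribution when it joins the growing coalition; the
    Shapley value is the average contribution over orderings.
    """
    rng = random.Random(seed)
    values = [0.0] * n
    for _ in range(num_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        coalition = []
        prev_u = utility(coalition)
        for i in perm:
            coalition.append(i)
            u = utility(coalition)
            values[i] += u - prev_u
            prev_u = u
    return [v / num_perms for v in values]

# For an additive utility the Shapley value of each point is exactly
# its quality, so the estimator recovers the qualities.
qualities = [2.0, 1.0, 0.0]
shap = permutation_shapley(len(qualities),
                           lambda s: sum(qualities[i] for i in s))
```

Each permutation costs n utility evaluations (i.e., model retrainings in the data valuation setting), which is exactly the expense that both the efficient Shapley algorithms above and the MSR Banzhaf estimator aim to reduce.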
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.