Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
- URL: http://arxiv.org/abs/2205.15466v7
- Date: Mon, 18 Dec 2023 14:57:40 GMT
- Title: Data Banzhaf: A Robust Data Valuation Framework for Machine Learning
- Authors: Jiachen T. Wang, Ruoxi Jia
- Abstract summary: This paper studies the robustness of data valuation to noisy model performance scores.
We introduce the concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value achieves the largest safety margin among all semivalues.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data valuation has wide use cases in machine learning, including improving
data quality and creating economic incentives for data sharing. This paper
studies the robustness of data valuation to noisy model performance scores.
In particular, we find that the inherent randomness of the widely used
stochastic gradient descent can cause existing data value notions (e.g., the
Shapley value and the Leave-one-out error) to produce inconsistent data value
rankings across different runs. To address this challenge, we introduce the
concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value, a well-known value notion that originated in the
cooperative game theory literature, achieves the largest safety margin among
all semivalues (a class of value notions that satisfy crucial properties
entailed by ML applications and include the famous Shapley value and
Leave-one-out error). We propose an algorithm to efficiently estimate the
Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation
demonstrates that the Banzhaf value outperforms the existing semivalue-based
data value notions on several ML tasks such as learning with weighted samples
and noisy label detection. Overall, our study suggests that when the underlying
ML algorithm is stochastic, the Banzhaf value is a promising alternative to the
other semivalue-based data value schemes given its computational advantage and
ability to robustly differentiate data quality.
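The Maximum Sample Reuse idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard Banzhaf definition (the average marginal contribution of a point over all subsets of the remaining points, each weighted equally) and estimates it by sampling subsets uniformly, then reusing every sampled utility for every data point: a point's value is approximated as the mean utility of sampled subsets that contain it minus the mean utility of those that do not. The names `msr_banzhaf`, `utility`, and `num_samples` are hypothetical and chosen for illustration; this is not the authors' reference implementation.

```python
import random
from typing import Callable, List, Sequence


def msr_banzhaf(
    n: int,
    utility: Callable[[Sequence[int]], float],
    num_samples: int,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of Banzhaf values for n data points.

    `utility` is an assumed callback that maps a list of point indices to a
    performance score (for example, validation accuracy of a model trained
    on those points).
    """
    rng = random.Random(seed)
    sum_in = [0.0] * n   # total utility of sampled subsets containing i
    cnt_in = [0] * n
    sum_out = [0.0] * n  # total utility of sampled subsets not containing i
    cnt_out = [0] * n

    for _ in range(num_samples):
        # Including each point independently with probability 1/2 draws a
        # subset uniformly at random from all 2^n subsets.
        mask = [rng.random() < 0.5 for _ in range(n)]
        subset = [i for i in range(n) if mask[i]]
        u = utility(subset)
        # Reuse the same utility evaluation for every data point.
        for i in range(n):
            if mask[i]:
                sum_in[i] += u
                cnt_in[i] += 1
            else:
                sum_out[i] += u
                cnt_out[i] += 1

    values = []
    for i in range(n):
        mean_in = sum_in[i] / cnt_in[i] if cnt_in[i] else 0.0
        mean_out = sum_out[i] / cnt_out[i] if cnt_out[i] else 0.0
        values.append(mean_in - mean_out)
    return values
```

For a toy additive utility such as `lambda S: sum(w[i] for i in S)`, the exact Banzhaf value of point `i` is `w[i]`, so the estimates should approach the weights as `num_samples` grows. Each utility evaluation updates the estimate of every point, which is the reuse property this sketch is meant to show.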