A Note on "Towards Efficient Data Valuation Based on the Shapley Value"
- URL: http://arxiv.org/abs/2302.11431v1
- Date: Wed, 22 Feb 2023 15:13:45 GMT
- Title: A Note on "Towards Efficient Data Valuation Based on the Shapley Value"
- Authors: Jiachen T. Wang, Ruoxi Jia
- Abstract summary: The Shapley value (SV) has emerged as a promising method for data valuation.
The Group Testing-based SV estimator achieves favorable asymptotic sample complexity.
- Score: 7.4011772612133475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Shapley value (SV) has emerged as a promising method for data valuation.
However, computing or estimating the SV is often computationally expensive. To
overcome this challenge, Jia et al. (2019) propose an advanced SV estimation
algorithm called "Group Testing-based SV estimator", which achieves favorable
asymptotic sample complexity. In this technical note, we present several
improvements in the analysis and design choices of this SV estimator. Moreover,
we point out that the Group Testing-based SV estimator does not fully reuse the
collected samples. Our analysis and insights contribute to a better
understanding of the challenges in developing efficient SV estimation
algorithms for data valuation.
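To make the cost issue in the abstract concrete, here is a minimal Python sketch that compares exact Shapley values with a plain permutation-sampling Monte Carlo estimate. This is not the Group Testing-based estimator of Jia et al. (2019); it is a generic baseline shown only for illustration. The function names, the four-point dataset, and `toy_utility` are all assumptions introduced here; in data valuation the utility would typically be the performance of a model trained on the subset.

```python
import itertools
import math
import random


def exact_shapley(n, utility):
    """Exact Shapley values by enumerating all subsets; cost grows as 2^n."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for a subset of size k that excludes point i.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                s = frozenset(subset)
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def permutation_shapley(n, utility, num_permutations, seed=0):
    """Monte Carlo estimate: average marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        prefix = set()
        prev = utility(frozenset(prefix))
        for i in order:
            prefix.add(i)
            cur = utility(frozenset(prefix))
            totals[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [t / num_permutations for t in totals]


def toy_utility(subset):
    """Hypothetical utility: stands in for model accuracy on a data subset."""
    weights = [1.0, 2.0, 3.0, 4.0]
    return math.sqrt(sum(weights[i] for i in subset))


if __name__ == "__main__":
    print("exact:    ", exact_shapley(4, toy_utility))
    print("estimated:", permutation_shapley(4, toy_utility, num_permutations=2000))
```

The exact routine evaluates the utility on every subset, which is only feasible for tiny datasets, while the sampling routine needs one utility call per data point per permutation. Each sampled permutation contributes marginal gains that telescope to the utility of the full dataset, so the estimates always sum to the same total as the exact values.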
Related papers
- Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help? [0.0]
We show that mitigating data imbalance can significantly improve the predictive performance of models for all the Common Vulnerability Scoring System (CVSS) tasks.
We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board.
arXiv Detail & Related papers (2024-07-15T13:47:55Z)
- Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
We propose a straightforward and efficient Shapley estimator, SimSHAP, by eliminating redundant techniques.
In our analysis of existing approaches, we observe that estimators can be unified as a linear transformation of randomly summed values from feature subsets.
Our experiments validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
arXiv Detail & Related papers (2023-11-02T06:09:24Z)
- DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation [23.646508094051768]
We consider the dataset valuation problem, that is, the problem of quantifying the incremental gain.
The Shapley value is a natural tool to perform dataset valuation due to its formal axiomatic justification.
We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution.
arXiv Detail & Related papers (2023-06-03T10:22:50Z)
- Probably Approximate Shapley Fairness with Applications in Machine Learning [18.05783128571293]
The Shapley value (SV) is adopted in various scenarios in machine learning (ML).
As exact SVs are infeasible to compute in practice, SV estimates are approximated instead.
This approximation step raises an important question: do the SV estimates preserve the fairness guarantees of exact SVs?
We observe that the fairness guarantees of exact SVs are too restrictive for SV estimates.
arXiv Detail & Related papers (2022-12-01T16:28:20Z)
- Design Guidelines for Inclusive Speaker Verification Evaluation Datasets [0.6015898117103067]
Speaker verification (SV) provides billions of voice-enabled devices with access control, and ensures the security of voice-driven technologies.
Current SV evaluation practices are insufficient for evaluating bias: they are over-simplified and aggregate users, not representative of real-life usage scenarios.
This paper proposes design guidelines for constructing SV evaluation datasets that address these shortcomings.
arXiv Detail & Related papers (2022-04-05T15:28:26Z)
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
- Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation [59.7305309038676]
We propose Active Surrogate Estimators (ASEs) for model evaluation.
We find that ASEs offer greater label-efficiency than the current state-of-the-art.
arXiv Detail & Related papers (2022-02-14T17:15:18Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
- A Survey on Data-driven Software Vulnerability Assessment and Prioritization [0.0]
Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems.
Data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level.
arXiv Detail & Related papers (2021-07-18T04:49:22Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.