Efficient Data Shapley for Weighted Nearest Neighbor Algorithms
- URL: http://arxiv.org/abs/2401.11103v1
- Date: Sat, 20 Jan 2024 03:34:18 GMT
- Title: Efficient Data Shapley for Weighted Nearest Neighbor Algorithms
- Authors: Jiachen T. Wang, Prateek Mittal, and Ruoxi Jia
- Abstract summary: We develop an efficient algorithm for computing Data Shapley for the weighted $K$-nearest-neighbor algorithm (WKNN-Shapley).
We show WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
- Score: 47.62605581521535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to address an open problem in data valuation literature
concerning the efficient computation of Data Shapley for weighted $K$ nearest
neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label
KNN with discretized weights as the utility function, we reframe the
computation of WKNN-Shapley into a counting problem and introduce a
quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the
best result from existing literature. We develop a deterministic approximation
algorithm that further improves computational efficiency while maintaining the
key fairness properties of the Shapley value. Through extensive experiments, we
demonstrate WKNN-Shapley's computational efficiency and its superior
performance in discerning data quality compared to its unweighted counterpart.
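For context on the unweighted counterpart the paper compares against: Jia et al. (VLDB 2019) showed that for soft-label KNN (utility = fraction of the $K$ nearest neighbors with the correct label), exact Data Shapley values admit a simple recursion over training points sorted by distance to the test point, costing $O(N \log N)$ per test point. A minimal sketch of that recursion in Python (variable names are illustrative, not from the paper):

```python
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, K):
    """Exact Data Shapley for soft-label KNN on one test point,
    via the recursion of Jia et al. (VLDB 2019)."""
    N = len(X_train)
    # Rank training points by distance to the test point (ascending).
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))
    s = np.zeros(N)
    # Base case: the farthest point gets 1[y == y_test] / N.
    s[order[-1]] = float(y_train[order[-1]] == y_test) / N
    # Recurse inward from the second-farthest point.
    for j in range(N - 2, -1, -1):
        i = j + 1  # 1-indexed rank of point order[j]
        match_i = float(y_train[order[j]] == y_test)
        match_next = float(y_train[order[j + 1]] == y_test)
        s[order[j]] = s[order[j + 1]] + (match_i - match_next) / K * min(K, i) / i
    return s
```

By efficiency of the Shapley value, the entries sum to the full-data utility; when every training label matches the test label, all points receive the identical value $1/N$. The weighted, hard-label setting addressed by WKNN-Shapley does not reduce to this recursion, which is what motivates the paper's counting reformulation.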
Related papers
- Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
We propose a straightforward and efficient Shapley estimator, SimSHAP, by eliminating redundant techniques.
In our analysis of existing approaches, we observe that estimators can be unified as a linear transformation of randomly summed values from feature subsets.
Our experiments validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
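As a point of reference for the sampling-based estimators this line of work unifies: the classic permutation-sampling baseline estimates each player's Shapley value as its average marginal contribution over random orderings. A generic sketch of that baseline (not the SimSHAP algorithm itself):

```python
import random

def permutation_shapley(players, utility, n_samples=2000, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution over randomly sampled permutations."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_samples):
        perm = players[:]
        rng.shuffle(perm)
        coalition = set()
        prev_u = utility(coalition)
        for p in perm:
            coalition.add(p)
            u = utility(coalition)
            phi[p] += u - prev_u
            prev_u = u
    return {p: v / n_samples for p, v in phi.items()}
```

For an additive utility (e.g. each player contributes a fixed weight), every permutation yields the same marginal contribution, so this estimator is exact; for general utilities its variance shrinks as the number of sampled permutations grows.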
arXiv Detail & Related papers (2023-11-02T06:09:24Z)
- High-dimensional Contextual Bandit Problem without Sparsity [8.782204980889077]
We propose an explore-then-commit (EtC) algorithm to address this problem and examine its performance.
We derive the optimal rate of the ETC algorithm in terms of $T$ and show that this rate can be achieved by balancing exploration and exploitation.
We introduce an adaptive explore-then-commit (AEtC) algorithm that adaptively finds the optimal balance.
arXiv Detail & Related papers (2023-06-19T15:29:32Z)
- A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms" [18.65808473565554]
Jia et al. showed that for K-Nearest Neighbors (KNN) models, Data Shapley can be computed in a surprisingly simple and efficient way.
We propose a more natural and interpretable utility function that better reflects the performance of KNN models.
Our new approach, dubbed soft-label KNNSV, achieves the same time complexity as the original method.
arXiv Detail & Related papers (2023-04-09T15:31:53Z)
- Optimizing Data Shapley Interaction Calculation from O(2^n) to O(t n^2) for KNN models [2.365702128814616]
"STI-KNN" is an innovative algorithm that calculates the exact pair-interaction Shapley values for KNN models in $O(t n^2)$ time.
By using STI-KNN, we can efficiently and accurately evaluate the value of individual data points, leading to improved training outcomes and ultimately enhancing the effectiveness of artificial intelligence applications.
arXiv Detail & Related papers (2023-04-02T06:15:19Z)
- Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics, defined by minimizing the population loss, that are better suited to active learning than the metric used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- Accelerating Shapley Explanation via Contributive Cooperator Selection [42.11059072201565]
We propose a novel method SHEAR to significantly accelerate the Shapley explanation for DNN models.
The selection of the feature coalitions follows our proposed Shapley chain rule to minimize the absolute error from the ground-truth Shapley values.
SHEAR consistently outperforms state-of-the-art baseline methods across different evaluation metrics.
arXiv Detail & Related papers (2022-06-17T03:24:45Z)
- Differentially Private Shapley Values for Data Evaluation [3.616258473002814]
Shapley values are computationally expensive and involve the entire dataset.
We propose a new stratified approximation method called the Layered Shapley Algorithm.
We prove that this method operates on small ($O(\mathrm{polylog}(n))$) random samples of data and small ($O(\log n)$) coalitions, achieving the results with guaranteed probabilistic accuracy.
arXiv Detail & Related papers (2022-06-01T14:14:24Z)
- Channel Assignment in Uplink Wireless Communication using Machine Learning Approach [54.012791474906514]
This letter investigates a channel assignment problem in uplink wireless communication systems.
Our goal is to maximize the sum rate of all users subject to integer channel assignment constraints.
Due to the high computational complexity of this problem, machine learning approaches are employed to obtain computationally efficient solutions.
arXiv Detail & Related papers (2020-01-12T15:54:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.