Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits
- URL: http://arxiv.org/abs/2402.08209v1
- Date: Tue, 13 Feb 2024 04:17:48 GMT
- Title: Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits
- Authors: Hiroyuki Namba, Shota Horiguchi, Masaki Hamamoto, Masashi Egi
- Abstract summary: Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset.
Data Shapley is a common, theoretically guaranteed method to evaluate the contribution of each instance to model performance.
We propose an iterative method to quickly identify a subset of instances with low data Shapley values using a thresholding bandit algorithm.
- Score: 7.335578524351567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data cleansing aims to improve model performance by removing a set of harmful
instances from the training dataset. Data Shapley is a common, theoretically
guaranteed method to evaluate the contribution of each instance to model
performance; however, it requires training on all subsets of the training data,
which is computationally expensive. In this paper, we propose an iterative
method to quickly identify a subset of instances with low data Shapley
values by using the thresholding bandit algorithm. We provide a theoretical
guarantee that the proposed method can accurately select harmful instances if a
sufficiently large number of iterations is performed. Empirical evaluations
across various models and datasets demonstrate that the proposed method
improves computational speed while maintaining model performance.
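As a rough illustration of the core idea (a sketch, not the authors' exact algorithm): treat each training instance as an arm of a thresholding bandit, where each pull draws one Monte Carlo sample of that instance's marginal contribution, and an APT-style index concentrates pulls on instances whose data Shapley value is hardest to classify against the threshold. The dataset, model, threshold tau, and budget below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=60, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
n = len(X_tr)

def utility(idx):
    """Validation accuracy of a model trained on the subset idx."""
    if len(idx) < 2 or len(set(y_tr[idx])) < 2:
        return 0.5  # chance level for degenerate subsets
    return LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx]).score(X_val, y_val)

def pull(i):
    """One noisy Monte Carlo sample of instance i's marginal contribution."""
    perm = rng.permutation(n)
    pos = int(np.where(perm == i)[0][0])
    before = perm[:pos]
    return utility(np.append(before, i)) - utility(before)

# Thresholding bandit (APT-style): decide, per instance, whether its data
# Shapley value lies below the threshold tau, spending pulls where the
# decision is least certain.
tau, eps, budget = 0.0, 0.005, 600
counts = np.ones(n)
means = np.array([pull(i) for i in range(n)])  # one initial pull per arm
for _ in range(budget - n):
    i = int(np.argmin(np.sqrt(counts) * (np.abs(means - tau) + eps)))  # APT index
    means[i] += (pull(i) - means[i]) / (counts[i] + 1)
    counts[i] += 1

print("flagged as harmful:", np.where(means < tau)[0])
```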
Related papers
- Fast-DataShapley: Neural Modeling for Training Data Valuation [40.630258021732544]
We propose Fast-DataShapley, a one-pass training method to train a reusable explainer model with real-time reasoning speed.
Given new test samples, no retraining is required to calculate the Shapley values of the training data.
arXiv Detail & Related papers (2025-06-05T17:35:46Z)
- Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value [18.858879113762917]
We propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently.
Our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data.
This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.
arXiv Detail & Related papers (2025-05-22T02:46:03Z)
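Setting the unlearning component aside, the Monte Carlo backbone mentioned in the summary above is typically a truncated permutation-sampling estimator of data Shapley values. A minimal sketch, using a small sklearn model as the utility function (all specifics are assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=40, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)
n = len(X_tr)

def utility(idx):
    if len(idx) < 2 or len(set(y_tr[idx])) < 2:
        return 0.5
    return LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx]).score(X_val, y_val)

full = utility(np.arange(n))
shapley = np.zeros(n)
for t in range(1, 31):                      # 30 sampled permutations
    perm = rng.permutation(n)
    prev = utility(np.array([], dtype=int))
    for k, i in enumerate(perm):
        if abs(full - prev) < 0.01:         # truncation: the prefix already
            marg = 0.0                      # matches the full-data utility
        else:
            cur = utility(perm[: k + 1])
            marg, prev = cur - prev, cur
        shapley[i] += (marg - shapley[i]) / t   # running mean over permutations

print("lowest-valued instances:", np.argsort(shapley)[:5])
```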
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
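A minimal sketch of the utility-prediction idea above, assuming subsets are featurized as binary membership vectors (the paper's actual featurization and kernel may differ): evaluate a few subsets by retraining, then let a GP predict the rest.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
n = 12                                       # tiny data pool, for illustration

def true_utility(mask):
    """Stand-in for 'retrain and evaluate'; replace with real model training."""
    good = mask[: n // 2].sum()              # pretend the first half is helpful
    bad = mask[n // 2 :].sum()
    return 0.5 + 0.04 * good - 0.02 * bad + rng.normal(0, 0.01)

# Evaluate a few subsets by (simulated) retraining ...
train_masks = rng.integers(0, 2, size=(30, n)).astype(float)
train_utils = np.array([true_utility(m) for m in train_masks])

# ... then fit a GP on (membership vector -> utility) and predict the rest.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(train_masks, train_utils)

query = rng.integers(0, 2, size=(5, n)).astype(float)
mean, std = gp.predict(query, return_std=True)
for m, mu, s in zip(query, mean, std):
    print(m.astype(int), f"predicted utility {mu:.3f} +/- {s:.3f}")
```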
- DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks [40.91931801667421]
This paper presents a novel global-to-local algorithm called DUET that can exploit the feedback loop by interleaving a data selection method with Bayesian optimization.
As a result, DUET can efficiently refine the training data mixture from a pool of data domains to maximize the model's performance on the unseen evaluation task.
arXiv Detail & Related papers (2025-02-01T01:52:32Z)
- SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training [12.745160748376794]
We propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness.
Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication.
Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps.
arXiv Detail & Related papers (2024-07-09T08:26:39Z)
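A toy sketch of the soft deduplication idea above, under stated assumptions: here "data commonness" is proxied by the mean corpus frequency of a document's n-grams, and sampling weights are taken inversely proportional to it; the paper defines its own metric and weighting scheme.

```python
from collections import Counter

def ngrams(text, n=3):
    toks = text.lower().split()
    return [tuple(toks[i : i + n]) for i in range(len(toks) - n + 1)]

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the sleeping cat",
    "a completely different sentence about language models",
]

corpus = Counter(g for d in docs for g in ngrams(d))  # corpus-level n-gram counts

def commonness(doc):
    """Proxy for 'data commonness': mean corpus frequency of the doc's n-grams."""
    grams = ngrams(doc)
    return sum(corpus[g] for g in grams) / len(grams) if grams else 1.0

# Soft deduplication: keep every document, but downweight common ones
# (inverse weighting is an assumption, not the paper's exact scheme).
weights = [1.0 / commonness(d) for d in docs]
total = sum(weights)
for d, w in zip(docs, weights):
    print(f"sampling prob {w / total:.3f}  {d[:50]}")
```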
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
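In-run attribution builds on quantities already available during training. A first-order sketch in that spirit (a TracIn-style gradient inner product, not the paper's exact estimator): the contribution of example i at a step is approximated by how much its SGD step reduces the validation loss.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(5, 1)
loss_fn = torch.nn.MSELoss()
X, y = torch.randn(20, 5), torch.randn(20, 1)
X_val, y_val = torch.randn(8, 5), torch.randn(8, 1)

def grads_of(loss):
    return torch.autograd.grad(loss, list(model.parameters()))

def flat(gs):
    return torch.cat([g.flatten() for g in gs])

lr = 0.1
contrib = torch.zeros(len(X))
for step in range(60):
    i = step % len(X)                        # one example per step, for clarity
    val_g = flat(grads_of(loss_fn(model(X_val), y_val)))
    train_gs = grads_of(loss_fn(model(X[i : i + 1]), y[i : i + 1]))
    # First-order attribution: an SGD step on example i changes the
    # validation loss by roughly -lr * <grad_i, grad_val>.
    contrib[i] += lr * torch.dot(flat(train_gs), val_g)
    with torch.no_grad():                    # manual SGD step on example i
        for p, g in zip(model.parameters(), train_gs):
            p.sub_(lr * g)

print("most helpful examples:", contrib.argsort(descending=True)[:5].tolist())
```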
- Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective.
This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
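The key observation above, that noisy labels suffice, can be illustrated with a toy amortized attributor: fit a regressor to cheap, unbiased but high-variance attribution estimates, and it approximates the clean attribution function. Everything below is synthetic and illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, d = 500, 8
X = rng.normal(size=(n, d))
true_value = X[:, 0] - 0.5 * X[:, 1]                      # pretend ground-truth attribution
noisy_labels = true_value + rng.normal(0, 1.0, size=n)    # cheap, noisy estimates

# Amortization: fit a regressor on the noisy labels; with unbiased noise,
# the regressor converges toward the clean attribution function.
amortized = GradientBoostingRegressor().fit(X, noisy_labels)

X_new = rng.normal(size=(5, d))
print("amortized scores:", amortized.predict(X_new).round(3))
print("true scores:     ", (X_new[:, 0] - 0.5 * X_new[:, 1]).round(3))
```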
- Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
We propose a straightforward and efficient Shapley estimator, SimSHAP, by eliminating redundant techniques.
In our analysis of existing approaches, we observe that estimators can be unified as a linear transformation of randomly summed values from feature subsets.
Our experiments validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
arXiv Detail & Related papers (2023-11-02T06:09:24Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
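A bare-bones sketch of plain distribution matching, the baseline this paper improves on: optimize a small synthetic set so its feature statistics under randomly initialized embeddings match the real data's, with no inner training loop. Shapes and the feature map are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
real = torch.randn(1000, 32) + 2.0               # stand-in for a real dataset
syn = torch.randn(10, 32, requires_grad=True)    # tiny synthetic set to learn
opt = torch.optim.Adam([syn], lr=0.1)

for step in range(300):
    # Distribution matching uses randomly initialized feature extractors
    # rather than trained ones; here a fresh random projection per step.
    W = torch.randn(32, 64)
    loss = (real.mm(W).relu().mean(0) - syn.mm(W).relu().mean(0)).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("real mean:", real.mean().item(), "synthetic mean:", syn.mean().item())
```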
- Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values [10.53825744656208]
We propose TS-DShapley, an algorithm that reduces the computational cost of Shapley-based data valuation.
Experiments applying TS-DShapley to select data for fine-tuning BERT-based language models on benchmark natural language understanding (NLU) datasets show that TS-DShapley outperforms existing data selection methods.
arXiv Detail & Related papers (2023-06-16T20:07:38Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Non-iterative optimization of pseudo-labeling thresholds for training object detection models from multiple datasets [2.1485350418225244]
We propose a non-iterative method to optimize pseudo-labeling thresholds for learning object detection from a collection of low-cost datasets.
We experimentally demonstrate that our proposed method achieves an mAP comparable to that of grid search on the COCO and VOC datasets.
arXiv Detail & Related papers (2022-10-19T00:31:34Z)
- Efficient Learning of Accurate Surrogates for Simulations of Complex Systems [0.0]
We introduce an online learning method whose sampling strategy ensures that all turning points on the model response surface are included in the training data.
We apply our method to simulations of nuclear matter to demonstrate that highly accurate surrogates can be reliably auto-generated.
arXiv Detail & Related papers (2022-07-11T20:51:11Z)
- Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity-regularized models and apply it to shared-memory and distributed-memory architectures, respectively.
We prove that the proposed method achieves a linear convergence rate with lower overall complexity and can eliminate almost all inactive features in a finite number of iterations, almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z)
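For intuition, a static (non-distributed, non-dynamic) safe screening rule for the lasso, the classic SAFE test of El Ghaoui et al., looks like the sketch below; DDSS generalizes the idea to dynamic, distributed settings. The data and regularization level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:5] = 3 * rng.normal(size=5)          # only the first 5 features matter
y = X @ beta + 0.1 * rng.normal(size=n)

corr = np.abs(X.T @ y)                     # |x_j^T y| for every feature
lam_max = corr.max()                       # smallest lambda with an all-zero solution
lam = 0.9 * lam_max                        # screening is most effective near lam_max

# SAFE rule: feature j is provably inactive at lambda if
#   |x_j^T y| < lam - ||x_j|| * ||y|| * (lam_max - lam) / lam_max
bound = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
inactive = corr < bound
print(f"safely screened out {inactive.sum()} of {d} features before solving")
```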
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
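A compact sketch of the supervised framing: generate synthetic populations with known NDV, featurize each random sample by its frequency profile (how many values occur once, twice, ...), and train a regressor to predict log-NDV. The feature choice and data generator are assumptions for illustration, not the paper's design.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def sample_features(sample):
    """Featurize a sample by its size, distinct count, and frequency profile."""
    freq = Counter(Counter(sample).values())  # f_k = #values appearing k times
    return [len(sample), len(set(sample))] + [freq.get(k, 0) for k in range(1, 6)]

# Training data: synthetic populations with known NDV and Zipf-like skew.
X_feat, y_ndv = [], []
for _ in range(1000):
    ndv = int(rng.integers(10, 5000))
    probs = 1.0 / np.arange(1, ndv + 1) ** rng.uniform(0.0, 1.5)
    probs /= probs.sum()
    sample = rng.choice(ndv, size=500, p=probs)
    X_feat.append(sample_features(sample))
    y_ndv.append(ndv)

model = GradientBoostingRegressor().fit(X_feat, np.log(y_ndv))

test = rng.choice(1000, size=500)             # uniform population, true NDV = 1000
print("estimated NDV:", int(np.exp(model.predict([sample_features(test)])[0])))
```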