Fast-DataShapley: Neural Modeling for Training Data Valuation
- URL: http://arxiv.org/abs/2506.05281v2
- Date: Fri, 13 Jun 2025 02:29:20 GMT
- Title: Fast-DataShapley: Neural Modeling for Training Data Valuation
- Authors: Haifeng Sun, Yu Xiong, Runze Wu, Xinyu Cai, Changjie Fan, Lan Zhang, Xiang-Yang Li
- Abstract summary: We propose Fast-DataShapley, a one-pass training method to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data.
- Score: 40.630258021732544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers' legitimate rights and fairly reward them for their contributions. The Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent works based on Shapley values attempt to mitigate the computational complexity with approximation algorithms. However, they must retrain for each test sample, incurring intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two aspects: the approximate calculation of the utility function and the group calculation of the training data. We analyze time complexity to show the efficiency of our methods. Experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, performance improves by more than 2.5 times, and the explainer's training speed can be increased by two orders of magnitude.
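The abstract builds on the weighted least squares characterization of the Shapley value. Below is a minimal sketch of that characterization applied to training-data valuation, not the authors' Fast-DataShapley implementation: `utility(indices)` is an assumed placeholder that trains on the chosen subset of training points and returns a validation score, and the sampling scheme and all names are illustrative.

```python
# Minimal sketch, assuming a user-supplied utility function; not the paper's code.
# Shapley values are estimated by weighted least squares over sampled coalitions
# of training points (KernelSHAP-style regression).
import numpy as np
from math import comb

def shapley_kernel_weight(n, s):
    """Shapley kernel weight for a coalition of size s out of n training points."""
    if s == 0 or s == n:
        return 1e6  # large weight approximates the exact constraints at the extremes
    return (n - 1) / (comb(n, s) * s * (n - s))

def wls_data_shapley(n, utility, num_coalitions=2048, seed=0):
    """Regress coalition utilities on membership indicators under Shapley kernel weights."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((num_coalitions, n))  # coalition membership indicators
    y = np.zeros(num_coalitions)       # observed utilities
    w = np.zeros(num_coalitions)       # Shapley kernel weights
    for i in range(num_coalitions):
        s = int(rng.integers(0, n + 1))
        members = rng.choice(n, size=s, replace=False)
        Z[i, members] = 1.0
        y[i] = utility(members)        # assumed: validation score after training on `members`
        w[i] = shapley_kernel_weight(n, s)
    # Weighted least squares with an intercept for the empty-coalition utility.
    A = np.hstack([np.ones((num_coalitions, 1)), Z])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # one approximate Shapley value per training point
```

Per the abstract, Fast-DataShapley's contribution is to amortize this kind of regression by training a reusable explainer model once, so that new test samples require no re-solving; the sketch above only illustrates the underlying objective.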
Related papers
- Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value [18.858879113762917]
We propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently. Our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data (the retraining-based permutation-sampling baseline it sidesteps is sketched after this list). This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.
arXiv Detail & Related papers (2025-05-22T02:46:03Z)
- Same accuracy, twice as fast: continuous training surpasses retraining from scratch [40.678628069564745]
Continual learning aims to enable models to adapt to new datasets without losing performance on previously learned data. In some cases, good performance on both datasets is achieved by abandoning the model trained on the previous data and re-training a new model from scratch on both datasets. Our evaluation framework quantifies the computational savings of such methods while maintaining or exceeding the performance of training from scratch.
arXiv Detail & Related papers (2025-02-28T15:28:12Z)
- OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruction-tuning models leads to an imbalanced computation load across different devices. We rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Our method's efficacy and generalizability are further validated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits [7.335578524351567]
Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset.
Data Shapley is a common theoretically guaranteed method to evaluate the contribution of each instance to model performance.
We propose an iterative method to quickly identify a subset of instances with low data Shapley values by using the thresholding bandit algorithm.
arXiv Detail & Related papers (2024-02-13T04:17:48Z)
- Accelerated Shapley Value Approximation for Data Evaluation [3.707457963532597]
We show that Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
arXiv Detail & Related papers (2023-11-09T13:15:36Z)
- Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
We propose a straightforward and efficient Shapley estimator, SimSHAP, by eliminating redundant techniques.
In our analysis of existing approaches, we observe that estimators can be unified as a linear transformation of randomly summed values from feature subsets.
Our experiments validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
arXiv Detail & Related papers (2023-11-02T06:09:24Z)
- Benchmarking Neural Network Training Algorithms [52.890134877995195]
Training algorithms are an essential part of every deep learning pipeline. As a community, we are unable to reliably identify training algorithm improvements. We introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware.
arXiv Detail & Related papers (2023-06-12T15:21:02Z)
- Optimizing Data Shapley Interaction Calculation from O(2^n) to O(t n^2) for KNN models [2.365702128814616]
"STI-KNN" is an innovative algorithm that calculates the exact pair-interaction Shapley values for KNN models in O(t n2) time.
By using STI-KNN, we can efficiently and accurately evaluate the value of individual data points, leading to improved training outcomes and ultimately enhancing the effectiveness of artificial intelligence applications.
arXiv Detail & Related papers (2023-04-02T06:15:19Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
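For contrast with the regression-based view sketched under the main abstract, here is a hedged sketch of the classical retraining-based Monte Carlo (permutation-sampling) data Shapley estimator that several entries in this list, such as Unlearning Shapley and In-Run Data Shapley, aim to avoid or accelerate. `train_and_score` is an assumed placeholder, not an API from any of the cited papers.

```python
# Hedged sketch of permutation-sampling data Shapley; the cited papers replace
# or accelerate the repeated retraining performed here.
# `train_and_score(indices)` is an assumed placeholder returning the validation
# score of a model trained on the given training indices.
import numpy as np

def monte_carlo_data_shapley(n, train_and_score, num_permutations=200, seed=0):
    """Average each training point's marginal contribution over random permutations."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    for _ in range(num_permutations):
        perm = rng.permutation(n)
        prev_score = train_and_score([])        # baseline utility of the empty set
        for k in range(1, n + 1):
            score = train_and_score(perm[:k])   # retrain with one more point included
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / num_permutations
```

Each permutation costs n retraining runs, which is why estimators of this kind become intolerable at scale and why the approaches above trade exact retraining for unlearning, in-run attribution, bandit-based thresholding, or a reusable explainer model.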