Data Shapley in One Training Run
- URL: http://arxiv.org/abs/2406.11011v2
- Date: Sat, 29 Jun 2024 23:05:32 GMT
- Title: Data Shapley in One Training Run
- Authors: Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia,
- Abstract summary: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
- Score: 88.59484417202454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
Related papers
- Phased Data Augmentation for Training a Likelihood-Based Generative Model with Limited Data [0.0]
Generative models excel in creating realistic images, yet their dependency on extensive datasets for training presents significant challenges.
Current data-efficient methods largely focus on GAN architectures, leaving a gap in training other types of generative models.
"phased data augmentation" is a novel technique that addresses this gap by optimizing training in limited data scenarios without altering the inherent data distribution.
arXiv Detail & Related papers (2023-05-22T03:38:59Z) - Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce textitCLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z) - TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with Randomly-trained After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differenti models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z) - Imputing Knowledge Tracing Data with Subject-Based Training via LSTM
Variational Autoencoders Frameworks [6.24828623162058]
We adopt a subject-based training method to split and impute data by student IDs instead of row number splitting.
We leverage two existing deep generative frameworks, namely variational Autoencoders (VAE) and Longitudinal Variational Autoencoders (LVAE)
We demonstrate that the generated data from LSTM-VAE and LSTM-LVAE can boost the original model performance by about 50%.
arXiv Detail & Related papers (2023-02-24T21:56:03Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Building Resilience to Out-of-Distribution Visual Data via Input
Optimization and Model Finetuning [13.804184845195296]
We propose a preprocessing model that learns to optimise input data for a specific target vision model.
We investigate several out-of-distribution scenarios in the context of semantic segmentation for autonomous vehicles.
We demonstrate that our approach can enable performance on such data comparable to that of a finetuned model.
arXiv Detail & Related papers (2022-11-29T14:06:35Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - How to Learn when Data Gradually Reacts to Your Model [10.074466859579571]
We propose a new algorithm, Stateful Performative Gradient Descent (Stateful PerfGD), for minimizing the performative loss even in the presence of these effects.
Our experiments confirm that Stateful PerfGD substantially outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2021-12-13T22:05:26Z) - Self Training with Ensemble of Teacher Models [8.257085583227695]
In order to train robust deep learning models, large amounts of labelled data is required.
In the absence of such large repositories of labelled data, unlabeled data can be exploited for the same.
Semi-Supervised learning aims to utilize such unlabeled data for training classification models.
arXiv Detail & Related papers (2021-07-17T09:44:09Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $rho$-gap.
We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.