Related papers: Data Shapley in One Training Run

Data Shapley in One Training Run

URL: http://arxiv.org/abs/2406.11011v2
Date: Sat, 29 Jun 2024 23:05:32 GMT
Title: Data Shapley in One Training Run
Authors: Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
Abstract summary: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. Existing approaches require re-training models on different data subsets, which is computationally intensive. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
Score: 88.59484417202454
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.

Related papers

Model State Arithmetic for Machine Unlearning [43.773053236733425]
We propose a new algorithm, MSA, for estimating and undoing the influence of datapoints. Our experimental results demonstrate that MSA consistently outperforms existing machine unlearning algorithms.
arXiv Detail & Related papers (2025-06-26T02:16:16Z)
Effective Data Pruning through Score Extrapolation [40.61665742457229]
We introduce a novel importance score extrapolation framework that requires training on only a small subset of data. We present two initial approaches in this framework to accurately predict sample importance for the entire dataset using patterns learned from this minimal subset. Our results indicate that score extrapolation is a promising direction to scale expensive score calculation methods, such as pruning, data attribution, or other tasks.
arXiv Detail & Related papers (2025-06-10T17:38:49Z)
Optimize Cardinality Estimation Model Pretraining by Simplifying the Training Datasets [0.0]
We introduce a simplified training dataset, which has been reduced to a fraction of the size of existing pretraining datasets. Sufficient experimental results demonstrate that the pre-trained cardinality estimator based on this simplified dataset can still achieve comparable performance to existing models in zero-shot setups.
arXiv Detail & Related papers (2025-02-20T08:06:16Z)
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection [37.65064631532493]
Finetuning a pretrained model to perform unsupervised prediction on data from a target domain presents two challenges. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
arXiv Detail & Related papers (2025-02-09T21:44:27Z)
The interplay between domain specialization and model size [8.653321928148547]
We investigate the interplay between domain and model size during continued pretraining under compute-constrained scenarios. Our goal is to identify an optimal training regime for this scenario and detect patterns in this interplay that can be generalized across different model sizes and domains.
arXiv Detail & Related papers (2025-01-03T19:28:53Z)
Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training. We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
Provable unlearning in topic modeling and downstream tasks [36.571324268874264]
Provable guarantees for unlearning are often limited to supervised learning settings. We provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm. We show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.
arXiv Detail & Related papers (2024-11-19T16:04:31Z)
Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the effect of a small "forget set" of training data on a pre-trained machine learning model -- has recently attracted interest. Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data. We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging [27.445079398772904]
We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus. We find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data.
arXiv Detail & Related papers (2024-10-21T06:03:49Z)
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress. LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset. Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
Phased Data Augmentation for Training a Likelihood-Based Generative Model with Limited Data [0.0]
Generative models excel in creating realistic images, yet their dependency on extensive datasets for training presents significant challenges. Current data-efficient methods largely focus on GAN architectures, leaving a gap in training other types of generative models. "Phased data augmentation" is a novel technique that addresses this gap by optimizing training in limited data scenarios without altering the inherent data distribution.
arXiv Detail & Related papers (2023-05-22T03:38:59Z)
Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models. We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models. Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
Imputing Knowledge Tracing Data with Subject-Based Training via LSTM Variational Autoencoders Frameworks [6.24828623162058]
We adopt a subject-based training method to split and impute data by student IDs instead of row number splitting. We leverage two existing deep generative frameworks, namely Variational Autoencoders (VAE) and Longitudinal Variational Autoencoders (LVAE). We demonstrate that the generated data from LSTM-VAE and LSTM-LVAE can boost the original model performance by about 50%.
arXiv Detail & Related papers (2023-02-24T21:56:03Z)
Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis. We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
Self Training with Ensemble of Teacher Models [8.257085583227695]
In order to train robust deep learning models, large amounts of labelled data are required. In the absence of such large repositories of labelled data, unlabeled data can be exploited instead. Semi-supervised learning aims to utilize such unlabeled data for training classification models.
arXiv Detail & Related papers (2021-07-17T09:44:09Z)
How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance. We formulate a quality measure for the data set, which we refer to as $\rho$-gap. We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.