Data Selection for Fine-tuning Large Language Models Using Transferred
Shapley Values
- URL: http://arxiv.org/abs/2306.10165v1
- Date: Fri, 16 Jun 2023 20:07:38 GMT
- Title: Data Selection for Fine-tuning Large Language Models Using Transferred
Shapley Values
- Authors: Stephanie Schoch, Ritwick Mishra, Yangfeng Ji
- Abstract summary: We propose TS-DShapley, an algorithm that reduces the computational cost of Shapley-based data valuation.
Experiments applying TS-DShapley to select data for fine-tuning BERT-based language models on benchmark natural language understanding (NLU) datasets show that TS-DShapley outperforms existing data selection methods.
- Score: 10.53825744656208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Shapley values have been shown to be highly effective for
identifying harmful training instances, dataset size and model complexity
constraints limit the ability to apply Shapley-based data valuation to
fine-tuning large pre-trained language models. To address this, we propose
TS-DShapley, an algorithm that reduces the computational cost of Shapley-based data
valuation through: 1) an efficient sampling-based method that aggregates
Shapley values computed from subsets for valuation of the entire training set,
and 2) a value transfer method that leverages value information extracted from
a simple classifier trained using representations from the target language
model. Our experiments applying TS-DShapley to select data for fine-tuning
BERT-based language models on benchmark natural language understanding (NLU)
datasets show that TS-DShapley outperforms existing data selection methods.
Further, TS-DShapley can filter fine-tuning data to increase language model
performance compared to training with the full fine-tuning dataset.
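The two components admit a compact illustration. The following Python sketch is hypothetical (not the authors' released code): a logistic-regression proxy over frozen language-model representations stands in for the simple classifier, truncated Monte Carlo permutation sampling estimates per-instance Shapley values within sampled chunks, and chunk-level values are averaged to score the full training set. All function names and sampling parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility(X, y, idx, X_val, y_val):
    """Validation accuracy of a cheap proxy classifier trained on rows idx."""
    if len(np.unique(y[idx])) < 2:          # degenerate subset carries no signal
        return 0.0
    clf = LogisticRegression(max_iter=200).fit(X[idx], y[idx])
    return clf.score(X_val, y_val)

def permutation_shapley(X, y, X_val, y_val, n_perm=20, rng=None):
    """Truncated Monte Carlo estimate of per-instance Shapley values."""
    rng = rng or np.random.default_rng(0)
    n, values = len(y), np.zeros(len(y))
    for _ in range(n_perm):
        perm, prev_u = rng.permutation(n), 0.0
        for k in range(1, n + 1):           # marginal gain of each new instance
            u = utility(X, y, perm[:k], X_val, y_val)
            values[perm[k - 1]] += u - prev_u
            prev_u = u
    return values / n_perm

def ts_dshapley(X, y, X_val, y_val, chunk_size=100, n_chunks=5, seed=0):
    """Average Shapley values computed on sampled chunks of the full set."""
    rng = np.random.default_rng(seed)
    values, counts = np.zeros(len(y)), np.zeros(len(y))
    for _ in range(n_chunks):
        idx = rng.choice(len(y), size=chunk_size, replace=False)
        values[idx] += permutation_shapley(X[idx], y[idx], X_val, y_val, rng=rng)
        counts[idx] += 1
    return values / np.maximum(counts, 1)   # unsampled points default to 0
```

In this reading, `X` would hold frozen embeddings from the target language model; ranking the training set by the returned values and keeping the top fraction is the selection step.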
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL).
We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value [18.858879113762917]
We propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently.
Our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data.
This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.
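A hedged sketch of the mechanism, assuming gradient ascent as the unlearning primitive (the paper's unlearning method may differ): start from the model trained on all data, unlearn the complement of a sampled subset instead of retraining from scratch, and estimate Shapley values by Monte Carlo over such subsets.

```python
import copy
import numpy as np
import torch

def unlearn(model, X_rm, y_rm, lr=1e-3, steps=5):
    """Approximate unlearning: gradient *ascent* on the removed points' loss."""
    loss_fn = torch.nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-loss_fn(model(X_rm), y_rm)).backward()   # negated loss -> ascent
        opt.step()
    return model

def mc_unlearning_shapley(full_model, X, y, X_val, y_val, n_samples=200, seed=0):
    """value(i) ~ E[U(S + i) - U(S)]; U(.) comes from unlearning, not retraining."""
    rng = np.random.default_rng(seed)
    n = len(y)
    vals, counts = np.zeros(n), np.zeros(n)
    acc = lambda m: (m(X_val).argmax(1) == y_val).float().mean().item()
    for _ in range(n_samples):
        i = int(rng.integers(n))
        in_S = rng.random(n) < 0.5
        in_S[i] = False                               # sample S without i
        out = torch.as_tensor(np.where(~in_S)[0])     # complement of S
        out_i = out[out != i]                         # complement of S + {i}
        u_S = acc(unlearn(copy.deepcopy(full_model), X[out], y[out]))
        u_Si = acc(unlearn(copy.deepcopy(full_model), X[out_i], y[out_i]))
        vals[i] += u_Si - u_S
        counts[i] += 1
    return vals / np.maximum(counts, 1)
```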
arXiv Detail & Related papers (2025-05-22T02:46:03Z)
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
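A minimal sketch of the prediction step, with my own choice of subset features and kernel (the paper's featurization likely differs): encode each subset as a membership indicator vector, fit a GP on the utilities already paid for by retraining, and predict any other subset's utility cheaply, with an uncertainty estimate.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

n_points = 20
rng = np.random.default_rng(0)

# Pretend we already paid the retraining cost for 30 subsets.
evaluated_subsets = rng.random((30, n_points)) < 0.5           # indicator vectors
evaluated_utilities = evaluated_subsets.mean(axis=1) + rng.normal(0, 0.01, 30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), alpha=1e-4)
gp.fit(evaluated_subsets.astype(float), evaluated_utilities)

# Any other subset's utility is now a cheap prediction, with uncertainty.
new_subset = (rng.random((1, n_points)) < 0.5).astype(float)
mean, std = gp.predict(new_subset, return_std=True)
print(f"predicted utility: {mean[0]:.3f} +/- {std[0]:.3f}")
```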
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
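A minimal sketch of the clustering-and-selection step, under my own assumptions about the feature pipeline and expert pool (random vectors stand in for per-sample gradients, and the two scoring functions and the per-cluster choice rule are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
grad_vectors = rng.normal(size=(1000, 64))   # stand-in for per-sample gradients

# Pseudo-skill clusters from gradient-based sample vectors.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(grad_vectors)

# Pool of selector "experts": each scores samples; higher = more useful.
selectors = {
    "norm":    lambda g: np.linalg.norm(g, axis=1),   # large-gradient samples
    "diverse": lambda g: -np.abs(g @ g.mean(0)),      # far from the mean direction
}

selected = []
for c in range(8):
    idx = np.where(clusters == c)[0]
    # Hypothetical per-cluster choice: pick the selector with the most spread-out
    # scores (a stand-in for validating each expert's downstream benefit).
    best = max(selectors, key=lambda k: selectors[k](grad_vectors[idx]).std())
    top = idx[np.argsort(-selectors[best](grad_vectors[idx]))[:25]]
    selected.extend(top)
```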
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning [0.0]
We propose CHG Shapley, which approximates the utility of each data subset on model accuracy during a single model training.
We employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data.
arXiv Detail & Related papers (2024-06-17T16:48:31Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
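A hedged, first-order sketch of the "in-run" idea (the paper's estimator is more refined and avoids per-example gradient passes; everything below is a simplification of mine): during ordinary training, each example's step contribution is approximated by the inner product of its gradient with the validation gradient, scaled by the learning rate, and accumulated over the run.

```python
import torch

def step_contributions(model, loss_fn, xb, yb, x_val, y_val, lr):
    """Per-example first-order contribution at one SGD step."""
    # Validation gradient at the current parameters.
    val_loss = loss_fn(model(x_val), y_val)
    g_val = torch.autograd.grad(val_loss, list(model.parameters()))
    contribs = []
    for i in range(len(xb)):
        loss_i = loss_fn(model(xb[i:i+1]), yb[i:i+1])
        g_i = torch.autograd.grad(loss_i, list(model.parameters()))
        # First-order estimate: a step on example i changes the validation
        # loss by roughly -lr * <g_i, g_val>.
        contribs.append(lr * sum((a * b).sum() for a, b in zip(g_i, g_val)).item())
    return contribs  # accumulate these over training to get in-run values
```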
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits [7.335578524351567]
Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset.
Data Shapley is a common theoretically guaranteed method to evaluate the contribution of each instance to model performance.
We propose an iterative method to quickly identify a subset of instances with low data Shapley values by using the thresholding bandit algorithm.
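A small sketch of the bandit view, with illustrative parameters: each training instance is an arm, a pull draws one noisy sample of its marginal contribution, and an APT-style index concentrates pulls on arms whose running means sit closest to the threshold, so low-value instances are separated quickly.

```python
import numpy as np

def thresholding_bandit(sample_value, n_arms, threshold, budget, rng=None):
    """sample_value(i) -> one noisy marginal-contribution sample for arm i."""
    rng = rng or np.random.default_rng(0)
    sums, pulls = np.zeros(n_arms), np.zeros(n_arms)
    for i in range(n_arms):                 # initialize: one pull per arm
        sums[i] += sample_value(i)
        pulls[i] += 1
    for _ in range(budget - n_arms):
        means = sums / pulls
        # APT index: arms near the threshold with few pulls are most ambiguous.
        idx = np.argmin(np.sqrt(pulls) * np.abs(means - threshold))
        sums[idx] += sample_value(idx)
        pulls[idx] += 1
    return np.where(sums / pulls < threshold)[0]   # instances flagged for removal
```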
arXiv Detail & Related papers (2024-02-13T04:17:48Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
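An illustrative sketch of the datamodel idea on synthetic data (the ridge-regression setup is my assumption, not DsDm's exact estimator): fit a linear map from subset-inclusion indicators to target-task loss across many training runs, then read off which datapoints most reduce that loss.

```python
import numpy as np
from sklearn.linear_model import Ridge

n_points, n_runs = 50, 200
rng = np.random.default_rng(0)

masks = (rng.random((n_runs, n_points)) < 0.5).astype(float)   # subset indicators
true_effect = rng.normal(0, 1, n_points)                       # synthetic ground truth
target_loss = masks @ true_effect + rng.normal(0, 0.1, n_runs) # stand-in for real runs

datamodel = Ridge(alpha=1.0).fit(masks, target_loss)
helpful = np.argsort(datamodel.coef_)[:10]   # most loss-reducing datapoints
```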
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Accelerated Shapley Value Approximation for Data Evaluation [3.707457963532597]
We show that the Shapley value of data points can be approximated more efficiently by leveraging structural properties of machine learning problems.
Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation.
arXiv Detail & Related papers (2023-11-09T13:15:36Z)
- Efficient Shapley Values Estimation by Amortization for Text Classification [66.7725354593271]
We develop an amortized model that directly predicts each input feature's Shapley Value without additional model evaluations.
Experimental results on two text classification datasets demonstrate that our amortized model estimates Shapley Values accurately with up to 60 times speedup.
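A minimal sketch of the amortization, with hypothetical shapes and a stand-in regressor: expensive Shapley values are precomputed once for a pool of inputs, and a small network learns to map token representations directly to per-token values, so explaining a new input costs a single forward pass.

```python
import torch
import torch.nn as nn

d_model = 128
amortizer = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(amortizer.parameters(), lr=1e-3)

# Stand-in training pool: token embeddings with precomputed Shapley targets.
embeddings = torch.randn(512, 32, d_model)      # (examples, tokens, dim)
shapley_targets = torch.randn(512, 32)          # from a slow exact/sampled method

for _ in range(100):
    opt.zero_grad()
    pred = amortizer(embeddings).squeeze(-1)    # (examples, tokens)
    loss = nn.functional.mse_loss(pred, shapley_targets)
    loss.backward()
    opt.step()

# At inference: per-token attributions in one pass, no extra model evaluations.
new_pred = amortizer(torch.randn(1, 32, d_model)).squeeze(-1)
```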
arXiv Detail & Related papers (2023-05-31T16:19:13Z)
- Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
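A hedged sketch of the law's shape (the exponential-decay form and the constant below follow my reading of the paper; treat the exact parameterization as an assumption): repeated tokens contribute "effective data" whose value saturates as repetitions grow.

```python
import math

def effective_tokens(unique_tokens: float, repetitions: float, r_star: float = 15.0) -> float:
    """Effective data D' = U + U * r_star * (1 - exp(-R / r_star)).
    r_star is a fitted constant (on the order of 15 in the paper); R counts
    extra repetitions beyond the first pass over the unique tokens U."""
    return unique_tokens * (1 + r_star * (1 - math.exp(-repetitions / r_star)))

# Repeating 100B unique tokens for 4 epochs (R = 3) is worth nearly 4 epochs
# of fresh data, consistent with the "negligible change up to 4 epochs" finding.
print(effective_tokens(100e9, 3) / 100e9)    # ~3.7 fresh-data equivalents
print(effective_tokens(100e9, 40) / 100e9)   # saturates well below 41
```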
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
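A sketch of the supervised formulation with my own feature and model choices: from a random sample of a column, build a frequency profile (how many values occur exactly once, twice, ...), and train a regressor that maps (profile, sampling rate) to the true distinct count, using synthetic columns where the ground truth is known.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import GradientBoostingRegressor

def profile(sample, max_freq=10):
    """f_i = number of distinct values appearing exactly i times in the sample."""
    counts = Counter(Counter(sample).values())
    return [counts.get(i, 0) for i in range(1, max_freq + 1)]

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(2000):                            # synthetic training columns
    ndv = rng.integers(10, 5000)
    column = rng.zipf(1.5, size=20000) % ndv     # skewed column, known NDV cap
    sample = rng.choice(column, size=1000)
    X.append(profile(sample) + [1000 / 20000])   # features + sampling rate
    y.append(len(np.unique(column)))             # ground-truth NDV label

model = GradientBoostingRegressor().fit(X, y)    # the learned estimator
```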
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets [0.0]
Language models can generate harmful and biased outputs and exhibit undesirable behavior.
We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted datasets.
We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.
arXiv Detail & Related papers (2021-06-18T19:38:28Z)
- Bayesian Active Learning with Pretrained Language Models [9.161353418331245]
Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data.
Previous AL approaches have been limited to task-specific models that are trained from scratch at each iteration.
We introduce BALM; Bayesian Active Learning with pretrained language models.
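A minimal sketch of one Bayesian acquisition step, under assumptions of mine (MC dropout as the approximate posterior; a generic torch classifier stands in for the pretrained LM plus task head): keep dropout stochastic at inference, average several forward passes, and query the pool examples with the highest predictive entropy.

```python
import torch

def mc_dropout_acquire(model, pool_inputs, n_query=8, n_samples=10):
    """Select the n_query most uncertain pool examples for annotation."""
    model.train()                      # keep dropout stochastic at inference
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(pool_inputs), dim=-1) for _ in range(n_samples)
        ]).mean(0)                     # MC-averaged predictive distribution
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return entropy.topk(n_query).indices
```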
arXiv Detail & Related papers (2021-04-16T19:07:31Z)