DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and
Diffusion Models
- URL: http://arxiv.org/abs/2310.00902v3
- Date: Wed, 13 Mar 2024 14:27:46 GMT
- Title: DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and
Diffusion Models
- Authors: Yongchan Kwon, Eric Wu, Kevin Wu, James Zou
- Abstract summary: We propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models.
Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA.
In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores.
- Score: 31.65198592956842
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Quantifying the impact of training data points is crucial for understanding
the outputs of machine learning models and for improving the transparency of
the AI pipeline. The influence function is a principled and popular data
attribution method, but its computational cost often makes it challenging to
use. This issue becomes more pronounced in the setting of large language models
and text-to-image models. In this work, we propose DataInf, an efficient
influence approximation method that is practical for large-scale generative AI
models. Leveraging an easy-to-compute closed-form expression, DataInf
outperforms existing influence computation algorithms in terms of computational
and memory efficiency. Our theoretical analysis shows that DataInf is
particularly well-suited for parameter-efficient fine-tuning techniques such as
LoRA. Through systematic empirical evaluations, we show that DataInf accurately
approximates influence scores and is orders of magnitude faster than existing
methods. In applications to RoBERTa-large, Llama-2-13B-chat, and
stable-diffusion-v1.5 models, DataInf effectively identifies the most
influential fine-tuning examples better than other approximate influence
scores. Moreover, it can help to identify which data points are mislabeled.
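The abstract does not spell out the closed-form expression. As a minimal sketch, assume it comes from two steps: swapping the order of averaging and inversion in a damped, gradient-outer-product approximation of the Hessian, then inverting each rank-one term with the Sherman-Morrison formula. The function names, the sign convention (influence as the negative of gradient times inverse Hessian times gradient), and the single-block treatment are illustrative assumptions, not the authors' code.

```python
import numpy as np

def exact_influence(train_grads, val_grad, lam):
    """Reference: invert the damped matrix (1/n) sum_i g_i g_i^T + lam*I directly."""
    n, d = train_grads.shape
    hessian = train_grads.T @ train_grads / n + lam * np.eye(d)
    # Influence of training point k (assumed convention): -g_k^T H^{-1} v
    return -train_grads @ np.linalg.solve(hessian, val_grad)

def swapped_inverse_influence(train_grads, val_grad, lam):
    """Approximate H^{-1} v by (1/n) sum_i (g_i g_i^T + lam*I)^{-1} v.

    Each term has a closed form via Sherman-Morrison:
    (g g^T + lam*I)^{-1} v = (v - g * (g^T v) / (lam + g^T g)) / lam,
    so no d-by-d matrix is ever formed or inverted.
    """
    dots = train_grads @ val_grad                  # g_i^T v, shape (n,)
    sq_norms = np.sum(train_grads ** 2, axis=1)    # g_i^T g_i, shape (n,)
    coef = dots / (lam + sq_norms)
    inv_hess_v = (val_grad - (coef[:, None] * train_grads).mean(axis=0)) / lam
    return -train_grads @ inv_hess_v

# Toy usage with random "gradients" (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
G = rng.normal(size=(5, 4))   # 5 training gradients of dimension 4
v = rng.normal(size=4)        # validation gradient
scores = swapped_inverse_influence(G, v, lam=0.1)
```

The approximate version costs O(nd) time and memory for n gradients of dimension d, whereas the reference version needs a d-by-d matrix and a linear solve. The two agree exactly when n = 1 and differ in general.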
Related papers
- HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation [37.62285675595782]
We propose HyperINF, an efficient and accurate influence function approximation method.
We incorporate the generalized Fisher information matrix (GFIM) as a low-rank approximation of the Hessian matrix.
On LoRA-tuned models, HyperINF achieves superior downstream performance with minimal memory and computational overhead.
arXiv Detail & Related papers (2024-10-07T14:42:45Z)
- Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models [43.26028399395612]
We propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods.
First, we propose a Fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process.
Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA.
arXiv Detail & Related papers (2024-09-30T18:12:18Z)
- Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning [19.962212551963383]
Active Learning (AL) allows models to learn interactively from user feedback.
This paper introduces a counterfactual data augmentation approach to AL.
arXiv Detail & Related papers (2024-08-07T14:55:04Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- EraseDiff: Erasing Data Influence in Diffusion Models [51.225365010401006]
We introduce EraseDiff, an unlearning algorithm to address concerns related to data memorization.
Our approach formulates the unlearning task as a constrained optimization problem.
We show that EraseDiff effectively preserves the model's utility, efficacy, and efficiency.
arXiv Detail & Related papers (2024-01-11T09:30:36Z)
- Studying Large Language Model Generalization with Influence Functions [29.577692176892135]
Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a sequence were added to the training set?
We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to large language models (LLMs) with up to 52 billion parameters.
We investigate generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior.
arXiv Detail & Related papers (2023-08-07T04:47:42Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging [112.19994766375231]
Influence functions approximate the 'influences' of training data-points for test predictions.
We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time.
Our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors.
arXiv Detail & Related papers (2020-12-31T18:02:34Z)
- Influence Functions in Deep Learning Are Fragile [52.31375893260445]
Influence functions approximate the effect of training samples on test-time predictions.
Influence estimates are fairly accurate for shallow networks.
Hessian regularization is important for obtaining high-quality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z)