What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
- URL: http://arxiv.org/abs/2405.13954v1
- Date: Wed, 22 May 2024 19:39:05 GMT
- Title: What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
- Authors: Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, Eric Xing
- Abstract summary: Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited.
In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability.
We also introduce LogIX, a software package that can transform existing training code into data valuation code with minimal effort.
- Score: 34.99034454081842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation for gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to a 6,500x improvement in throughput and a 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and the 1B-token dataset.
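The structural fact LoGra exploits is that, for a linear layer, each per-sample weight gradient is an outer product of the layer's input activation and output gradient, so projecting each factor separately compresses the gradient without ever materializing the full weight-sized matrix. Below is a minimal PyTorch sketch of that idea for a single layer; the random Gaussian projections and hook bookkeeping are illustrative simplifications, not LogIX's actual API.
```python
# Sketch: low-rank gradient projection for one linear layer (LoGra-style idea).
# Per-sample weight gradient = outer(output-gradient g_i, input-activation a_i),
# so (P g_i) outer (Q a_i) is a k x k compressed gradient with k << d_out, d_in.
import torch

torch.manual_seed(0)
d_in, d_out, k = 512, 256, 8                 # projection rank k is tiny
layer = torch.nn.Linear(d_in, d_out, bias=False)
P = torch.randn(k, d_out) / k ** 0.5         # projects the output-gradient side
Q = torch.randn(k, d_in) / k ** 0.5          # projects the input-activation side

acts, grads = {}, {}
layer.register_forward_hook(lambda m, inp, out: acts.update(a=inp[0].detach()))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0].detach()))

x = torch.randn(4, d_in)                     # batch of 4 samples
layer(x).square().mean().backward()

# Per-sample projected gradients: shape (4, k, k) instead of (4, d_out, d_in).
proj = torch.einsum('ko,bo,li,bi->bkl', P, grads['g'], Q, acts['a'])
print(proj.shape)                            # torch.Size([4, 8, 8])
```
Influence scores can then be computed between these compressed training and query gradients at a small fraction of the memory cost of full per-sample gradients.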
Related papers
- Data value estimation on private gradients [84.966853523107]
For gradient-based machine learning (ML) methods, the de facto differential privacy technique is perturbing the gradients with random noise.
Data valuation attributes the ML performance to the training data and is widely used in privacy-aware applications that require enforcing DP.
We ask whether a larger estimation budget yields more accurate data values under DP, and show that the answer is no with the default approach of injecting i.i.d. random noise into the gradients: the estimation uncertainty of the data value paradoxically scales linearly with the estimation budget.
We propose to instead inject carefully correlated noise to provably remove the linear scaling of estimation uncertainty w.r.t. the budget.
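For intuition only, the numpy sketch below contrasts i.i.d. per-round noise with noise drawn jointly across rounds from a chosen covariance (sampled via a Cholesky factor). The anti-correlated covariance here is an arbitrary illustration of the mechanism and would not by itself satisfy DP; the paper designs correlated noise with provable guarantees.
```python
# Toy contrast: i.i.d. vs. correlated noise across K estimation rounds.
import numpy as np

rng = np.random.default_rng(0)
K, d = 8, 4                                   # K rounds of budget, gradient dim d
true_grads = rng.normal(size=(K, d))

iid_noise = rng.normal(size=(K, d))           # (a) independent noise per round

Sigma = np.eye(K) - np.ones((K, K)) / K       # (b) anti-correlated across rounds
L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(K))  # jitter: Sigma is rank K-1
corr_noise = L @ rng.normal(size=(K, d))      # columns have covariance Sigma

for name, noise in [("i.i.d.", iid_noise), ("correlated", corr_noise)]:
    est = (true_grads + noise).mean(axis=0)   # budget-averaged gradient estimate
    err = np.linalg.norm(est - true_grads.mean(axis=0))
    print(f"{name:>10} noise -> error {err:.4f}")  # correlated noise cancels out
```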
arXiv Detail & Related papers (2024-12-22T13:15:51Z) - Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
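The DPO step itself is the standard preference loss; a minimal sketch, assuming the chosen/rejected log-probabilities have already been gathered from the policy and a frozen reference model (in the paper these pairs carry MCTS-derived step-level signals rather than whole-response labels):
```python
# Standard DPO loss over a batch of (chosen, rejected) preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # implicit rewards: log-odds of the policy relative to the reference model
    chosen_r = beta * (logp_chosen - ref_chosen)
    rejected_r = beta * (logp_rejected - ref_rejected)
    # maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_r - rejected_r).mean()

# toy usage with made-up summed token log-probabilities for 3 pairs
loss = dpo_loss(torch.tensor([-5.0, -6.0, -4.5]),
                torch.tensor([-7.0, -6.5, -8.0]),
                torch.tensor([-5.5, -6.2, -5.0]),
                torch.tensor([-6.8, -6.4, -7.5]))
print(loss)
```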
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named neural dynamic data valuation (NDDV).
Our method has solid theoretical interpretations to accurately identify the data valuation via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value [17.340091573913316]
We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU when there are $10^6$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
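A rough sketch of the out-of-bag recipe, using scikit-learn's BaggingClassifier; the 0/1 correctness score and decision-tree base learner are illustrative choices, not necessarily the paper's exact configuration.
```python
# Data-OOB-style values: for each training point, average a correctness score
# over the base learners whose bootstrap sample excluded that point.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        random_state=0).fit(X, y)

n = len(X)
score_sum, oob_count = np.zeros(n), np.zeros(n)
for est, sampled in zip(bag.estimators_, bag.estimators_samples_):
    oob = np.setdiff1d(np.arange(n), sampled)          # points this tree never saw
    score_sum[oob] += (est.predict(X[oob]) == y[oob])  # 0/1 correctness score
    oob_count[oob] += 1

data_values = score_sum / np.maximum(oob_count, 1)
print(data_values[:5])  # low values flag likely mislabeled or harmful points
```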
arXiv Detail & Related papers (2023-04-16T08:03:58Z) - Fairness-Aware Data Valuation for Supervised Learning [4.874780144224057]
We propose Fairness-Aware Data valuatiOn (FADO) to incorporate fairness concerns into a series of ML-related tasks.
We show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques.
Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline.
arXiv Detail & Related papers (2023-03-29T18:51:13Z) - Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL) [9.010643838773477]
Data imbalance is a crucial issue in big data analysis, particularly when labeled examples are scarce.
Many data balancing methods have been introduced to improve the performance of machine learning algorithms.
We propose EFIDL, a new evaluation framework for imbalanced data learning methods.
arXiv Detail & Related papers (2023-01-26T01:16:02Z) - Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that weights trained on synthetic data are robust against accumulated-error perturbations when the optimization is regularized toward a flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z) - Graph Backup: Data Efficient Backup Exploiting Markovian Transitions [24.765707880860543]
A key to data-efficient RL is good value estimation, but current methods fail to fully utilise the structure of the trajectory data gathered from the environment.
In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation.
Our method, when combined with popular value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks.
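As a toy tabular illustration of the idea (a simplification for intuition, not the paper's actual operator), one can pool all observed transitions into a graph keyed by state and back values up over every observed outgoing edge, instead of only along the single trajectory that visited each state:
```python
# Toy graph backup: pooled transitions (s, a, r, s') form a graph over states;
# values are backed up over all observed successors of each state.
from collections import defaultdict

transitions = [(0, 'a', 0.0, 1), (0, 'b', 0.0, 2),
               (1, 'a', 1.0, 3), (2, 'a', 0.5, 3),
               (3, 'a', 0.0, 3)]
graph = defaultdict(list)
for s, _, r, s2 in transitions:
    graph[s].append((r, s2))

V, gamma = defaultdict(float), 0.9
for _ in range(50):                           # sweep until values stabilize
    for s, edges in graph.items():
        # average over *all* observed successors, not one trajectory's successor
        V[s] = sum(r + gamma * V[s2] for r, s2 in edges) / len(edges)
print(dict(V))
```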
arXiv Detail & Related papers (2022-05-31T14:26:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.