On Training Data Influence of GPT Models
- URL: http://arxiv.org/abs/2404.07840v3
- Date: Thu, 03 Oct 2024 17:56:12 GMT
- Title: On Training Data Influence of GPT Models
- Authors: Yekun Chai, Qingyi Liu, Shuohuan Wang, Yu Sun, Qiwei Peng, Hua Wu,
- Abstract summary: GPTfluence is a novel approach to assess the impact of training examples on the training dynamics of GPT models.
Our approach traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points.
- Score: 37.53037752668756
- License:
- Abstract: Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z) - Feasible Learning [78.6167929413604]
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample.
Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
arXiv Detail & Related papers (2025-01-24T20:39:38Z) - Experience of Training a 1.7B-Parameter LLaMa Model From Scratch [10.39475177812483]
We share insights gained from training DMaS-LLaMa-Lite on approximately 20 billion tokens of data.
We chronicle the full training trajectory, documenting how evolving validation loss levels and downstream benchmarks reflect transitions from incoherent text to fluent, contextually grounded output.
By detailing these experiences and offering training logs, checkpoints, and sample outputs, we aim to guide future researchers and practitioners in refining their pretraining strategies.
arXiv Detail & Related papers (2024-12-17T21:15:52Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z) - Exploring the Impact of Instruction Data Scaling on Large Language
Models: An Empirical Study on Real-World Use Cases [17.431381376675432]
In this paper we explore the performance of large language models based on instruction tuning across different scales of instruction data.
With Bloomz-7B1-mt as the base model, the results show that merely increasing the amount of instruction data leads to continuous improvement in tasks such as open-ended generation.
We propose potential future research directions such as effectively selecting high-quality training data, scaling base models and training methods specialized for hard tasks.
arXiv Detail & Related papers (2023-03-26T14:49:37Z) - Simfluence: Modeling the Influence of Individual Training Examples by
Simulating Training Runs [27.314239745883967]
Training data attribution (TDA) methods trace a model's prediction on any given example back to specific influential training examples.
We propose Simfluence, a new paradigm for TDA where the goal is not to produce a single influence score per example, but instead a training run simulator.
Simfluence captures non-additive interactions and is often able to predict the spiky trajectory of individual example losses with surprising fidelity.
arXiv Detail & Related papers (2023-03-14T17:47:25Z) - Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language
Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.