Dynamic Gradient Alignment for Online Data Mixing
- URL: http://arxiv.org/abs/2410.02498v1
- Date: Thu, 3 Oct 2024 14:00:44 GMT
- Title: Dynamic Gradient Alignment for Online Data Mixing
- Authors: Simin Fan, David Grangier, Pierre Ablin
- Abstract summary: The composition of training data mixtures is critical for effectively training large language models (LLMs).
Traditional approaches to this problem include ad-hoc reweighting methods, importance sampling, and gradient alignment techniques.
This paper focuses on gradient alignment and introduces Dynamic Gradient Alignment (DGA), a scalable online gradient alignment algorithm.
- Score: 23.774035761200746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM for a specific task with access to only a few examples. Traditional approaches to this problem include ad-hoc reweighting methods, importance sampling, and gradient alignment techniques. This paper focuses on gradient alignment and introduces Dynamic Gradient Alignment (DGA), a scalable online gradient alignment algorithm. DGA dynamically estimates the pre-training data mixture on which the models' gradients align as well as possible with those of the model on the specific task. DGA is the first gradient alignment approach that incurs minimal overhead compared to standard pre-training and outputs a competitive model, eliminating the need for retraining the model. Experimentally, we demonstrate significant improvements over importance sampling in two key scenarios: (i) when the pre-training set is small and importance sampling overfits due to limited data; and (ii) when there is insufficient specialized data, trapping importance sampling on narrow pockets of data. Our findings underscore the effectiveness of gradient alignment methods in optimizing training data mixtures, particularly in data-constrained environments, and offer a practical solution for enhancing LLM performance on specific tasks with limited data availability.
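To make the idea concrete, below is a minimal sketch of DGA-style online gradient alignment on a toy linear model. It is not the authors' implementation: the cosine-similarity alignment score, the exponentiated-gradient mixture update, and all hyperparameters are assumptions chosen for illustration.

```python
# Minimal sketch of online gradient-alignment data mixing (DGA-style) on a toy
# linear model. Illustrative assumptions: cosine-similarity alignment scores,
# an exponentiated-gradient update for the mixture, and arbitrary step sizes.
import numpy as np

rng = np.random.default_rng(0)

def grad_linear(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

# Toy setup: three pre-training domains and a small specific task (few examples).
dim, n_domains = 8, 3
domains = [(rng.normal(size=(256, dim)), rng.normal(size=256)) for _ in range(n_domains)]
X_task, y_task = rng.normal(size=(16, dim)), rng.normal(size=16)

w = np.zeros(dim)                          # model parameters
mix = np.full(n_domains, 1.0 / n_domains)  # data mixture weights on the simplex
lr, mix_lr = 0.05, 1.0                     # assumed step sizes

for step in range(200):
    # 1) Gradient of the few-shot task loss at the current parameters.
    g_task = grad_linear(w, X_task, y_task)

    # 2) Per-domain gradients estimated on mini-batches from each domain.
    g_dom = []
    for X, y in domains:
        idx = rng.choice(len(y), size=32, replace=False)
        g_dom.append(grad_linear(w, X[idx], y[idx]))

    # 3) Alignment score: cosine similarity between domain and task gradients.
    align = np.array([g @ g_task / (np.linalg.norm(g) * np.linalg.norm(g_task) + 1e-12)
                      for g in g_dom])

    # 4) Online mixture update: upweight domains whose gradients align with the task.
    mix *= np.exp(mix_lr * align)
    mix /= mix.sum()

    # 5) Standard training step on the reweighted pre-training mixture.
    g_mixture = sum(p * g for p, g in zip(mix, g_dom))
    w -= lr * g_mixture

print("final mixture weights:", np.round(mix, 3))
```

In practice the alignment would be estimated from mini-batch gradients that are already available during training; this sketch recomputes them explicitly for clarity.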
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
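A toy sketch of instance-level, loss-based reweighting in this spirit; the softmax mapping and the temperature are illustrative assumptions, not the paper's algorithm.

```python
# Toy sketch of instance-level loss-based reweighting: examples whose loss is
# already very low (likely redundant) receive less weight in the batch loss.
# The softmax mapping and the temperature are illustrative assumptions.
import numpy as np

def instance_weights(losses, temperature=1.0):
    """Map per-example losses to normalized weights (higher loss -> more weight)."""
    scaled = losses / temperature
    w = np.exp(scaled - scaled.max())  # numerically stable softmax
    return w / w.sum()

losses = np.array([0.01, 0.02, 1.3, 0.9, 2.1])  # per-example losses in a batch
weights = instance_weights(losses)
reweighted_batch_loss = float(weights @ losses)
print(np.round(weights, 3), round(reweighted_batch_loss, 3))
```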
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
- LPLgrad: Optimizing Active Learning Through Gradient Norm Sample Selection and Auxiliary Model Training [2.762397703396293]
Loss Prediction Loss with Gradient Norm (LPLgrad) is designed to quantify model uncertainty effectively and improve the accuracy of image classification tasks.
LPLgrad operates in two distinct phases: (i) a Training Phase, which aims to predict the loss for input features by jointly training a main model and an auxiliary model.
This dual-model approach enhances the ability to extract complex input features and learn intrinsic patterns from the data effectively.
arXiv Detail & Related papers (2024-11-20T18:12:59Z)
- Dataset Distillation by Automatic Training Trajectories [13.502303920701163]
We propose a new approach, Automatic Training Trajectories (ATT), which dynamically and adaptively adjusts the trajectory length NS to address the Accumulated Mismatching Problem (AMP).
Our method outperforms existing methods, particularly in cross-architecture tests.
arXiv Detail & Related papers (2024-07-19T12:27:11Z)
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study how predictable model performance is as a function of the mixture proportions.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
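A rough sketch of gradient-similarity data selection in this spirit; the toy linear model, the random projection, and the top-5% rule are illustrative assumptions rather than the LESS implementation.

```python
# Rough sketch of gradient-similarity data selection: project per-example
# gradients to a low-dimensional space, then keep the candidates whose
# gradients best align with the target-task gradient. The toy linear model,
# random projection, and top-5% rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
dim, proj_dim, n_train = 32, 8, 500

def per_example_grads(w, X, y):
    """Per-example gradients of squared error for a linear model y ~ X @ w."""
    residual = X @ w - y                 # shape (n,)
    return 2.0 * residual[:, None] * X   # shape (n, dim)

w = rng.normal(size=dim)
X_train, y_train = rng.normal(size=(n_train, dim)), rng.normal(size=n_train)
X_val, y_val = rng.normal(size=(20, dim)), rng.normal(size=20)  # few target examples

# Low-rank step: a random projection keeps the gradient store small.
P = rng.normal(size=(dim, proj_dim)) / np.sqrt(proj_dim)
G_train = per_example_grads(w, X_train, y_train) @ P          # (n_train, proj_dim)
g_val = per_example_grads(w, X_val, y_val).mean(axis=0) @ P   # (proj_dim,)

# Cosine similarity between each candidate gradient and the target gradient.
scores = (G_train @ g_val) / (
    np.linalg.norm(G_train, axis=1) * np.linalg.norm(g_val) + 1e-12)

selected = np.argsort(-scores)[: n_train // 20]  # keep the top 5% of examples
print("selected indices:", selected[:10])
```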
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
We propose a practical algorithm that efficiently computes data importance on-the-fly during training.
We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling.
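A minimal sketch of that metric, assuming a softmax classifier so the loss derivative with respect to the output has a closed form; the normalization and the unbiasedness correction are illustrative, not the paper's algorithm.

```python
# Minimal sketch of importance sampling driven by the derivative of the loss
# w.r.t. the network output, here for a toy softmax classifier where that
# derivative is softmax(logits) - onehot(label). The normalization and the
# unbiasedness correction are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, dim, n_classes = 1000, 16, 4

W = 0.1 * rng.normal(size=(dim, n_classes))
X = rng.normal(size=(n, dim))
y = rng.integers(0, n_classes, size=n)

# Derivative of the cross-entropy loss w.r.t. the logits for every example.
logits = X @ W
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
dL_dz = probs.copy()
dL_dz[np.arange(n), y] -= 1.0

# Importance = norm of that derivative; normalize into a sampling distribution.
importance = np.linalg.norm(dL_dz, axis=1)
p = importance / importance.sum()

# Draw a mini-batch proportionally to importance and reweight by 1 / (n * p)
# so the resulting gradient estimate remains unbiased.
batch = rng.choice(n, size=64, replace=True, p=p)
batch_weights = 1.0 / (n * p[batch])
print(batch[:10], np.round(batch_weights[:5], 3))
```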
arXiv Detail & Related papers (2023-11-24T13:21:35Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Building Manufacturing Deep Learning Models with Minimal and Imbalanced Training Data Using Domain Adaptation and Data Augmentation [15.333573151694576]
We propose a novel domain adaptation (DA) approach to address the problem of labeled training data scarcity for a target learning task.
Our approach works for scenarios where the source dataset and the dataset available for the target learning task have the same or different feature spaces.
We evaluate our combined approach using image data for wafer defect prediction.
arXiv Detail & Related papers (2023-05-31T21:45:34Z)