Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization
- URL: http://arxiv.org/abs/2402.14270v2
- Date: Fri, 1 Mar 2024 15:21:16 GMT
- Title: Take the Bull by the Horns: Hard Sample-Reweighted Continual Training
Improves LLM Generalization
- Authors: Xuxi Chen, Zhendong Wang, Daouda Sow, Junjie Yang, Tianlong Chen,
Yingbin Liang, Mingyuan Zhou, Zhangyang Wang
- Abstract summary: A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
- Score: 165.98557106089777
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the rapidly advancing arena of large language models (LLMs), a key
challenge is to enhance their capabilities amid a looming shortage of
high-quality training data. Our study starts from an empirical strategy for the
light continual training of LLMs using their original pre-training data sets,
with a specific focus on selective retention of samples that incur moderately
high losses. These samples are deemed informative and beneficial for model
refinement, contrasting with the highest-loss samples, which would be discarded
due to their correlation with data noise and complexity. We then formalize this
strategy into a principled framework of Instance-Reweighted Distributionally
Robust Optimization (IR-DRO). IR-DRO is designed to dynamically prioritize the
training focus on informative samples through an instance reweighting
mechanism, streamlined by a closed-form solution for straightforward
integration into established training protocols. Through rigorous
experimentation with various models and datasets, our findings indicate that
our sample-targeted methods significantly improve LLM performance across
multiple benchmarks, in both continual pre-training and instruction tuning
scenarios. Our codes are available at
https://github.com/VITA-Group/HardFocusTraining.
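As a rough illustration of how an instance-reweighting mechanism with a closed-form solution can be dropped into a standard training loop, here is a minimal PyTorch-style sketch. It assumes the weights take a softmax-over-losses form (as in KL-regularized DRO) and that the very highest-loss samples are masked out as likely noise; the function name, the temperature `tau`, and the `drop_quantile` threshold are illustrative choices, not the paper's exact formulation or settings.

```python
# Minimal sketch of instance-level loss reweighting for continual training.
# Assumptions (not taken from the paper's released code): weights follow a
# softmax-over-losses closed form, and the very highest-loss samples are
# masked out as likely noise; `tau` and `drop_quantile` are illustrative.
import torch
import torch.nn.functional as F

def reweighted_lm_loss(logits, labels, tau=1.0, drop_quantile=0.95):
    # Per-sequence language-modeling loss (ignored positions are labeled -100).
    vocab = logits.size(-1)
    token_loss = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        reduction="none", ignore_index=-100,
    ).view(labels.size(0), -1)
    mask = (labels != -100).float()
    sample_loss = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Discard the very highest-loss samples (treated as noisy or overly complex).
    keep = sample_loss <= torch.quantile(sample_loss.detach(), drop_quantile)

    # Closed-form instance weights: softmax of the (detached) losses with
    # temperature tau, so moderately high-loss samples receive larger weight.
    scores = torch.where(keep, sample_loss.detach() / tau,
                         torch.full_like(sample_loss, float("-inf")))
    weights = torch.softmax(scores, dim=0)
    return (weights * sample_loss).sum()
```

In a training loop this weighted sum would simply replace the usual mean cross-entropy, consistent with the abstract's point that a closed-form weighting allows straightforward integration into established training protocols.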
Related papers
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Feasible Learning [78.6167929413604]
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample.
Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
arXiv Detail & Related papers (2025-01-24T20:39:38Z)
- E2EDiff: Direct Mapping from Noise to Data for Enhanced Diffusion Models [15.270657838960114]
Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across various tasks.
However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions, such as perceptual and adversarial losses, during training.
We propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output.
arXiv Detail & Related papers (2024-12-30T16:06:31Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label-smoothing value during training according to the uncertainty of individual samples (a minimal sketch follows this entry).
Experiments on widely used benchmarks demonstrate that UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
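To make the mechanism described in the Uncertainty Aware Learning entry above concrete, here is a hypothetical sketch of per-sample, uncertainty-dependent label smoothing; the mapping from an uncertainty score in [0, 1] to a smoothing value, the cap `max_smoothing`, and the function name are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of uncertainty-aware label smoothing: samples with higher
# uncertainty receive a larger smoothing value (softer targets).
import torch
import torch.nn.functional as F

def uncertainty_smoothed_loss(logits, labels, uncertainty, max_smoothing=0.2):
    # uncertainty: per-sample score in [0, 1]; the linear mapping below is illustrative.
    smoothing = max_smoothing * uncertainty            # shape: [batch]
    log_probs = F.log_softmax(logits, dim=-1)          # shape: [batch, num_classes]
    nll = -log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    uniform = -log_probs.mean(dim=-1)
    # Standard label-smoothing decomposition, with a per-sample smoothing coefficient.
    loss = (1.0 - smoothing) * nll + smoothing * uniform
    return loss.mean()
```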
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization in the resulting models.
We recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods such as LoRA, and use in-batch negatives unless well-constructed hard negatives are available.
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.