Checkpoint Merging via Bayesian Optimization in LLM Pretraining
- URL: http://arxiv.org/abs/2403.19390v1
- Date: Thu, 28 Mar 2024 13:01:18 GMT
- Title: Checkpoint Merging via Bayesian Optimization in LLM Pretraining
- Authors: Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui
- Abstract summary: We propose checkpoint merging in pretraining large language models (LLMs).
Our proposed methodology exhibits the capacity to augment pretraining, presenting an opportunity akin to obtaining substantial benefits at minimal cost.
- Score: 10.743581503931523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. To alleviate this issue, we propose checkpoint merging in pretraining LLM. This method utilizes LLM checkpoints with shared training trajectories, and is rooted in an extensive search space exploration for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) Our proposed methodology exhibits the capacity to augment pretraining, presenting an opportunity akin to obtaining substantial benefits at minimal cost; (2) Our proposed methodology, despite requiring a given held-out dataset, still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining.
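The abstract describes searching for the best merging weight with Bayesian optimization over checkpoints from one training trajectory. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that two checkpoints are merged by linear interpolation with a single scalar weight `lam` in [0, 1], that the held-out loss is a toy squared error, and that the search uses a small hand-written Gaussian process with an expected-improvement acquisition. All names (`merge_checkpoints`, `search_merge_weight`, `held_out_loss`) are hypothetical.

```python
import math
import numpy as np


def merge_checkpoints(ckpt_a, ckpt_b, lam):
    """Linearly interpolate two checkpoints that share parameter names."""
    return {name: lam * ckpt_a[name] + (1.0 - lam) * ckpt_b[name] for name in ckpt_a}


def rbf_kernel(x1, x2, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of weights."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(k_oo, k_oq)  # shape (n_obs, n_query)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_oq * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimizing."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def search_merge_weight(loss_fn, n_init=3, n_iter=10, grid_size=201):
    """Search lam in [0, 1] that minimizes loss_fn, using GP + EI on a grid."""
    grid = np.linspace(0.0, 1.0, grid_size)
    x_obs = np.linspace(0.0, 1.0, n_init)
    y_obs = np.array([loss_fn(x) for x in x_obs])
    for _ in range(n_iter):
        # Standardize losses so the zero-mean, unit-variance prior is sensible.
        y_std = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
        mean, std = gp_posterior(x_obs, y_std, grid)
        ei = expected_improvement(mean, std, y_std.min())
        # Skip weights that were already evaluated.
        ei[np.isclose(grid[:, None], x_obs[None, :]).any(axis=1)] = -np.inf
        lam = grid[int(np.argmax(ei))]
        x_obs = np.append(x_obs, lam)
        y_obs = np.append(y_obs, loss_fn(lam))
    best = int(np.argmin(y_obs))
    return float(x_obs[best]), float(y_obs[best])


# Toy setup (assumption): the "ideal" parameters sit at lam = 0.7 between
# two synthetic checkpoints, so the held-out loss is quadratic in lam.
rng = np.random.default_rng(0)
ckpt_prev = {"w": rng.normal(size=4)}
ckpt_last = {"w": rng.normal(size=4)}
target = merge_checkpoints(ckpt_last, ckpt_prev, 0.7)["w"]


def held_out_loss(lam):
    merged_params = merge_checkpoints(ckpt_last, ckpt_prev, lam)
    return float(np.mean((merged_params["w"] - target) ** 2))


best_lam, best_loss = search_merge_weight(held_out_loss)
merged = merge_checkpoints(ckpt_last, ckpt_prev, best_lam)
```

In a real setting, `held_out_loss` would evaluate the merged model on the held-out dataset the abstract mentions, and each evaluation would be far more expensive than here, which is the usual motivation for a sample-efficient search over the weight.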
Related papers
- A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs [74.35290684163718]
A primary challenge in large language model (LLM) development is their onerous pre-training cost.
This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM).
arXiv Detail & Related papers (2024-10-24T14:31:52Z)
- Efficient Reinforcement Learning with Large Language Model Priors [18.72288751305885]
Large language models (LLMs) have recently emerged as powerful general-purpose tools.
We propose treating LLMs as prior action distributions and integrating them into RL frameworks.
We show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity.
arXiv Detail & Related papers (2024-10-10T13:54:11Z)
- Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs).
MIR is indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- Exploring Scaling Laws for Local SGD in Large Language Model Training [4.125418728284004]
We show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources.
This demonstrates its viability as an alternative to single large-cluster training.
arXiv Detail & Related papers (2024-09-20T04:02:48Z)
- Achieving Peak Performance for Large Language Models: A Systematic Review [0.0]
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP).
As models grow into the trillion-parameter range, computational and memory costs increase significantly.
This makes it difficult for many researchers to access the resources needed to train or apply these models.
arXiv Detail & Related papers (2024-09-07T13:57:41Z)
- LMGT: Optimizing Exploration-Exploitation Balance in Reinforcement Learning through Language Model Guided Trade-offs [27.014415210732103]
We introduce Language Model Guided Trade-offs (i.e., LMGT), a novel, sample-efficient framework for Reinforcement Learning.
arXiv Detail & Related papers (2024-09-07T07:40:43Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
- Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or factually incorrect information.
This survey presents a comprehensive overview of these alignment technologies, including the following aspects.
arXiv Detail & Related papers (2023-07-24T17:44:58Z)
- A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale Machine Learning aims to learn patterns from big data with comparable performance efficiently.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.