A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models
- URL: http://arxiv.org/abs/2410.04103v1
- Date: Sat, 5 Oct 2024 10:15:48 GMT
- Title: A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models
- Authors: Zhihao Wang, Shiyu Liu, Jianheng Huang, Zheng Wang, Yixuan Liao, Xiaoxin Chen, Junfeng Yao, Jinsong Su
- Abstract summary: Training paradigms for version updates of Large Language Models (LLMs) include pre-training from scratch (PTFS) and continual pre-training (CPT).
Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost.
Our paradigm comprises one main path, where we pre-train an LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly-added training data.
- Score: 35.44133682914159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost. Moreover, their performance and training cost gaps widen progressively with version updates. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and continual pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning rate path switching training paradigm. Our paradigm comprises one main path, where we pre-train an LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly-added training data. Extensive experiments demonstrate the effectiveness and generalization of our paradigm. Particularly, when training four versions of LLMs, our paradigm reduces the total training cost to 58% of that of PTFS, while maintaining comparable pre-training performance.
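The core of the paradigm is the separation between a main path held at the maximal learning rate and branching paths that undergo a complete learning rate decay for each version update. Below is a minimal Python sketch of what such a two-path schedule could look like; the cosine decay form, learning-rate values, and step counts are illustrative assumptions, not the paper's exact settings.

```python
# A minimal sketch of a two-path (main vs. branching) learning-rate schedule.
# Assumptions (not from the paper): cosine decay on branching paths, and the
# specific LR values and step counts below, which are illustrative only.
import math

def main_path_lr(max_lr: float) -> float:
    """Main path: hold the maximal learning rate so future versions can branch off."""
    return max_lr

def branch_path_lr(step: int, branch_steps: int, max_lr: float, min_lr: float) -> float:
    """Branching path: a complete decay from max_lr down to min_lr over the branch."""
    progress = min(step / branch_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    MAX_LR, MIN_LR = 3e-4, 3e-5      # illustrative values
    BRANCH_STEPS = 10_000            # illustrative length of one version-update branch
    # Each version update copies the latest main-path checkpoint (trained at MAX_LR)
    # and finishes training on its own branch with a full LR decay over the new data.
    for step in (0, BRANCH_STEPS // 2, BRANCH_STEPS):
        print(f"step {step:>6}: main={main_path_lr(MAX_LR):.1e}, "
              f"branch={branch_path_lr(step, BRANCH_STEPS, MAX_LR, MIN_LR):.1e}")
```

Under these assumptions, only the short branching path pays for the decay phase of each new version, while the main-path checkpoints remain reusable for later updates.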
Related papers
- Memorization vs. Reasoning: Updating LLMs with New Knowledge [12.214561228023511]
We introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates.
We present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated "memory" tokens during training.
Our results show that (1) the KUP benchmark is highly challenging, with the best CPT models achieving 2% in the indirect probing (reasoning) setting, and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines.
arXiv Detail & Related papers (2025-04-16T23:03:40Z) - LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications.
Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z) - A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs [74.35290684163718]
A primary challenge in large language model (LLM) development is the onerous cost of pre-training.
This paper explores a promising paradigm for improving LLM pre-training efficiency and quality by leveraging a small language model (SLM).
arXiv Detail & Related papers (2024-10-24T14:31:52Z) - Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs [4.096028601599825]
Large Language Models (LLMs) for public use require continuous pre-training to remain up-to-date with the latest data.
This study aims to find the most compute-efficient strategy to gain up-to-date knowledge and instruction-following capabilities without requiring any instruction data or fine-tuning.
arXiv Detail & Related papers (2024-10-14T17:20:30Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric that indicates the multi-modal pre-training quality of Large Vision-Language Models (LVLMs).
MIR is informative for training data selection, training strategy scheduling, and model architecture design toward better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - Model Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to expedite alignment training with human preferences.
We demonstrate that ExPO boosts a DPO model trained with only 20% of the training steps to outperform the fully-trained one.
We show that ExPO notably improves existing open-source LLMs on the leading AlpacaEval 2.0 and MT-Bench benchmarks.
arXiv Detail & Related papers (2024-04-25T17:39:50Z) - InternLM2 Technical Report [159.70692271378581]
This paper introduces InternLM2, an open-source Large Language Model (LLM) that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks.
The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types.
InternLM2 efficiently captures long-term dependencies, being initially trained on 4k-token contexts before advancing to 32k-token contexts in the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-03-26T00:53:24Z) - Boosting Meta-Training with Base Class Information for Few-Shot Learning [35.144099160883606]
We propose an end-to-end training paradigm consisting of two alternating loops.
In the outer loop, we calculate the cross-entropy loss on the entire training set while updating only the final linear layer.
This training paradigm not only converges quickly but also outperforms existing baselines, indicating that information from the overall training set and the meta-learning training paradigm could mutually reinforce one another.
arXiv Detail & Related papers (2024-03-06T05:13:23Z) - Continual Learning for Large Language Models: A Survey [95.79977915131145]
Large language models (LLMs) are not amenable to frequent re-training, due to high training costs arising from their massive scale.
This paper surveys recent works on continual learning for LLMs.
arXiv Detail & Related papers (2024-02-02T12:34:09Z) - FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs [6.689848416609951]
We study the interplay between unlearning and fairness for large language models (LLMs).
We focus on a popular unlearning framework known as SISA, which creates an ensemble of models trained on disjoint shards.
We propose post-processing bias mitigation techniques for ensemble models produced by SISA.
arXiv Detail & Related papers (2023-12-12T16:44:47Z) - Flipped Classroom: Effective Teaching for Time Series Forecasting [0.0]
Sequence-to-sequence models based on LSTMs and GRUs are among the most popular choices for forecasting time-series data.
The two most common training strategies in this context are teacher forcing (TF) and free running (FR).
We propose several new curricula, and systematically evaluate their performance in two experimental sets.
arXiv Detail & Related papers (2022-10-17T11:53:25Z)