CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
- URL: http://arxiv.org/abs/2407.17467v2
- Date: Mon, 7 Oct 2024 05:16:25 GMT
- Title: CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
- Authors: Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan
- Abstract summary: Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus.
The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice.
We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data.
- Score: 9.661578977988743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we attempt to re-visit the scaling behavior of LLMs under the hood of CPT, and discover a power-law relationship between loss, mixture ratio, and training tokens scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking the balance, CMR maintains the model's general ability and achieves the desired domain transfer, ensuring the highest utilization of available resources. Considering the balance between efficiency and effectiveness, CMR can be regarded as the optimal mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose CMR scaling law and have substantiated its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and domain-specific performance while efficiently managing training resources.
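The abstract reports a power-law relationship among loss, mixture ratio, and training tokens, but the exact functional form is not given in this summary. The sketch below is a minimal, hedged illustration: it assumes a simple parametrization L(r, D) = E + A (rD)^(-alpha) for the general-domain loss, fits it to synthetic runs, and picks a critical mixture ratio with a tolerance-based criterion. The parametric form, the tolerance criterion, and all names and numbers are assumptions for illustration, not the paper's.

```python
# Illustrative sketch only: the functional form and data below are assumed,
# not taken from the CMR paper.
import numpy as np
from scipy.optimize import curve_fit

def general_loss(x, E, A, alpha):
    # Hypothetical power law: irreducible loss E plus a term that shrinks
    # as more general tokens (r * D) are replayed during CPT.
    r, D = x
    return E + A * (r * D) ** (-alpha)

# Synthetic "runs" standing in for real CPT experiments at several
# general-data ratios r and token budgets D.
rng = np.random.default_rng(0)
r_grid = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
D_grid = np.array([1e9, 5e9, 2e10])
r_obs, D_obs = (m.ravel() for m in np.meshgrid(r_grid, D_grid))
loss_obs = general_loss((r_obs, D_obs), 1.8, 60.0, 0.25) \
           + rng.normal(scale=0.005, size=r_obs.size)

# Fit the three free parameters to the observed runs.
(E, A, alpha), _ = curve_fit(general_loss, (r_obs, D_obs), loss_obs,
                             p0=(1.0, 10.0, 0.3))

# Illustrative CMR criterion: the smallest general-data fraction whose predicted
# general loss stays within a small tolerance of the all-general mixture,
# freeing the rest of the token budget for domain-specific data.
D_budget, tol = 2e10, 0.01
r_cand = np.linspace(0.01, 1.0, 500)
pred = general_loss((r_cand, np.full_like(r_cand, D_budget)), E, A, alpha)
best = E + A * (1.0 * D_budget) ** (-alpha)
cmr = r_cand[np.argmax(pred <= best + tol)]
print(f"fitted E={E:.3f}, A={A:.2f}, alpha={alpha:.3f}; estimated CMR ~ {cmr:.2f}")
```

In practice the fitted law would be evaluated on held-out mixture ratios before trusting the predicted CMR; the point of the sketch is only the shape of the workflow, namely fit a loss surface over (ratio, tokens) and read the critical ratio off the fitted surface.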
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance.
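As a rough illustration of this two-stage recipe, the sketch below fits a per-domain power law of loss versus FLOPs and then trains a small two-layer network from the vector of domain losses to a downstream score. The power-law form, the handling of the irreducible loss, and all data are assumptions for illustration only.

```python
# Illustrative two-stage sketch; functional forms and data are synthetic assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])

# Stage 1: fit L_d(C) = a_d * C^(-b_d) + e_d per data source via log-log least
# squares on the reducible part (the irreducible e_d is treated as known here).
def fit_power_law(losses, irreducible):
    slope, intercept = np.polyfit(np.log(flops), np.log(losses - irreducible), 1)
    return np.exp(intercept), -slope                 # a_d, b_d

domains = {"web": 1.9, "code": 1.2, "math": 1.5}     # assumed irreducible losses
observed = {d: e + 8.0 * flops ** -0.08 + rng.normal(0, 1e-3, flops.size)
            for d, e in domains.items()}
fits = {d: fit_power_law(observed[d], domains[d]) for d in domains}

# Stage 2: a two-layer network maps domain losses to a downstream metric.
X = np.column_stack([observed[d] for d in domains])  # shape (runs, domains)
y = 0.9 - 0.15 * X.mean(axis=1) + rng.normal(0, 1e-3, flops.size)  # synthetic score
mlp = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, y)

# Extrapolate the per-domain losses to a larger budget, then predict downstream.
target_flops = 1e22
pred_losses = [[e + a * target_flops ** -b
                for (d, e), (a, b) in zip(domains.items(), fits.values())]]
print("predicted downstream score:", float(mlp.predict(pred_losses)[0]))
```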
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models [53.622682408251755]
We propose the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs.
Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios.
We also extend our standard D-CPT Law on cross-domain settings and propose the Cross-Domain D-CPT Law.
arXiv Detail & Related papers (2024-06-03T14:40:31Z)
- Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
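A hedged sketch of what a prior-constrained pairwise reward loss could look like: the required margin between chosen and rejected scores is modulated by the pair's length ratio and embedding cosine similarity. PCRM's exact formulation may differ, and every name and weighting below is illustrative.

```python
# Illustrative sketch, not PCRM's exact loss; margins and weights are assumed.
import torch
import torch.nn.functional as F

def prior_constrained_loss(r_chosen, r_rejected, len_chosen, len_rejected,
                           emb_chosen, emb_rejected, base_margin=1.0):
    # Near-duplicate pairs (high cosine similarity) and pairs of similar length
    # get a smaller required margin, limiting how far their reward scores are
    # pushed apart; clearly different pairs get a larger one.
    cos = F.cosine_similarity(emb_chosen, emb_rejected, dim=-1)
    len_ratio = torch.minimum(len_chosen, len_rejected) / torch.maximum(len_chosen, len_rejected)
    margin = base_margin * (1.0 - 0.5 * cos.clamp(0, 1)) * (2.0 - len_ratio)
    # Standard pairwise ranking loss, shifted by the prior-dependent margin.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy usage with random reward scores and sentence embeddings for 4 pairs.
B, d = 4, 16
r_chosen = torch.randn(B, requires_grad=True)
r_rejected = torch.randn(B, requires_grad=True)
loss = prior_constrained_loss(
    r_chosen, r_rejected,
    len_chosen=torch.randint(10, 200, (B,)).float(),
    len_rejected=torch.randint(10, 200, (B,)).float(),
    emb_chosen=torch.randn(B, d), emb_rejected=torch.randn(B, d))
loss.backward()   # in practice this would backpropagate into the reward model
print("prior-constrained loss:", float(loss))
```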
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
- On the Convergence of Zeroth-Order Federated Tuning for Large Language Models [36.277423093218275]
Federated Learning and Large Language Models (LLMs) are ushering in a new era in privacy-preserving natural language processing.
The integration of Memory-efficient Zeroth-Order Optimization into Federated Learning is a synergy we term FedMeZO.
Our study is the first to examine the theoretical underpinnings of FedMeZO in the context of LLMs.
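For intuition, here is a minimal sketch of the memory-efficient zeroth-order ingredient inside a federated loop: each client estimates a gradient from two forward passes along a random direction (SPSA/MeZO style) and the server averages the resulting parameters. This is not the paper's FedMeZO algorithm or its analysis, only an illustration of the mechanism on a toy problem with assumed names and settings.

```python
# Illustrative sketch of federated zeroth-order tuning; not FedMeZO itself.
import numpy as np

def local_zo_step(theta, loss_fn, data, eps=1e-3, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)          # random perturbation direction
    # Two forward passes, no backprop: finite difference along z.
    g_scalar = (loss_fn(theta + eps * z, data) - loss_fn(theta - eps * z, data)) / (2 * eps)
    return theta - lr * g_scalar * z              # SPSA-style update

def fed_round(theta, clients, loss_fn, seed):
    # Each client runs a local zeroth-order step; the server averages.
    updates = [local_zo_step(theta.copy(), loss_fn, d, seed=seed + i)
               for i, d in enumerate(clients)]
    return np.mean(updates, axis=0)

# Toy federated quadratic problem standing in for an LLM fine-tuning loss.
loss = lambda w, d: np.mean((d @ w - 1.0) ** 2)
clients = [np.random.default_rng(i).standard_normal((32, 10)) for i in range(4)]
theta = np.zeros(10)
for t in range(200):
    theta = fed_round(theta, clients, loss, seed=t * 100)
print("final average client loss:", np.mean([loss(theta, d) for d in clients]))
```

Only the scalar projected gradient and the perturbation seed need to be communicated or stored per step, which is what makes this family of methods memory- and bandwidth-light compared with backpropagation-based federated tuning.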
arXiv Detail & Related papers (2024-02-08T18:56:40Z)
- COPR: Continual Learning Human Preference through Optimal Policy Regularization [32.54658750353585]
We propose a new method called Continual Optimal Policy Regularization (COPR).
COPR involves a single learning phase and doesn't necessitate complex reinforcement learning.
Our experimental results show that COPR outperforms strong Continual Learning (CL) baselines.
arXiv Detail & Related papers (2023-10-24T10:05:32Z)
- DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE).
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
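One plausible reading of domain reweighting with a generalization estimate is to upweight training domains whose gradients align with the gradient on held-out target data. The sketch below implements that reading on a toy least-squares problem; it should not be taken as DoGE's actual procedure, and the model, step sizes, and domain shifts are all assumed.

```python
# Hedged illustration of gradient-alignment-based domain reweighting; not DoGE.
import numpy as np

rng = np.random.default_rng(0)
dim = 20
w_star = rng.standard_normal(dim)                 # ground-truth regressor

def make_domain(shift, n=256):
    X = rng.standard_normal((n, dim)) + shift
    return X, X @ w_star + rng.normal(0, 0.1, n)

domains = [make_domain(s) for s in (0.0, 0.3, 1.5)]   # last one is far off-target
X_tgt, y_tgt = make_domain(0.1)                       # held-out target distribution

def grad(theta, X, y):                                # least-squares gradient
    return 2 * X.T @ (X @ theta - y) / len(y)

theta = np.zeros(dim)
log_w = np.zeros(len(domains))                        # domain weights in log-space
for _ in range(300):
    weights = np.exp(log_w) / np.exp(log_w).sum()
    g_dom = [grad(theta, X, y) for X, y in domains]
    g_tgt = grad(theta, X_tgt, y_tgt)
    # Generalization estimate: reward domains whose gradient aligns with the
    # target-set gradient, via an exponentiated-gradient weight update.
    align = np.array([g @ g_tgt for g in g_dom])
    log_w += 0.01 * align / (np.abs(align).max() + 1e-8)
    theta -= 0.01 * sum(w * g for w, g in zip(weights, g_dom))
print("learned domain weights:", np.round(np.exp(log_w) / np.exp(log_w).sum(), 3))
```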
arXiv Detail & Related papers (2023-10-23T22:51:58Z)
- Specificity-Preserving Federated Learning for MR Image Reconstruction [94.58912814426122]
Federated learning can be used to improve data privacy and efficiency in magnetic resonance (MR) image reconstruction.
Recent FL techniques tend to address the domain shift across clients by enhancing the generalization of the global model.
We propose a specificity-preserving FL algorithm for MR image reconstruction (FedMRI).
arXiv Detail & Related papers (2021-12-09T22:13:35Z)
- Cross-Domain Sentiment Classification with Contrastive Learning and Mutual Information Maximization [48.41392004071199]
We propose CLIM: Contrastive Learning with mutual Information Maximization, to explore the potential of CL on cross-domain sentiment classification.
Due to the scarcity of labels on the target domain, we introduce mutual information maximization (MIM) apart from CL to exploit the features that best support the final prediction.
We achieve new state-of-the-art results on the Amazon-review dataset as well as the airlines dataset, showing the efficacy of our proposed method CLIM.
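A minimal sketch of one common way to combine a contrastive (InfoNCE) objective on source-domain pairs with a mutual-information-maximization term on unlabeled target predictions is shown below; CLIM's exact losses, architectures, and weighting may differ, and everything here is a synthetic illustration.

```python
# Hedged sketch of InfoNCE + mutual-information maximization; not CLIM's exact objective.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    # z1[i] and z2[i] are two views/augmentations of the same source example.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau
    return F.cross_entropy(logits, torch.arange(len(z1)))

def mim_loss(target_logits):
    # Maximize I(x; y_hat) = H(mean prediction) - mean H(prediction):
    # confident per-example predictions, but diverse predictions overall.
    p = target_logits.softmax(dim=-1)
    h_cond = -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()
    p_marg = p.mean(dim=0)
    h_marg = -(p_marg * p_marg.clamp_min(1e-8).log()).sum()
    return h_cond - h_marg                      # minimizing this maximizes MI

# Toy batch: encoder outputs for two source views plus target-domain logits.
B, d, n_cls = 8, 32, 2
z1, z2 = torch.randn(B, d, requires_grad=True), torch.randn(B, d)
tgt_logits = torch.randn(B, n_cls, requires_grad=True)
total = info_nce(z1, z2) + 0.5 * mim_loss(tgt_logits)   # weighted combination
total.backward()
print("combined loss:", float(total))
```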
arXiv Detail & Related papers (2020-10-30T06:12:01Z)
- Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as Diffusion Multi-Agent MAML (Dif-MAML).
We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML objective.
Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)
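For illustration, the sketch below runs a diffusion-style decentralized round: each agent takes a local first-order MAML step on its own tasks and then combines parameters with its ring neighbours through a doubly stochastic mixing matrix. The actual Dif-MAML update and its convergence analysis are in the paper; this is only a hedged approximation on toy regression tasks, with all settings assumed.

```python
# Illustrative adapt-then-combine sketch; not the paper's Dif-MAML recursion.
import numpy as np

rng = np.random.default_rng(0)
dim, n_agents = 5, 4

# Ring topology: each agent mixes equally with itself and its two neighbours
# (a doubly stochastic combination matrix).
A = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    for j in (i - 1, i, i + 1):
        A[i, j % n_agents] = 1.0 / 3.0

def sample_task():
    w = rng.standard_normal(dim)                  # task-specific regressor
    def batch(n=20):
        X = rng.standard_normal((n, dim))
        return X, X @ w + rng.normal(0, 0.05, n)
    return batch

def loss_grad(theta, X, y):
    return 2 * X.T @ (X @ theta - y) / len(y)

def local_maml_step(theta, inner_lr=0.05, outer_lr=0.02, n_tasks=4):
    meta_grad = np.zeros_like(theta)
    for _ in range(n_tasks):
        batch = sample_task()
        Xs, ys = batch()                          # support set
        Xq, yq = batch()                          # query set from the same task
        adapted = theta - inner_lr * loss_grad(theta, Xs, ys)   # inner adaptation
        meta_grad += loss_grad(adapted, Xq, yq)   # first-order outer gradient
    return theta - outer_lr * meta_grad / n_tasks

thetas = [np.zeros(dim) for _ in range(n_agents)]
for _ in range(50):
    locally_adapted = [local_maml_step(th) for th in thetas]          # adapt
    thetas = [sum(A[i, j] * locally_adapted[j] for j in range(n_agents))
              for i in range(n_agents)]                               # combine
print("max disagreement across agents:",
      float(max(np.linalg.norm(thetas[0] - t) for t in thetas)))
```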