D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
- URL: http://arxiv.org/abs/2406.01375v1
- Date: Mon, 3 Jun 2024 14:40:31 GMT
- Title: D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
- Authors: Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
- Abstract summary: We propose the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs.
Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios.
We also extend our standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law.
- Score: 53.622682408251755
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, Slim-pajama) and the downstream domain corpus. Existing methods usually rely on laborious grid searches over a set of mixture ratios, which incur high GPU training costs; moreover, there is no guarantee that the selected ratio is optimal for the specific domain. To address these limitations, and inspired by Scaling Laws for performance prediction, we propose to investigate the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. Moreover, we extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where only very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.
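For intuition, here is a minimal sketch of how a D-CPT-style law could be fit and queried in practice. It assumes a generic Chinchilla-like parametric loss form with an added mixture-ratio term, fits its coefficients to a handful of synthetic small-scale runs with scipy, and then scans mixture ratios at a larger hypothetical budget. The functional form, constants, and observations are illustrative assumptions, not the parameterization or data from the paper, which also models general-corpus performance jointly with domain performance rather than a single loss.

```python
# Minimal sketch (not the paper's parameterization): fit an assumed D-CPT-style
# loss surface L(N, D, r) to a few cheap runs, then scan mixture ratios at a
# larger target budget. N = model parameters, D = training tokens, r = fraction
# of domain-specific data in the mixture.
import numpy as np
from scipy.optimize import curve_fit

def assumed_dcpt_loss(X, E, A, alpha, B, beta, C, gamma):
    """Assumed Chinchilla-like form with an extra mixture-ratio term."""
    N, D, r = X
    return E + A / N**alpha + B / D**beta + C / (r + 1e-3)**gamma

# Synthetic small-scale observations (3 model sizes x 3 mixture ratios),
# generated from the assumed form plus noise purely for demonstration.
N_obs = np.repeat([1e8, 4e8, 1e9], 3)
D_obs = np.repeat([2e9, 5e9, 1e10], 3)
r_obs = np.tile([0.1, 0.4, 0.8], 3)
true_params = (1.8, 30.0, 0.28, 50.0, 0.30, 0.25, 0.65)
rng = np.random.default_rng(0)
loss_obs = assumed_dcpt_loss((N_obs, D_obs, r_obs), *true_params)
loss_obs += rng.normal(0.0, 0.01, size=loss_obs.size)

# Fit the law's coefficients from the small-scale runs.
popt, _ = curve_fit(
    assumed_dcpt_loss,
    (N_obs, D_obs, r_obs),
    loss_obs,
    p0=(1.5, 10.0, 0.3, 10.0, 0.3, 0.5, 0.5),
    maxfev=20000,
)

# Query the fitted surface at a larger target budget to pick a mixture ratio.
ratios = np.linspace(0.05, 0.95, 19)
target_N = np.full_like(ratios, 7e9)   # hypothetical 7B-parameter model
target_D = np.full_like(ratios, 5e10)  # hypothetical 50B-token budget
pred = assumed_dcpt_loss((target_N, target_D, ratios), *popt)
print("ratio with lowest predicted domain loss:", ratios[np.argmin(pred)])
```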
Related papers
- Aligning CodeLLMs with Direct Preference Optimization [44.34483822102872]
This work first identifies that the commonly used PPO algorithm may be suboptimal for the alignment of CodeLLMs.
Based only on preference data pairs, DPO lets the model rank data automatically, giving rise to a fine-grained rewarding pattern.
Studies show that our method significantly improves the performance of existing CodeLLMs on benchmarks such as MBPP and HumanEval.
arXiv Detail & Related papers (2024-10-24T09:36:13Z) - CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models [9.661578977988743]
Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus.
The data mixture ratio of general corpus and domain-specific corpus, however, has typically been chosen heuristically, leading to sub-optimal training efficiency in practice.
We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data.
arXiv Detail & Related papers (2024-07-24T17:59:02Z) - Scaling Laws for Downstream Task Performance of Large Language Models [28.904224842085064]
We study how the choice of the pretraining data affects downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score.
With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data.
arXiv Detail & Related papers (2024-02-06T17:31:20Z) - DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE).
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
arXiv Detail & Related papers (2023-10-23T22:51:58Z) - FIXED: Frustratingly Easy Domain Generalization with Mixup [53.782029033068675]
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains.
A popular strategy is to augment training data to benefit generalization through methods such as Mixup (Zhang et al., 2018).
We propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX).
Our approach significantly outperforms nine state-of-the-art related methods, beating the best performing baseline by 6.5% on average in terms of test accuracy.
arXiv Detail & Related papers (2022-11-07T09:38:34Z) - Domain-Specific Risk Minimization for Out-of-Distribution Generalization [104.17683265084757]
We first establish a generalization bound that explicitly considers the adaptivity gap.
We propose effective gap estimation methods for guiding the selection of a better hypothesis for the target domain.
A further method minimizes the gap directly by adapting model parameters using online target samples.
arXiv Detail & Related papers (2022-08-18T06:42:49Z) - LAMA-Net: Unsupervised Domain Adaptation via Latent Alignment and Manifold Learning for RUL Prediction [0.0]
We propose LAMA-Net, an encoder-decoder based model (Transformer) with an induced bottleneck, Latent Alignment using Maximum Mean Discrepancy (MMD), and manifold learning.
The proposed method offers a promising approach to perform domain adaptation in RUL prediction.
arXiv Detail & Related papers (2022-08-17T16:28:20Z) - Disentangled Modeling of Domain and Relevance for Adaptable Dense Retrieval [54.349418995689284]
We propose a novel Dense Retrieval (DR) framework named Disentangled Dense Retrieval (DDR) to support effective domain adaptation for DR models.
By disentangling the Relevance Estimation Module (REM) from the Domain Adaption Modules (DAMs), DDR enables a flexible training paradigm in which the REM is trained once with supervision and the DAMs are trained with unsupervised data.
DDR significantly improves ranking performance compared to strong DR baselines and substantially outperforms traditional retrieval methods in most scenarios.
arXiv Detail & Related papers (2022-08-11T11:18:50Z) - Model-Based Domain Generalization [96.84818110323518]
We propose a novel approach for the domain generalization problem called Model-Based Domain Generalization.
Our algorithms beat the current state-of-the-art methods on the very-recently-proposed WILDS benchmark by up to 20 percentage points.
arXiv Detail & Related papers (2021-02-23T00:59:02Z) - Rethinking Distributional Matching Based Domain Adaptation [111.15106414932413]
Domain adaptation (DA) is a technique that transfers predictive models trained on a labeled source domain to an unlabeled target domain.
Most popular DA algorithms are based on distributional matching (DM).
In this paper, we first systematically analyze the limitations of DM based methods, and then build new benchmarks with more realistic domain shifts.
arXiv Detail & Related papers (2020-06-23T21:55:14Z)