D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
- URL: http://arxiv.org/abs/2406.01375v1
- Date: Mon, 3 Jun 2024 14:40:31 GMT
- Title: D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
- Authors: Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
- Abstract summary: We propose the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs.
Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios.
We also extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law.
- Score: 53.622682408251755
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, SlimPajama) and the downstream domain corpus. Existing methods usually rely on laborious grid searches over a set of mixture ratios, which incur high GPU training costs, and there is still no guarantee that the selected ratio is optimal for the specific domain. To address these limitations, and inspired by Scaling Laws for performance prediction, we propose the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to determine the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes from a limited set of small-scale training runs. Moreover, we extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of the proposed D-CPT Law and Cross-Domain D-CPT Law.
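To make the workflow concrete, below is a minimal Python sketch of the D-CPT-Law idea: fit Chinchilla-style parametric laws for domain and general validation loss from a handful of small pilot runs, then query the fitted laws to pick a mixture ratio for a larger run. The functional form, the synthetic "measurements", and every constant here are illustrative assumptions, not the paper's fitted law.

```python
# Hypothetical sketch of a D-CPT-Law-style workflow. The ansatz treats the
# "effective tokens" of each corpus as its share of the total token budget.
import numpy as np
from scipy.optimize import curve_fit

def law(X, E, A, alpha, B, beta):
    """Chinchilla-like ansatz: L = E + A/N^alpha + B/D_eff^beta."""
    N, D_eff = X
    return E + A / N**alpha + B / D_eff**beta

rng = np.random.default_rng(0)
N = rng.choice([1e8, 5e8, 1e9], size=40)   # model sizes of pilot runs
D = rng.choice([1e10, 3e10], size=40)      # total tokens of pilot runs
r = rng.uniform(0.05, 0.95, size=40)       # domain-corpus mixture ratio

# Synthetic "measurements" stand in for pilot-run validation losses.
dom_loss = law((N, r * D), 1.6, 30.0, 0.25, 250.0, 0.28) + rng.normal(0, 0.01, 40)
gen_loss = law((N, (1 - r) * D), 1.8, 35.0, 0.26, 300.0, 0.29) + rng.normal(0, 0.01, 40)

p_dom, _ = curve_fit(law, (N, r * D), dom_loss, p0=[1.5, 20, 0.3, 200, 0.3], maxfev=40000)
p_gen, _ = curve_fit(law, (N, (1 - r) * D), gen_loss, p0=[1.5, 20, 0.3, 200, 0.3], maxfev=40000)

# For a larger target run, pick the ratio minimizing domain loss while
# trading off general loss with a weight lam.
ratios = np.linspace(0.05, 0.95, 91)
N_t, D_t, lam = 7e9, 1e11, 1.0
score = law((np.full_like(ratios, N_t), ratios * D_t), *p_dom) \
      + lam * law((np.full_like(ratios, N_t), (1 - ratios) * D_t), *p_gen)
print(f"predicted optimal mixture ratio: {ratios[np.argmin(score)]:.2f}")
```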
Related papers
- PIPA: Preference Alignment as Prior-Informed Statistical Estimation [57.24096291517857]
We introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework.
PIPA accommodates both paired and unpaired data, as well as answer- and step-level annotations.
By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N.
arXiv Detail & Related papers (2025-02-09T04:31:30Z) - The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.
We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.
We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
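A hedged sketch of this entry's idea: compute the average parameter count over a pruning schedule (dense until 25% of training compute, pruned linearly until 75%, per the summary above) and substitute it for N in a Chinchilla-style law. The coefficients below are the published Chinchilla fits from Hoffmann et al. (2022), used purely for illustration; the paper fits its own.

```python
# Sketch: replace N in a Chinchilla-style law with the average parameter
# count over pre-training under a linear pruning schedule.
import numpy as np

def param_count(step, total_steps, n_dense, final_sparsity=0.75,
                start_frac=0.25, end_frac=0.75):
    """Dense until 25% of training, linearly pruned until 75%, then constant."""
    t = step / total_steps
    if t < start_frac:
        sparsity = 0.0
    elif t < end_frac:
        sparsity = final_sparsity * (t - start_frac) / (end_frac - start_frac)
    else:
        sparsity = final_sparsity
    return n_dense * (1.0 - sparsity)

def chinchilla_loss(n_avg, tokens, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    # Coefficients are the Hoffmann et al. (2022) fits, for illustration only.
    return E + A / n_avg**alpha + B / tokens**beta

total_steps, n_dense, tokens = 10_000, 1e9, 2e10
n_avg = np.mean([param_count(s, total_steps, n_dense) for s in range(total_steps)])
print(f"average params: {n_avg:.3e}")
print(f"predicted loss: {chinchilla_loss(n_avg, tokens):.3f}")
```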
arXiv Detail & Related papers (2025-01-21T20:23:22Z) - The interplay between domain specialization and model size: a case study in the legal domain [8.653321928148547]
We investigate the interplay between domain and model size during continual pre-training under compute-constrained scenarios.
Our goal is to identify a compute-efficient training regime for this scenario.
As model size increases, the compute-effectiveness gap between specialized and general models widens.
arXiv Detail & Related papers (2025-01-03T19:28:53Z) - TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition [39.073835841717184]
Cross-domain few-shot action recognition (CDFSAR) has attracted recent research interest.
This paper proposes a simple yet effective baseline, namely Temporal-Aware Model Tuning (TAMT) for CDFSAR.
Our TAMT adopts a decoupled paradigm: pre-training on source data followed by fine-tuning on target data, which avoids retraining for each of multiple targets that share a single source.
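The decoupled paradigm reads naturally as the following schematic, with hypothetical `pretrain`/`finetune` callables standing in for the actual training routines:

```python
# Schematic of the decoupled paradigm: pre-train once on the source domain,
# then fine-tune a copy per target domain. All names here are placeholders.
import copy

def adapt_to_targets(model, source_data, targets, pretrain, finetune):
    backbone = pretrain(model, source_data)                 # one-time source cost
    return {name: finetune(copy.deepcopy(backbone), data)   # cheap per target
            for name, data in targets.items()}
```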
arXiv Detail & Related papers (2024-11-28T10:38:05Z) - CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models [9.661578977988743]
Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpora.
The data mixture ratio of general corpus and domain-specific corpus, however, has typically been chosen heuristically, leading to sub-optimal training efficiency in practice.
We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data.
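The notion of a critical mixture ratio can be illustrated with toy loss curves: the largest domain-data share whose predicted general loss stays within a tolerance of the all-general baseline. The curves below are placeholders, not the paper's fitted forms.

```python
# Toy illustration of a "critical mixture ratio" under a general-capability
# degradation budget. Both loss curves are synthetic placeholders.
import numpy as np

ratios = np.linspace(0.0, 1.0, 101)          # fraction of domain data
general_loss = 2.0 + 0.8 * ratios**2         # degrades as the domain share grows
domain_loss = 3.0 - 1.5 * np.sqrt(ratios)    # improves with more domain data

tolerance = 0.05                             # acceptable general degradation
ok = general_loss <= general_loss[0] + tolerance
cmr = ratios[ok].max()
print(f"critical mixture ratio: {cmr:.2f}, "
      f"domain loss there: {np.interp(cmr, ratios, domain_loss):.2f}")
```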
arXiv Detail & Related papers (2024-07-24T17:59:02Z) - DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE).
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
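In the spirit of DoGE, domain reweighting can be sketched as an exponentiated-gradient update that upweights domains whose gradients align with a generalization (target) gradient; this schematic update rule is an assumption for illustration, not the paper's exact algorithm.

```python
# Schematic gradient-alignment-based domain reweighting.
import numpy as np

def update_domain_weights(domain_grads, target_grad, weights, lr=0.1):
    """domain_grads: (k, d) per-domain gradients; target_grad: (d,)."""
    scores = domain_grads @ target_grad        # alignment per domain
    weights = weights * np.exp(lr * scores)    # exponentiated-gradient step
    return weights / weights.sum()             # renormalize to a mixture

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))                    # 4 domains, toy 8-dim gradients
w = update_domain_weights(g, g[1] + 0.1 * rng.normal(size=8), np.full(4, 0.25))
print(w)                                       # domain 1 gets the largest weight
```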
arXiv Detail & Related papers (2023-10-23T22:51:58Z) - FIXED: Frustratingly Easy Domain Generalization with Mixup [53.782029033068675]
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains.
A popular strategy is to augment training data to benefit generalization through methods such as Mixup [Zhang et al., 2018].
We propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX).
Our approach significantly outperforms nine state-of-the-art related methods, beating the best-performing baseline by 6.5% on average in terms of test accuracy.
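The core operation behind feature-level Mixup across domains can be sketched as follows; the encoder outputs here are random placeholders, and the paper's method additionally enforces domain invariance before mixing.

```python
# Minimal sketch of cross-domain feature-level Mixup: interpolate
# intermediate features and one-hot labels from two domains.
import numpy as np

def feature_mixup(feats_a, feats_b, labels_a, labels_b, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # Mixup coefficient
    mixed_x = lam * feats_a + (1 - lam) * feats_b
    mixed_y = lam * labels_a + (1 - lam) * labels_b
    return mixed_x, mixed_y, lam

rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))  # encoder outputs
ya = np.eye(10)[rng.integers(0, 10, 16)]                       # one-hot labels
yb = np.eye(10)[rng.integers(0, 10, 16)]
x_mix, y_mix, lam = feature_mixup(fa, fb, ya, yb, rng=rng)
print(lam, x_mix.shape, y_mix.shape)
```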
arXiv Detail & Related papers (2022-11-07T09:38:34Z) - LAMA-Net: Unsupervised Domain Adaptation via Latent Alignment and Manifold Learning for RUL Prediction [0.0]
We propose LAMA-Net, an encoder-decoder based model (Transformer) with an induced bottleneck, latent alignment using Maximum Mean Discrepancy (MMD), and manifold learning.
The proposed method offers a promising approach to perform domain adaptation in RUL prediction.
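The latent-alignment ingredient can be illustrated with a biased estimator of squared MMD between source and target latent codes under an RBF kernel; the full Transformer encoder-decoder, induced bottleneck, and manifold-learning pieces are omitted.

```python
# Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel.
import numpy as np

def rbf(x, y, sigma=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(source, target, sigma=1.0):
    """Squared MMD between two sets of latent vectors, shape (n, d)."""
    return (rbf(source, source, sigma).mean()
            + rbf(target, target, sigma).mean()
            - 2 * rbf(source, target, sigma).mean())

rng = np.random.default_rng(0)
zs = rng.normal(0.0, 1.0, size=(32, 16))   # source latents
zt = rng.normal(0.5, 1.0, size=(32, 16))   # shifted target latents
print(f"MMD^2 = {mmd2(zs, zt):.4f}")       # > 0 when the distributions differ
```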
arXiv Detail & Related papers (2022-08-17T16:28:20Z) - Disentangled Modeling of Domain and Relevance for Adaptable Dense Retrieval [54.349418995689284]
We propose a novel Dense Retrieval (DR) framework named Disentangled Dense Retrieval (DDR), which pairs a Relevance Estimation Module (REM) with Domain Adaption Modules (DAMs) to support effective domain adaptation for DR models.
By making the REM and DAMs disentangled, DDR enables a flexible training paradigm in which REM is trained with supervision once and DAMs are trained with unsupervised data.
DDR significantly improves ranking performance compared to strong DR baselines and substantially outperforms traditional retrieval methods in most scenarios.
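That flexible paradigm reads naturally as the following schematic, where all training routines and the `freeze` method are hypothetical stubs: the REM is trained once with supervision and frozen, then one DAM is trained per domain on unlabeled data.

```python
# Schematic of the disentangled training paradigm described above.
def train_ddr(rem, labeled_data, domains, train_supervised, train_unsupervised):
    train_supervised(rem, labeled_data)   # one-time supervised REM training
    rem.freeze()                          # relevance matching stays fixed
    dams = {}
    for name, corpus in domains.items():  # cheap unsupervised adaptation
        dams[name] = train_unsupervised(rem, corpus)
    return rem, dams
```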
arXiv Detail & Related papers (2022-08-11T11:18:50Z) - Rethinking Distributional Matching Based Domain Adaptation [111.15106414932413]
Domain adaptation (DA) is a technique that transfers predictive models trained on a labeled source domain to an unlabeled target domain.
Most popular DA algorithms are based on distributional matching (DM).
In this paper, we first systematically analyze the limitations of DM based methods, and then build new benchmarks with more realistic domain shifts.
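As a generic instance of the DM family being analyzed (not this paper's contribution), a CORAL-style penalty aligns the first and second moments of source and target features:

```python
# CORAL-style distributional-matching penalty: mean + covariance alignment.
import numpy as np

def coral_loss(source, target):
    """DM penalty between two feature matrices of shape (n, d)."""
    mean_term = ((source.mean(0) - target.mean(0)) ** 2).sum()
    cs, ct = np.cov(source, rowvar=False), np.cov(target, rowvar=False)
    cov_term = ((cs - ct) ** 2).sum() / (4 * source.shape[1] ** 2)
    return mean_term + cov_term

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 8))
tgt = rng.normal(0.3, 1.5, size=(100, 8))
print(f"DM penalty: {coral_loss(src, tgt):.4f}")
```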
arXiv Detail & Related papers (2020-06-23T21:55:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.