D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
- URL: http://arxiv.org/abs/2406.01375v1
- Date: Mon, 3 Jun 2024 14:40:31 GMT
- Title: D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
- Authors: Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yinghao Ma, Feiyu Duan, Zhiqi Bai, Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
- Abstract summary: We propose the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs.
Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios.
We also extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law.
- Score: 53.622682408251755
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, SlimPajama) and the downstream domain corpus. Existing methods usually rely on laborious grid searches over a set of mixture ratios, which incur high GPU training costs, and there is still no guarantee that the selected ratio is optimal for the specific domain. To address these limitations, and inspired by Scaling Laws for performance prediction, we propose the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to determine the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes from a limited set of small-scale training runs. Moreover, we extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of the proposed D-CPT Law and Cross-Domain D-CPT Law.
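To make the workflow concrete, below is a minimal Python sketch of the D-CPT-Law idea: fit Chinchilla-style parametric laws for domain and general validation loss from a handful of small pilot runs, then query the fitted laws to pick a mixture ratio for a larger run. The functional form, the synthetic "measurements", and every constant here are illustrative assumptions, not the paper's fitted law.

```python
# Hypothetical sketch of a D-CPT-Law-style workflow. The ansatz treats the
# "effective tokens" of each corpus as its share of the total token budget.
import numpy as np
from scipy.optimize import curve_fit

def law(X, E, A, alpha, B, beta):
    """Chinchilla-like ansatz: L = E + A/N^alpha + B/D_eff^beta."""
    N, D_eff = X
    return E + A / N**alpha + B / D_eff**beta

rng = np.random.default_rng(0)
N = rng.choice([1e8, 5e8, 1e9], size=40)   # model sizes of pilot runs
D = rng.choice([1e10, 3e10], size=40)      # total tokens of pilot runs
r = rng.uniform(0.05, 0.95, size=40)       # domain-corpus mixture ratio

# Synthetic "measurements" stand in for pilot-run validation losses.
dom_loss = law((N, r * D), 1.6, 30.0, 0.25, 250.0, 0.28) + rng.normal(0, 0.01, 40)
gen_loss = law((N, (1 - r) * D), 1.8, 35.0, 0.26, 300.0, 0.29) + rng.normal(0, 0.01, 40)

p_dom, _ = curve_fit(law, (N, r * D), dom_loss, p0=[1.5, 20, 0.3, 200, 0.3], maxfev=40000)
p_gen, _ = curve_fit(law, (N, (1 - r) * D), gen_loss, p0=[1.5, 20, 0.3, 200, 0.3], maxfev=40000)

# For a larger target run, pick the ratio minimizing domain loss while
# trading off general loss with a weight lam.
ratios = np.linspace(0.05, 0.95, 91)
N_t, D_t, lam = 7e9, 1e11, 1.0
score = law((np.full_like(ratios, N_t), ratios * D_t), *p_dom) \
      + lam * law((np.full_like(ratios, N_t), (1 - ratios) * D_t), *p_gen)
print(f"predicted optimal mixture ratio: {ratios[np.argmin(score)]:.2f}")
```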
Related papers
- PIPA: Preference Alignment as Prior-Informed Statistical Estimation [57.24096291517857]
We introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework.
PIPA accommodates both paired and unpaired data, as well as answer- and step-level annotations.
By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N.
arXiv Detail & Related papers (2025-02-09T04:31:30Z) - The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.
We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.
We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
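A hedged sketch of this entry's idea: compute the average parameter count over a pruning schedule (dense until 25% of training compute, pruned linearly until 75%, per the summary above) and substitute it for N in a Chinchilla-style law. The coefficients below are the published Chinchilla fits from Hoffmann et al. (2022), used purely for illustration; the paper fits its own.

```python
# Sketch: replace N in a Chinchilla-style law with the average parameter
# count over pre-training under a linear pruning schedule.
import numpy as np

def param_count(step, total_steps, n_dense, final_sparsity=0.75,
                start_frac=0.25, end_frac=0.75):
    """Dense until 25% of training, linearly pruned until 75%, then constant."""
    t = step / total_steps
    if t < start_frac:
        sparsity = 0.0
    elif t < end_frac:
        sparsity = final_sparsity * (t - start_frac) / (end_frac - start_frac)
    else:
        sparsity = final_sparsity
    return n_dense * (1.0 - sparsity)

def chinchilla_loss(n_avg, tokens, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    # Coefficients are the Hoffmann et al. (2022) fits, for illustration only.
    return E + A / n_avg**alpha + B / tokens**beta

total_steps, n_dense, tokens = 10_000, 1e9, 2e10
n_avg = np.mean([param_count(s, total_steps, n_dense) for s in range(total_steps)])
print(f"average params: {n_avg:.3e}")
print(f"predicted loss: {chinchilla_loss(n_avg, tokens):.3f}")
```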
arXiv Detail & Related papers (2025-01-21T20:23:22Z) - The interplay between domain specialization and model size: a case study in the legal domain [8.653321928148547]
We investigate the interplay between domain and model size during continual pre-training under compute-constrained scenarios.
Our goal is to identify a compute-efficient training regime for this scenario.
As model size increases, the compute-effectiveness gap between specialized and general models widens.
arXiv Detail & Related papers (2025-01-03T19:28:53Z) - TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition [39.073835841717184]
Cross-domain few-shot action recognition (CDFSAR) has attracted recent research interest.
This paper proposes a simple yet effective baseline, namely Temporal-Aware Model Tuning (TAMT) for CDFSAR.
Our TAMT adopts a decoupled paradigm: pre-training on source data followed by fine-tuning on target data, which avoids retraining for each of multiple targets that share a single source.
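The decoupled paradigm reads naturally as the following schematic, with hypothetical `pretrain`/`finetune` callables standing in for the actual training routines:

```python
# Schematic of the decoupled paradigm: pre-train once on the source domain,
# then fine-tune a copy per target domain. All names here are placeholders.
import copy

def adapt_to_targets(model, source_data, targets, pretrain, finetune):
    backbone = pretrain(model, source_data)                 # one-time source cost
    return {name: finetune(copy.deepcopy(backbone), data)   # cheap per target
            for name, data in targets.items()}
```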
arXiv Detail & Related papers (2024-11-28T10:38:05Z) - CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models [9.661578977988743]
Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpora.
The data mixture ratio of general corpus and domain-specific corpus, however, has typically been chosen heuristically, leading to sub-optimal training efficiency in practice.
We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data.
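The notion of a critical mixture ratio can be illustrated with toy loss curves: the largest domain-data share whose predicted general loss stays within a tolerance of the all-general baseline. The curves below are placeholders, not the paper's fitted forms.

```python
# Toy illustration of a "critical mixture ratio" under a general-capability
# degradation budget. Both loss curves are synthetic placeholders.
import numpy as np

ratios = np.linspace(0.0, 1.0, 101)          # fraction of domain data
general_loss = 2.0 + 0.8 * ratios**2         # degrades as the domain share grows
domain_loss = 3.0 - 1.5 * np.sqrt(ratios)    # improves with more domain data

tolerance = 0.05                             # acceptable general degradation
ok = general_loss <= general_loss[0] + tolerance
cmr = ratios[ok].max()
print(f"critical mixture ratio: {cmr:.2f}, "
      f"domain loss there: {np.interp(cmr, ratios, domain_loss):.2f}")
```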
arXiv Detail & Related papers (2024-07-24T17:59:02Z) - DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE).
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
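In the spirit of DoGE, domain reweighting can be sketched as an exponentiated-gradient update that upweights domains whose gradients align with a generalization (target) gradient; this schematic update rule is an assumption for illustration, not the paper's exact algorithm.

```python
# Schematic gradient-alignment-based domain reweighting.
import numpy as np

def update_domain_weights(domain_grads, target_grad, weights, lr=0.1):
    """domain_grads: (k, d) per-domain gradients; target_grad: (d,)."""
    scores = domain_grads @ target_grad        # alignment per domain
    weights = weights * np.exp(lr * scores)    # exponentiated-gradient step
    return weights / weights.sum()             # renormalize to a mixture

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))                    # 4 domains, toy 8-dim gradients
w = update_domain_weights(g, g[1] + 0.1 * rng.normal(size=8), np.full(4, 0.25))
print(w)                                       # domain 1 gets the largest weight
```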
arXiv Detail & Related papers (2023-10-23T22:51:58Z) - FIXED: Frustratingly Easy Domain Generalization with Mixup [53.782029033068675]
Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains.
A popular strategy is to augment training data to benefit generalization through methods such as Mixup [Zhang et al., 2018].
We propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX).
Our approach significantly outperforms nine state-of-the-art related methods, beating the best-performing baseline by 6.5% on average in terms of test accuracy.
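The core operation behind feature-level Mixup across domains can be sketched as follows; the encoder outputs here are random placeholders, and the paper's method additionally enforces domain invariance before mixing.

```python
# Minimal sketch of cross-domain feature-level Mixup: interpolate
# intermediate features and one-hot labels from two domains.
import numpy as np

def feature_mixup(feats_a, feats_b, labels_a, labels_b, alpha=0.2, rng=None):
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # Mixup coefficient
    mixed_x = lam * feats_a + (1 - lam) * feats_b
    mixed_y = lam * labels_a + (1 - lam) * labels_b
    return mixed_x, mixed_y, lam

rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))  # encoder outputs
ya = np.eye(10)[rng.integers(0, 10, 16)]                       # one-hot labels
yb = np.eye(10)[rng.integers(0, 10, 16)]
x_mix, y_mix, lam = feature_mixup(fa, fb, ya, yb, rng=rng)
print(lam, x_mix.shape, y_mix.shape)
```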
arXiv Detail & Related papers (2022-11-07T09:38:34Z) - LAMA-Net: Unsupervised Domain Adaptation via Latent Alignment and Manifold Learning for RUL Prediction [0.0]
We propose LAMA-Net, an encoder-decoder based model (Transformer) with an induced bottleneck, latent alignment using Maximum Mean Discrepancy (MMD), and manifold learning.
The proposed method offers a promising approach to perform domain adaptation in RUL prediction.
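The latent-alignment ingredient can be illustrated with a biased estimator of squared MMD between source and target latent codes under an RBF kernel; the full Transformer encoder-decoder, induced bottleneck, and manifold-learning pieces are omitted.

```python
# Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel.
import numpy as np

def rbf(x, y, sigma=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(source, target, sigma=1.0):
    """Squared MMD between two sets of latent vectors, shape (n, d)."""
    return (rbf(source, source, sigma).mean()
            + rbf(target, target, sigma).mean()
            - 2 * rbf(source, target, sigma).mean())

rng = np.random.default_rng(0)
zs = rng.normal(0.0, 1.0, size=(32, 16))   # source latents
zt = rng.normal(0.5, 1.0, size=(32, 16))   # shifted target latents
print(f"MMD^2 = {mmd2(zs, zt):.4f}")       # > 0 when the distributions differ
```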
arXiv Detail & Related papers (2022-08-17T16:28:20Z) - Disentangled Modeling of Domain and Relevance for Adaptable Dense Retrieval [54.349418995689284]
We propose a novel Dense Retrieval (DR) framework named Disentangled Dense Retrieval (DDR), which pairs a Relevance Estimation Module (REM) with Domain Adaption Modules (DAMs) to support effective domain adaptation for DR models.
By making the REM and DAMs disentangled, DDR enables a flexible training paradigm in which REM is trained with supervision once and DAMs are trained with unsupervised data.
DDR significantly improves ranking performance compared to strong DR baselines and substantially outperforms traditional retrieval methods in most scenarios.
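That flexible paradigm reads naturally as the following schematic, where all training routines and the `freeze` method are hypothetical stubs: the REM is trained once with supervision and frozen, then one DAM is trained per domain on unlabeled data.

```python
# Schematic of the disentangled training paradigm described above.
def train_ddr(rem, labeled_data, domains, train_supervised, train_unsupervised):
    train_supervised(rem, labeled_data)   # one-time supervised REM training
    rem.freeze()                          # relevance matching stays fixed
    dams = {}
    for name, corpus in domains.items():  # cheap unsupervised adaptation
        dams[name] = train_unsupervised(rem, corpus)
    return rem, dams
```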
arXiv Detail & Related papers (2022-08-11T11:18:50Z) - Rethinking Distributional Matching Based Domain Adaptation [111.15106414932413]
Domain adaptation (DA) is a technique that transfers predictive models trained on a labeled source domain to an unlabeled target domain.
Most popular DA algorithms are based on distributional matching (DM).
In this paper, we first systematically analyze the limitations of DM based methods, and then build new benchmarks with more realistic domain shifts.
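As a generic instance of the DM family being analyzed (not this paper's contribution), a CORAL-style penalty aligns the first and second moments of source and target features:

```python
# CORAL-style distributional-matching penalty: mean + covariance alignment.
import numpy as np

def coral_loss(source, target):
    """DM penalty between two feature matrices of shape (n, d)."""
    mean_term = ((source.mean(0) - target.mean(0)) ** 2).sum()
    cs, ct = np.cov(source, rowvar=False), np.cov(target, rowvar=False)
    cov_term = ((cs - ct) ** 2).sum() / (4 * source.shape[1] ** 2)
    return mean_term + cov_term

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 8))
tgt = rng.normal(0.3, 1.5, size=(100, 8))
print(f"DM penalty: {coral_loss(src, tgt):.4f}")
```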
arXiv Detail & Related papers (2020-06-23T21:55:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.