CPM-2: Large-scale Cost-effective Pre-trained Language Models
- URL: http://arxiv.org/abs/2106.10715v3
- Date: Thu, 24 Jun 2021 13:23:42 GMT
- Title: CPM-2: Large-scale Cost-effective Pre-trained Language Models
- Authors: Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen, Chaojun Xiao, Zhenbo
Sun, Yuan Yao, Fanchao Qi, Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng,
Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao Han, Yang Liu, Xiaoyan Zhu,
Maosong Sun
- Abstract summary: We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
- Score: 71.59893315671997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the size of pre-trained language models (PLMs) has grown by
leaps and bounds. However, efficiency issues of these large-scale PLMs limit
their utilization in real-world scenarios. We present a suite of cost-effective
techniques for the use of PLMs to deal with the efficiency issues of
pre-training, fine-tuning, and inference. (1) We introduce knowledge
inheritance to accelerate the pre-training process by exploiting existing PLMs
instead of training models from scratch. (2) We explore the best practice of
prompt tuning with large-scale PLMs. Compared with conventional fine-tuning,
prompt tuning significantly reduces the number of task-specific parameters. (3)
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs
with limited computational resources. Based on our cost-effective pipeline, we
pre-train two models: an encoder-decoder bilingual model with 11 billion
parameters (CPM-2) and its corresponding MoE version with 198 billion
parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks.
Experimental results show that CPM-2 has excellent general language
intelligence. Moreover, we validate the efficiency of InfMoE when conducting
inference of large-scale models having tens of billions of parameters on a
single GPU. All source code and model parameters are available at
https://github.com/TsinghuaAI/CPM.
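Point (2) above, prompt tuning with a frozen PLM, can be pictured with a minimal sketch: all pre-trained weights stay fixed and only a short sequence of continuous prompt vectors is trained per task. The code below is an assumption-laden illustration in PyTorch, with a small Hugging Face T5 checkpoint standing in for CPM-2; the model name, prompt length, and learning rate are illustrative choices, not the paper's actual configuration.

```python
# Minimal prompt-tuning sketch (assumed setup: PyTorch + Hugging Face Transformers).
# Only the soft-prompt vectors are trainable; the PLM itself is frozen.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"  # stand-in checkpoint; CPM-2 itself is an 11B bilingual encoder-decoder
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

for p in model.parameters():            # freeze every pre-trained weight
    p.requires_grad = False

prompt_len = 20                         # number of trainable soft-prompt vectors (assumed)
d_model = model.config.d_model
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

def forward_with_prompt(input_ids, attention_mask, labels):
    # Prepend the trainable prompt vectors to the frozen token embeddings.
    token_embeds = model.get_input_embeddings()(input_ids)
    batch = token_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
    prompt_mask = torch.ones(batch, prompt_len, dtype=attention_mask.dtype)
    full_mask = torch.cat([prompt_mask, attention_mask], dim=1)
    return model(inputs_embeds=inputs_embeds, attention_mask=full_mask, labels=labels)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)   # only the prompt is optimized

# Toy training step (in practice, padded label positions should be set to -100).
batch = tokenizer(["translate English to German: Hello"], return_tensors="pt")
targets = tokenizer(["Hallo"], return_tensors="pt")
loss = forward_with_prompt(batch.input_ids, batch.attention_mask, targets.input_ids).loss
loss.backward()
optimizer.step()
```

Only soft_prompt (prompt_len × d_model values) receives gradients, which is the sense in which prompt tuning "significantly reduces the number of task-specific parameters" compared with updating all 11 billion weights.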
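Point (3), inference of a 198B-parameter MoE model on limited hardware, rests on the observation that each token only activates a few experts, so expert weights can live off-GPU and be fetched on demand. The class below is a hypothetical sketch of that idea, not the InfMoE toolkit; every name in it is invented for illustration, and a real toolkit would typically overlap weight transfers with computation rather than loop serially.

```python
# Hypothetical sketch of offloaded MoE inference (illustration only, not InfMoE itself).
# Expert weights stay in CPU memory; only the experts a batch actually routes to are
# copied to the GPU, so total parameters can far exceed single-GPU memory.
import torch
import torch.nn as nn

class OffloadedMoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts, device=None):
        super().__init__()
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.router = nn.Linear(d_model, n_experts).to(self.device)
        # Experts are built, and kept, on the CPU.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    @torch.no_grad()
    def forward(self, x):                      # x: (n_tokens, d_model) on self.device
        top1 = self.router(x).argmax(dim=-1)   # top-1 routing per token
        out = torch.zeros_like(x)
        for e in top1.unique().tolist():       # fetch only the experts that are needed
            expert = self.experts[e].to(self.device)
            mask = top1 == e
            out[mask] = expert(x[mask])
            self.experts[e].to("cpu")          # release GPU memory before the next expert
        return out

layer = OffloadedMoELayer(d_model=1024, d_ff=4096, n_experts=32)
tokens = torch.randn(8, 1024, device=layer.device)
print(layer(tokens).shape)                     # torch.Size([8, 1024])
```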
Related papers
- CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [17.807249890437767]
We introduce CoLA and its memory-efficient implementation, CoLA-M.
We leverage the low-rank structure observed widely in model activations to reduce model size, boost model capacity and training efficiency.
Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $\mathbf{2\times}$ and improves training throughput by $\mathbf{1.86\times}$ while maintaining full-rank level performance.
arXiv Detail & Related papers (2025-02-16T01:05:16Z) - EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models [25.058673320372677]
Large language models (LLMs), driven by scaling laws, achieve emergent intelligence at large model sizes.
This work proposes pruning-aware pretraining, which focuses on retaining the performance of much larger optimized models.
We show that this approach yields top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary.
arXiv Detail & Related papers (2025-02-10T16:51:03Z) - MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies [85.57899012821211]
Small Language Models (SLMs) are a resource-efficient alternative to Large Language Models (LLMs).
We introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants.
We also introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K.
arXiv Detail & Related papers (2024-04-09T15:36:50Z) - Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised
Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate that PST performs on par with or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this also leads to a huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Large Product Key Memory for Pretrained Language Models [12.932177565788974]
Product key memory (PKM) improves prediction accuracy by efficiently increasing model capacity with negligible computational overhead.
Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be fine-tuned for a wide variety of downstream NLP tasks.
arXiv Detail & Related papers (2020-10-08T10:19:50Z)