CPM-2: Large-scale Cost-effective Pre-trained Language Models
- URL: http://arxiv.org/abs/2106.10715v3
- Date: Thu, 24 Jun 2021 13:23:42 GMT
- Title: CPM-2: Large-scale Cost-effective Pre-trained Language Models
- Authors: Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen, Chaojun Xiao, Zhenbo
Sun, Yuan Yao, Fanchao Qi, Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng,
Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao Han, Yang Liu, Xiaoyan Zhu,
Maosong Sun
- Abstract summary: We present a suite of cost-effective techniques for using PLMs that addresses the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
- Score: 71.59893315671997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the size of pre-trained language models (PLMs) has grown by
leaps and bounds. However, efficiency issues of these large-scale PLMs limit
their utilization in real-world scenarios. We present a suite of cost-effective
techniques for using PLMs that addresses the efficiency issues of
pre-training, fine-tuning, and inference. (1) We introduce knowledge
inheritance to accelerate the pre-training process by exploiting existing PLMs
instead of training models from scratch. (2) We explore the best practice of
prompt tuning with large-scale PLMs. Compared with conventional fine-tuning,
prompt tuning significantly reduces the number of task-specific parameters. (3)
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs
with limited computational resources. Based on our cost-effective pipeline, we
pre-train two models: an encoder-decoder bilingual model with 11 billion
parameters (CPM-2) and its corresponding MoE version with 198 billion
parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks.
Experimental results show that CPM-2 has excellent general language
intelligence. Moreover, we validate the efficiency of InfMoE when conducting
inference of large-scale models having tens of billions of parameters on a
single GPU. All source code and model parameters are available at
https://github.com/TsinghuaAI/CPM.
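The abstract reports that prompt tuning trains far fewer task-specific parameters than conventional fine-tuning. The snippet below is a minimal, hedged sketch of that general idea, not the CPM-2 codebase: the backbone is frozen and only a short sequence of soft prompt embeddings is trained. The toy backbone, dimensions, and prompt length are illustrative assumptions.

```python
# Hypothetical prompt-tuning wrapper: freeze the PLM, train only soft prompts.
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 16):
        super().__init__()
        self.backbone = backbone
        # Only these embeddings receive gradients during task adaptation.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        for p in self.backbone.parameters():
            p.requires_grad = False                 # freeze all backbone weights

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the trainable prompt to every sequence in the batch.
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, input_embeds], dim=1))

# Toy backbone standing in for a frozen encoder; sizes are arbitrary.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = PromptTunedModel(nn.TransformerEncoder(layer, num_layers=2), embed_dim=64)
out = model(torch.randn(2, 10, 64))                 # (2, 16 + 10, 64)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(out.shape, trainable)                         # only prompt_len * embed_dim trainable
```

With a large frozen backbone, only the prompt embeddings (prompt_len x embed_dim values) are task-specific, which illustrates the reduction the abstract describes.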
Related papers
- Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study [3.5189934649278922]
Large language models (LLMs) like GitHub Copilot struggle with real-world tasks without fine-tuning.
This paper investigates full fine-tuning and various PEFT methods, including LoRA, (IA)3, and prompt tuning.
Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation.
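One of the PEFT methods named above, LoRA, freezes the pretrained weight and learns a low-rank additive update. The sketch below is a generic illustration under assumed ranks and dimensions, not the configuration used in the cited study.

```python
# Generic LoRA-style layer: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction B @ A.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 768
```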
arXiv Detail & Related papers (2024-11-04T09:03:18Z)
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies [85.57899012821211]
Small Language Models (SLMs) are a resource-efficient alternative to Large Language Models (LLMs)
We introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants.
We also introduce MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K.
arXiv Detail & Related papers (2024-04-09T15:36:50Z)
- Optimizing Distributed Training on Frontier for Large Language Models [7.251642875697334]
Training large language models (LLMs) with billions of parameters poses significant challenges and requires considerable computational resources.
This research explores efficient distributed training strategies that exploit the computational power of Frontier, the world's first exascale supercomputer.
arXiv Detail & Related papers (2023-12-20T02:03:15Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware Parameter-Efficient self-Training framework.
We show that UPET achieves substantial improvements in both performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
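As a rough illustration of the zeroth-order idea behind MeZO, the sketch below estimates a gradient direction from two forward passes with a shared random perturbation and updates weights in place, so no backward-pass activations are stored. Hyperparameters and the exact update rule are simplified assumptions, not the paper's recipe.

```python
# Simplified zeroth-order (SPSA-style) step: two forward passes, in-place update.
import torch
import torch.nn as nn

def zo_step(model: nn.Module, loss_fn, eps: float = 1e-3, lr: float = 1e-4, seed: int = 0):
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale: float):
        # Regenerate the same noise from the seed instead of storing it.
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            p.data.add_(scale * eps * torch.randn(p.shape, generator=gen))

    with torch.no_grad():
        perturb(+1.0)
        loss_plus = loss_fn(model)
        perturb(-2.0)                                # now at theta - eps * z
        loss_minus = loss_fn(model)
        perturb(+1.0)                                # restore original weights
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            p.data.add_(-lr * grad_scale * torch.randn(p.shape, generator=gen))

# Toy usage: shrink the output norm of a small linear model.
model, x = nn.Linear(4, 1), torch.randn(8, 4)
for step in range(3):
    zo_step(model, lambda m: m(x).pow(2).mean(), seed=step)
```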
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate that PST performs on par with or better than previous sparsity methods.
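The summary above is terse, so as a hedged stand-in the sketch below shows plain magnitude-based sparse fine-tuning, where only a fixed mask of large weights remains effective during training. PST's actual importance criterion additionally uses a data-driven, low-rank component that is not reproduced here.

```python
# Generic sparse fine-tuning: a data-free magnitude mask gates the weight matrix,
# so only the kept entries receive a non-zero gradient during fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTunedLinear(nn.Module):
    def __init__(self, base: nn.Linear, sparsity: float = 0.9):
        super().__init__()
        self.base = base
        with torch.no_grad():
            k = int(base.weight.numel() * (1 - sparsity))        # weights to keep
            threshold = base.weight.abs().flatten().topk(k).values.min()
            self.register_buffer("mask", (base.weight.abs() >= threshold).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.base.weight * self.mask, self.base.bias)

layer = SparseTunedLinear(nn.Linear(256, 256), sparsity=0.9)
print(int(layer.mask.sum()), "of", layer.mask.numel(), "weights kept")
```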
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
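One concrete instance of this paradigm keeps the backbone frozen and learns only small per-feature rescaling vectors on intermediate activations, in the spirit of the (IA)^3 method from this line of work. The wiring below is a generic sketch under assumed layer sizes, not the cited paper's exact recipe.

```python
# Frozen linear layer whose outputs are rescaled by a tiny trainable vector.
import torch
import torch.nn as nn

class RescaledLinear(nn.Module):
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # backbone stays frozen
        # One trainable scale per output feature, initialised to the identity.
        self.scale = nn.Parameter(torch.ones(base.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) * self.scale

layer = RescaledLinear(nn.Linear(1024, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                                     # 4096 trainable vs ~4.2M frozen
```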
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this capacity also incurs a huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse-activation phenomenon.
We propose MoEfication, which transforms a large model into a mixture-of-experts (MoE) version of equal model size.
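To make the idea concrete, the sketch below partitions a trained FFN's intermediate neurons into "experts" and keeps only the top-scoring groups per token. The grouping and routing heuristics are simplistic placeholders for the paper's method, and a real implementation would skip the unselected experts instead of zeroing them.

```python
# Illustrative FFN-to-MoE split: evaluate all experts, keep only the top-k groups.
import torch
import torch.nn as nn

class MoEfiedFFN(nn.Module):
    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_ff = ffn_in.out_features
        assert d_ff % num_experts == 0
        self.chunk = d_ff // num_experts
        self.ffn_in, self.ffn_out = ffn_in, ffn_out
        self.num_experts, self.top_k = num_experts, top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        h = self.ffn_in(x)                                 # (tokens, d_ff)
        # Score each expert by the mean pre-activation of its neurons (placeholder router).
        scores = h.view(-1, self.num_experts, self.chunk).mean(-1)
        keep = scores.topk(self.top_k, dim=-1).indices     # (tokens, top_k)
        mask = torch.zeros_like(scores).scatter_(1, keep, 1.0)
        mask = mask.repeat_interleave(self.chunk, dim=-1)  # (tokens, d_ff)
        return self.ffn_out(torch.relu(h) * mask)          # zero out unselected experts

moe = MoEfiedFFN(nn.Linear(64, 256), nn.Linear(256, 64))
print(moe(torch.randn(10, 64)).shape)                      # torch.Size([10, 64])
```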
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Large Product Key Memory for Pretrained Language Models [12.932177565788974]
Product key memory (PKM) improves prediction accuracy by efficiently increasing model capacity with negligible computational overhead.
Motivated by the recent success of pretrained language models (PLMs), we investigate how to incorporate a large PKM into PLMs that can be fine-tuned for a wide variety of downstream NLP tasks.
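A product key memory factorizes a large key table into two small sub-key tables, so each half of the query only has to be scored against a small table while still addressing a quadratically larger set of memory slots. The sketch below is a toy, single-query illustration with assumed sizes; batching, multiple heads, and the query network from the original PKM design are omitted.

```python
# Toy product-key memory lookup: two small sub-key tables address n_sub^2 slots.
import torch
import torch.nn as nn

class ProductKeyMemory(nn.Module):
    def __init__(self, d_model: int = 64, n_sub: int = 32, top_k: int = 4):
        super().__init__()
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub, d_model // 2))
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub, d_model // 2))
        self.values = nn.Embedding(n_sub * n_sub, d_model)   # n_sub^2 memory slots
        self.top_k, self.n_sub = top_k, n_sub

    def forward(self, query: torch.Tensor) -> torch.Tensor:  # query: (d_model,)
        q1, q2 = query.chunk(2)
        s1, i1 = (self.sub_keys1 @ q1).topk(self.top_k)       # scores over sub-keys
        s2, i2 = (self.sub_keys2 @ q2).topk(self.top_k)
        # Combining top sub-keys yields top_k^2 candidates out of n_sub^2 slots.
        scores = (s1[:, None] + s2[None, :]).flatten()
        slots = (i1[:, None] * self.n_sub + i2[None, :]).flatten()
        weights = torch.softmax(scores, dim=0)
        return weights @ self.values(slots)                   # weighted value readout

pkm = ProductKeyMemory()
print(pkm(torch.randn(64)).shape)                             # torch.Size([64])
```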
arXiv Detail & Related papers (2020-10-08T10:19:50Z)