ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training
for Language Understanding and Generation
- URL: http://arxiv.org/abs/2112.12731v1
- Date: Thu, 23 Dec 2021 17:35:48 GMT
- Title: ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training
for Language Understanding and Generation
- Authors: Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong,
Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi Chen,
Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan Bai, Qiuliang Chen, Li Zhao, Shiyong
Li, Peng Sun, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian Wu, Wei Zeng, Ge
Li, Wen Gao, Haifeng Wang
- Abstract summary: GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential.
A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models.
ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks.
- Score: 50.036392756981016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have achieved state-of-the-art results in various
Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up
pre-trained language models can further exploit their enormous potential. A
unified framework named ERNIE 3.0 was recently proposed for pre-training
large-scale knowledge enhanced models and trained a model with 10 billion
parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP
tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a
hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion
parameters on the PaddlePaddle platform. Furthermore, we design a
self-supervised adversarial loss and a controllable language modeling loss to
make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the
computation overhead and carbon emission, we propose an online distillation
framework for ERNIE 3.0 Titan, where the teacher model will teach students and
train itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense
pre-trained model so far. Empirical results show that the ERNIE 3.0 Titan
outperforms the state-of-the-art models on 68 NLP datasets.
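The online distillation framework mentioned above, in which the teacher keeps optimizing its own objective while smaller students are distilled from it during the same training run, can be pictured with a minimal sketch. Everything below (the tiny PyTorch models, the KL-based distillation loss, the optimizer settings) is an illustrative assumption rather than a detail taken from the paper.
```python
# Minimal sketch of the online-distillation idea from the abstract: the teacher
# optimizes its own language-modeling loss while a student is simultaneously
# trained to match the teacher's output distribution in the same step.
# Model sizes, the KL distillation loss, and optimizer settings are illustrative
# assumptions, not details from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 1000
teacher = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
student = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
opt_teacher = torch.optim.Adam(teacher.parameters(), lr=1e-4)
opt_student = torch.optim.Adam(student.parameters(), lr=1e-4)

def online_distillation_step(tokens, targets):
    # 1) The teacher trains itself on the ordinary next-token objective.
    t_logits = teacher(tokens)
    t_loss = F.cross_entropy(t_logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt_teacher.zero_grad()
    t_loss.backward()
    opt_teacher.step()

    # 2) In the same step, the student is distilled from the teacher's
    #    (detached) predictions instead of in a separate post-hoc stage.
    s_logits = student(tokens)
    kd_loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    opt_student.zero_grad()
    kd_loss.backward()
    opt_student.step()
    return t_loss.item(), kd_loss.item()

tokens = torch.randint(0, vocab_size, (8, 16))
targets = torch.randint(0, vocab_size, (8, 16))
print(online_distillation_step(tokens, targets))
```
The point the sketch captures is that teacher and student updates happen in the same training step, so no separate post-hoc distillation pass over the data is needed, which is what reduces the extra computation and carbon footprint.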
Related papers
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver a strong human preference rate on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z) - What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z) - GLM-130B: An Open Bilingual Pre-trained Model [56.694470924635624]
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.
It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained.
arXiv Detail & Related papers (2022-10-05T17:34:44Z) - mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A
Large-Scale Generative Language Model [35.75234515196426]
We present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters.
We show that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results.
arXiv Detail & Related papers (2022-01-28T08:59:57Z) - GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language
Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language
Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
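The fusion of an auto-regressive network with an auto-encoding network described in this entry can be pictured as a shared backbone feeding two branches: one with a causal attention mask for generation and one with bidirectional attention for understanding. The sketch below is a hypothetical layout under that assumption (module sizes, branch names, and mask handling are invented for illustration); it is not the actual ERNIE 3.0 architecture.
```python
# Illustrative sketch (not the actual ERNIE 3.0 implementation): a shared
# transformer backbone feeding two branches, one with a causal mask for
# auto-regressive generation (NLG) and one with bidirectional attention for
# auto-encoding style understanding (NLU). All sizes and names are hypothetical.
import torch
import torch.nn as nn

class SharedBackboneLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(make_layer(), n_layers)  # shared universal layers
        self.nlg_branch = nn.TransformerEncoder(make_layer(), 1)       # generation branch
        self.nlu_branch = nn.TransformerEncoder(make_layer(), 1)       # understanding branch
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, mode="nlg"):
        seq_len = tokens.size(1)
        # Auto-regressive path: causal mask so no position attends to future tokens.
        # Auto-encoding path: no mask, i.e. full bidirectional attention.
        mask = (torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
                if mode == "nlg" else None)
        h = self.backbone(self.embed(tokens), mask=mask)
        h = self.nlg_branch(h, mask=mask) if mode == "nlg" else self.nlu_branch(h)
        return self.lm_head(h)

model = SharedBackboneLM()
tokens = torch.randint(0, 1000, (2, 16))
print(model(tokens, mode="nlg").shape, model(tokens, mode="nlu").shape)
```
In such a layout the lower shared layers learn a common representation, while only thin task-specific branches differ between understanding and generation, which is the property that makes a model of this kind easy to tailor to both kinds of tasks.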