ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language
Understanding and Generation
- URL: http://arxiv.org/abs/2107.02137v1
- Date: Mon, 5 Jul 2021 16:54:59 GMT
- Title: ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language
Understanding and Generation
- Authors: Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan
Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua
Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan
Ouyang, Dianhai Yu, Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses auto-regressive network and auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
- Score: 25.430130072811075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models have achieved state-of-the-art results in various Natural
Language Processing (NLP) tasks. Recent works such as T5 and GPT-3 have shown
that scaling up pre-trained language models can improve their generalization
abilities. Particularly, the GPT-3 model with 175 billion parameters shows its
strong task-agnostic zero-shot/few-shot learning capabilities. Despite their
success, these large-scale models are trained on plain texts without
introducing knowledge such as linguistic knowledge and world knowledge. In
addition, most large-scale models are trained in an auto-regressive way. As a
result, this kind of traditional fine-tuning approach demonstrates relatively
weak performance when solving downstream language understanding tasks. In order
to solve the above problems, we propose a unified framework named ERNIE 3.0 for
pre-training large-scale knowledge enhanced models. It fuses auto-regressive
network and auto-encoding network, so that the trained model can be easily
tailored for both natural language understanding and generation tasks with
zero-shot learning, few-shot learning or fine-tuning. We trained the model with
10 billion parameters on a 4TB corpus consisting of plain texts and a
large-scale knowledge graph. Empirical results show that the model outperforms
the state-of-the-art models on 54 Chinese NLP tasks, and its English version
achieves the first place on the SuperGLUE benchmark (July 3, 2021), surpassing
the human performance by +0.8% (90.6% vs. 89.8%).
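To make the fused design concrete, below is a minimal PyTorch sketch of the general idea described in the abstract: one shared Transformer backbone feeding an auto-encoding (masked-LM) head for understanding and an auto-regressive (causal-LM) head for generation. This is not the ERNIE 3.0 implementation; the single shared encoder, the two linear heads, the `mode` switch, and all sizes are illustrative assumptions.

```python
# Minimal sketch (not the ERNIE 3.0 code): a shared Transformer backbone with
# an auto-encoding (masked-LM) head for NLU and an auto-regressive (causal-LM)
# head for NLG. All module sizes are illustrative.
import torch
import torch.nn as nn

class UnifiedLM(nn.Module):
    def __init__(self, vocab_size=30_000, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # shared representation module
        self.mlm_head = nn.Linear(d_model, vocab_size)          # auto-encoding (NLU) head
        self.clm_head = nn.Linear(d_model, vocab_size)          # auto-regressive (NLG) head

    def forward(self, token_ids, mode="nlu"):
        x = self.embed(token_ids)
        if mode == "nlg":
            # Causal mask: each position may only attend to earlier positions.
            seq_len = token_ids.size(1)
            mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
            h = self.backbone(x, mask=mask)
            return self.clm_head(h)
        # Bidirectional attention for masked-token prediction.
        h = self.backbone(x)
        return self.mlm_head(h)

model = UnifiedLM()
tokens = torch.randint(0, 30_000, (2, 16))
print(model(tokens, mode="nlu").shape)  # torch.Size([2, 16, 30000])
print(model(tokens, mode="nlg").shape)  # torch.Size([2, 16, 30000])
```

In the paper itself the shared part is a large universal representation module and the task-specific parts are additional Transformer blocks rather than single linear heads; the sketch only illustrates how one set of shared weights can serve both an auto-encoding and an auto-regressive objective.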
Related papers
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered an emergent ability and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- Pretrained Generative Language Models as General Learning Frameworks for Sequence-Based Tasks [0.0]
We propose that small pretrained foundational generative language models can be utilized as a general learning framework for sequence-based tasks.
Our proposal overcomes the computational resource, skill set, and timeline challenges associated with training neural networks and language models from scratch.
We demonstrate that 125M, 350M, and 1.3B parameter pretrained foundational language models can be instruction fine-tuned with 10,000-to-1,000,000 instruction examples.
arXiv Detail & Related papers (2024-02-08T12:19:32Z)
- Large Language Models Are Also Good Prototypical Commonsense Reasoners [11.108562540123387]
Traditional fine-tuning approaches can be resource-intensive and potentially compromise a model's generalization capacity.
We draw inspiration from the outputs of large models for tailored tasks and semi-automatically develop a set of novel prompts.
With better-designed prompts, we achieve a new state of the art (SOTA) on the ProtoQA leaderboard.
arXiv Detail & Related papers (2023-09-22T20:07:24Z)
- mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [50.036392756981016]
GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential.
A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models.
ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks.
arXiv Detail & Related papers (2021-12-23T17:35:48Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- Finetuned Language Models Are Zero-Shot Learners [67.70352207685558]
We show that instruction tuning boosts zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates.
We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.
arXiv Detail & Related papers (2021-09-03T17:55:52Z)
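The last entry above (FLAN) relies on instruction tuning: many supervised NLP datasets are verbalized into natural-language instruction/answer pairs, mixed into one fine-tuning set, and the model is then evaluated zero-shot on task types held out of that mixture. A minimal sketch of the data-formatting step follows; the template wording, example records, and helper names are illustrative assumptions, not FLAN's actual templates.

```python
# Minimal sketch of instruction-tuning data preparation (not FLAN's actual
# templates): labeled examples from several tasks are verbalized into
# (instruction, answer) text pairs and mixed into one fine-tuning set.
from dataclasses import dataclass
import random

@dataclass
class Example:
    task: str
    prompt: str   # natural-language instruction with the input filled in
    target: str   # expected completion

# Hypothetical templates; a real setup would use several phrasings per task.
TEMPLATES = {
    "sentiment": "Is the sentiment of the following review positive or negative?\nReview: {text}\nAnswer:",
    "nli": "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis? Yes or no?\nAnswer:",
}

def verbalize(task: str, record: dict) -> Example:
    """Turn one labeled record into an instruction-following example."""
    prompt = TEMPLATES[task].format(**record)
    return Example(task=task, prompt=prompt, target=" " + record["label"])

raw_data = [
    ("sentiment", {"text": "A delightful film.", "label": "positive"}),
    ("nli", {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "yes"}),
]

mixture = [verbalize(task, rec) for task, rec in raw_data]
random.shuffle(mixture)               # tasks are interleaved during fine-tuning
for ex in mixture:
    print(ex.prompt + ex.target)      # concatenated text fed to the LM loss
```

The concatenated prompt-plus-target strings are then used to fine-tune the pretrained model with a standard language-modeling loss before zero-shot evaluation on unseen task types.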
This list is automatically generated from the titles and abstracts of the papers on this site.