PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language
Models with Auto-parallel Computation
- URL: http://arxiv.org/abs/2104.12369v1
- Date: Mon, 26 Apr 2021 06:59:36 GMT
- Title: PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language
Models with Auto-parallel Computation
- Authors: Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin
Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong,
Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Qi Guo, Yue Yu, Yan Zhang,
Jin Wang, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han
Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang,
Mingyue Guo, Shanzhi Gu, Gaojun Fan, Yaowei Wang, Xuefeng Jin, Qun Liu,
Yonghong Tian
- Abstract summary: We present our practice of training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters.
PanGu-$\alpha$ is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors.
- Score: 58.31465205357637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Pretrained Language Models (PLMs) have become the new paradigm
for Natural Language Processing (NLP). PLMs with hundreds of billions of
parameters, such as GPT-3, have demonstrated strong performance on natural
language understanding and generation with \textit{few-shot in-context}
learning. In this work, we present our practice of training large-scale
autoregressive language models named PanGu-$\alpha$, with up to 200 billion
parameters. PanGu-$\alpha$ is developed under the MindSpore framework and trained on a
cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is
implemented based on MindSpore Auto-parallel, which composes five parallelism
dimensions to scale the training task to 2048 processors efficiently, including
data parallelism, op-level model parallelism, pipeline model parallelism,
optimizer model parallelism, and rematerialization. To enhance the
generalization ability of PanGu-$\alpha$, we collect 1.1TB of high-quality Chinese
data from a wide range of domains to pretrain the model. We empirically test
the generation ability of PanGu-$\alpha$ in various scenarios including text
summarization, question answering, dialogue generation, etc. Moreover, we
investigate the effect of model scale on few-shot performance across a
broad range of Chinese NLP tasks. The experimental results demonstrate the
superior capabilities of PanGu-$\alpha$ in performing various tasks under
few-shot or zero-shot settings.
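To make the five parallelism dimensions above concrete, the sketch below shows how they might be composed through MindSpore's semi-auto-parallel interface. This is a minimal illustration under assumed MindSpore 1.x/2.x-style APIs, not the released PanGu-$\alpha$ training code: the device counts, pipeline stage count, shard strategies, and toy block are hypothetical choices made only so the dimensions multiply out to 2048 processors.

# Minimal sketch (assumed MindSpore-style APIs, not the PanGu-alpha training code)
# of composing data, op-level, pipeline, and optimizer parallelism plus rematerialization.
import numpy as np
import mindspore as ms
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore.communication import init

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init()  # initialise the collective-communication group on the Ascend cluster

# Data, op-level, pipeline, and optimizer parallelism are declared once in the
# auto-parallel context; the concrete numbers are illustrative only
# (16 data-parallel copies x 8-way op-level sharding x 16 pipeline stages = 2048).
ms.set_auto_parallel_context(
    parallel_mode="semi_auto_parallel",   # operator-level model parallelism
    device_num=2048,                      # total Ascend 910 processors (from the abstract)
    pipeline_stages=16,                   # pipeline model parallelism (assumed stage count)
    enable_parallel_optimizer=True,       # optimizer model parallelism (state sharding)
    full_batch=True,
)

class ToyFFN(nn.Cell):
    """Stand-in feed-forward block showing per-operator shard() and recompute()."""
    def __init__(self, hidden=1024, ffn=4096):
        super().__init__()
        self.w1 = ms.Parameter(ms.Tensor(np.random.randn(hidden, ffn), ms.float16))
        self.w2 = ms.Parameter(ms.Tensor(np.random.randn(ffn, hidden), ms.float16))
        self.matmul1 = ops.MatMul()
        self.matmul2 = ops.MatMul()
        self.act = ops.ReLU()  # stand-in activation
        # Op-level model parallelism: split the batch axis 16 ways (data parallel)
        # and the weight's output axis 8 ways; these strategies are assumptions.
        self.matmul1.shard(((16, 1), (1, 8)))
        self.matmul2.shard(((16, 8), (8, 1)))
        # Rematerialization: drop this cell's activations after the forward pass
        # and recompute them during backpropagation to save device memory.
        self.recompute()

    def construct(self, x):
        return self.matmul2(self.act(self.matmul1(x, self.w1)), self.w2)

# In a full model, each block would also be mapped to a pipeline stage,
# e.g. block.pipeline_stage = stage_index, before wrapping the network for training.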
Related papers
- Investigating the translation capabilities of Large Language Models trained on parallel data only [1.5974665548135587]
Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks.
We introduce PLUME, a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples.
These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones.
arXiv Detail & Related papers (2024-06-13T14:08:56Z)
- Pretrained Generative Language Models as General Learning Frameworks for Sequence-Based Tasks [0.0]
We propose that small pretrained foundational generative language models can be utilized as a general learning framework for sequence-based tasks.
Our proposal overcomes the computational resource, skill set, and timeline challenges associated with training neural networks and language models from scratch.
We demonstrate that 125M, 350M, and 1.3B parameter pretrained foundational language models can be instruction fine-tuned with 10,000-to-1,000,000 instruction examples.
arXiv Detail & Related papers (2024-02-08T12:19:32Z)
- PanGu-$\Sigma$: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing [64.53242758625922]
PanGu-$\Sigma$ is trained on a cluster of Ascend 910 AI processors using the MindSpore framework.
It provides state-of-the-art performance in zero-shot learning of various Chinese NLP downstream tasks.
arXiv Detail & Related papers (2023-03-20T03:39:27Z)
- Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks [77.90900650816046]
We introduce Zemi, a zero-shot semi-parametric language model.
We train Zemi with a novel semi-parametric multitask prompted training paradigm.
Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus.
arXiv Detail & Related papers (2022-10-01T04:08:50Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster; a toy sketch of the token-level pipelining idea follows this list.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
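The TeraPipe entry above can be read alongside the toy schedule below, which illustrates the token-level pipelining idea in plain Python: a long training sequence is cut into token chunks, and different pipeline stages process different chunks in the same time step, shrinking the idle bubble relative to pushing each chunk through all stages before the next one starts. The stage and chunk counts are arbitrary, and nothing here reflects TeraPipe's actual implementation.

# Toy schedule for token-level pipeline parallelism (illustration only).
# With S stages and C token chunks, the pipelined schedule takes S + C - 1 steps,
# versus S * C steps if each chunk had to clear all stages before the next started.

def token_level_schedule(num_stages: int, num_chunks: int):
    """Return, for each time step, the (stage, chunk) pairs that run concurrently."""
    steps = []
    for t in range(num_stages + num_chunks - 1):
        active = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_chunks]
        steps.append(active)
    return steps

if __name__ == "__main__":
    stages, chunks = 4, 8  # arbitrary toy sizes
    schedule = token_level_schedule(stages, chunks)
    for t, active in enumerate(schedule):
        print(f"step {t:2d}: " + ", ".join(f"stage{s} <- chunk{c}" for s, c in active))
    print(f"total steps: {len(schedule)} (vs. {stages * chunks} without overlap)")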
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.