PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
- URL: http://arxiv.org/abs/2303.10845v1
- Date: Mon, 20 Mar 2023 03:39:27 GMT
- Title: PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
- Authors: Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang,
Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory
Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng
Su, Qun Liu, Jun Yao
- Abstract summary: PanGu-Σ is trained on a cluster of Ascend 910 AI processors with the MindSpore framework.
It achieves state-of-the-art performance in zero-shot learning on various Chinese NLP downstream tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling of large language models has greatly improved natural language
understanding, generation, and reasoning. In this work, we develop a system
that trained a trillion-parameter language model on a cluster of Ascend 910 AI
processors with the MindSpore framework, and present the resulting language model
with 1.085T parameters, named PanGu-Σ. With parameters inherited from PanGu-α,
we extend the dense Transformer model to a sparse one with Random Routed Experts
(RRE), and efficiently train the model over 329B tokens using Expert
Computation and Storage Separation (ECSS). This yields a 6.3x increase in
training throughput through heterogeneous computing. Our experimental findings
show that PanGu-Σ delivers state-of-the-art performance in zero-shot
learning on various Chinese NLP downstream tasks. Moreover, it demonstrates
strong abilities when fine-tuned on application data for open-domain dialogue,
question answering, machine translation, and code generation.
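
The abstract names two system techniques, Random Routed Experts (RRE) and Expert Computation and Storage Separation (ECSS), without giving details. Below is a minimal, hypothetical Python sketch of what RRE-style routing could look like: each token is mapped to an expert within a domain group by a fixed hash instead of a learned gating network, so the routing itself has no trainable parameters. All names, shapes, and the hash choice are illustrative assumptions, not the PanGu-Σ implementation.

```python
# Hypothetical sketch of Random Routed Experts (RRE)-style routing.
# Not the PanGu-Sigma code; a toy, parameter-free routing scheme.
import numpy as np

NUM_DOMAINS = 4          # task/domain groups, each owning its own experts (assumed)
EXPERTS_PER_DOMAIN = 8   # experts reserved for each domain group (assumed)
HIDDEN = 16              # toy hidden size

def random_route(token_ids, domain_id):
    """Assign each token to one expert inside its domain group.

    A fixed hash (here a simple modulo) stands in for a random routing table:
    routing is parameter-free, needs no load-balancing loss, and is
    reproducible across devices.
    """
    base = domain_id * EXPERTS_PER_DOMAIN
    return base + (token_ids % EXPERTS_PER_DOMAIN)

def sparse_ffn(hidden_states, token_ids, domain_id, expert_weights):
    """Apply only each token's selected expert (one matmul per active expert)."""
    routes = random_route(token_ids, domain_id)
    out = np.empty_like(hidden_states)
    for expert in np.unique(routes):
        mask = routes == expert
        out[mask] = hidden_states[mask] @ expert_weights[expert]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(NUM_DOMAINS * EXPERTS_PER_DOMAIN, HIDDEN, HIDDEN))
    token_ids = rng.integers(0, 50_000, size=32)     # toy token ids
    hidden_states = rng.normal(size=(32, HIDDEN))
    out = sparse_ffn(hidden_states, token_ids, domain_id=2, expert_weights=weights)
    print(out.shape)  # (32, 16): each token activated exactly one expert
```

ECSS, as the name suggests, appears to be the complementary systems piece: since only a small subset of experts is active per step, their parameters and optimizer states can be kept in host storage and brought to the accelerators on demand; the abstract attributes the 6.3x training-throughput gain to this heterogeneous split.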
Related papers
- Rethinking Optimization and Architecture for Tiny Language Models [39.892066839422796]
The application of language models on mobile devices faces huge challenges in computation and memory costs.
In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component.
Several design formulas are empirically shown to be especially effective for tiny language models.
arXiv Detail & Related papers (2024-02-05T07:59:38Z) - In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder into a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z) - Designing Effective Sparse Expert Models [45.21279650229869]
Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy-efficient path to larger and more capable language models.
However, advancing the state of the art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning.
We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer.
arXiv Detail & Related papers (2022-02-17T21:39:10Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while incurring substantially less training cost compared to dense variants (a toy illustration of such learned gating appears after this list).
It consumes only 1/3 of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z) - PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language
Models with Auto-parallel Computation [58.31465205357637]
We present our practice on training large-scale autoregressive language models named PanGu-$alpha$, with up to 200 billion parameters.
PanGu-$alpha$ is developed under the MindSpore and trained on a cluster of 2048 Ascend 910 AI processors.
arXiv Detail & Related papers (2021-04-26T06:59:36Z)