GLM-130B: An Open Bilingual Pre-trained Model
- URL: http://arxiv.org/abs/2210.02414v2
- Date: Wed, 25 Oct 2023 05:22:43 GMT
- Title: GLM-130B: An Open Bilingual Pre-trained Model
- Authors: Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding,
Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei
Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang
- Abstract summary: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.
It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained.
- Score: 56.694470924635624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language
model with 130 billion parameters. It is an attempt to open-source a 100B-scale
model at least as good as GPT-3 (davinci) and unveil how models of such a scale
can be successfully pre-trained. Over the course of this effort, we face
numerous unexpected technical and engineering challenges, particularly
regarding loss spikes and divergence. In this paper, we introduce the training
process of GLM-130B, including its design choices, training strategies for both
efficiency and stability, and engineering efforts. The resulting GLM-130B model
significantly outperforms GPT-3 175B (davinci) on a wide range of popular
English benchmarks, an advantage that is not observed for OPT-175B and
BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN
3.0 260B -- the largest Chinese language model -- across related benchmarks.
Finally, we leverage a unique scaling property of GLM-130B to reach INT4
quantization without post-training and with almost no performance loss, making
it the first 100B-scale model to do so and, more importantly, enabling
effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G)
GPUs, the most affordable GPUs required to run 100B-scale models. The GLM-130B model
weights are publicly accessible and its code, training logs, related toolkit,
and lessons learned are open-sourced at
\url{https://github.com/THUDM/GLM-130B/}.
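As a concrete illustration of the INT4 claim above: the quantization is applied to the weights of the linear layers, with activations kept in FP16 and no additional training required. The abstract does not spell out the exact scheme, so the snippet below is only a minimal sketch of one common choice, symmetric per-row (absmax) weight quantization to the 4-bit range; the function names and the int8 storage are illustrative assumptions, not the GLM-130B toolkit's API.

```python
import torch

def quantize_weight_int4(w: torch.Tensor):
    """Symmetric per-row (absmax) quantization of a weight matrix to the
    4-bit integer range [-8, 7]. Codes are kept in an int8 tensor here for
    simplicity; a real kernel would pack two 4-bit codes per byte."""
    # One scale per output row, so the largest magnitude in the row maps to 7.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Expand the codes back to floating point at matmul time; activations
    are left untouched (weight-only quantization, no post-training)."""
    return q.float() * scale

# Tiny usage example on a random stand-in for one linear-layer weight.
w = torch.randn(1024, 1024)
q, s = quantize_weight_int4(w)
err = (w - dequantize(q, s)).abs().mean()
print(f"mean absolute quantization error: {err.item():.5f}")
```

Back-of-the-envelope check of the hardware claim: 130B parameters take roughly 260 GB in FP16 but only about 65 GB at 4 bits per weight, which is why the model can fit into 4$\times$24 GB or 8$\times$11 GB of GPU memory with room left for activations.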
Related papers
- PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency [4.122864669557465]
We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency.
The model was trained from scratch using 2 trillion tokens.
Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks.
arXiv Detail & Related papers (2024-10-10T02:59:36Z)
- GEB-1.3B: Open Lightweight Large Language Model [12.083014082506281]
We introduce GEB-1.3B, a lightweight large language model (LLM) trained on 550 billion tokens of Chinese and English text.
We employ modern techniques such as RoPE, grouped-query attention, and FlashAttention-2 to accelerate training while maintaining model performance.
GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparable models such as MindLLM-1.3B and TinyLLaMA-1.1B.
The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs.
arXiv Detail & Related papers (2024-06-14T10:15:49Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model [16.27825182552061]
GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile.
Weights will be made freely and openly available to the public through a permissive license.
arXiv Detail & Related papers (2022-04-14T04:00:27Z)
- E-LANG: Energy-Based Joint Inferencing of Super and Swift Language Models [9.36591003178585]
This paper proposes an effective dynamic inference approach, called E-Lang, which distributes the inference between large accurate Super-models and light-weight Swift models.
E-Lang is easily adoptable and architecture agnostic.
Unlike existing methods that are only applicable to encoder-only backbones and classification tasks, our method also works for encoder-decoder structures and sequence-to-sequence tasks such as translation.
arXiv Detail & Related papers (2022-03-01T21:21:27Z)
- ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [50.036392756981016]
GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential.
A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models.
ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks.
arXiv Detail & Related papers (2021-12-23T17:35:48Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored to both natural language understanding and generation tasks.
We trained a model with 10 billion parameters on a 4 TB corpus consisting of plain text and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an extremely large number of parameters at a constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing; a minimal sketch of this routing appears after this list.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is even more effective when training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
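The expert-prototyping entry above describes splitting the experts into $k$ prototype groups and taking the top-1 expert inside each group, so every token activates $k$ experts without a global top-$k$ search. The sketch below is one illustrative reading of that description; the tensor shapes, the gating matrix, the contiguous grouping of experts, and the function names are assumptions for the example, not the paper's implementation.

```python
import torch

def k_top1_routing(hidden: torch.Tensor, gate_w: torch.Tensor, k: int):
    """Expert-prototyping-style routing: split n_experts into k prototype
    groups and pick the top-1 expert inside each group, so each token is
    sent to k experts via k independent top-1 decisions.

    hidden: (tokens, d_model); gate_w: (d_model, n_experts); n_experts % k == 0.
    Returns global expert indices and their within-group softmax weights.
    """
    tokens, _ = hidden.shape
    n_experts = gate_w.shape[1]
    logits = hidden @ gate_w                            # (tokens, n_experts)
    groups = logits.view(tokens, k, n_experts // k)     # k prototype groups
    probs = torch.softmax(groups, dim=-1)               # normalise within each group
    weight, local_idx = probs.max(dim=-1)               # top-1 per group
    # Experts are assumed to be grouped contiguously: group g owns experts
    # [g * n/k, (g+1) * n/k). Convert group-local indices to global ids.
    offsets = torch.arange(k) * (n_experts // k)
    expert_idx = local_idx + offsets                    # (tokens, k)
    return expert_idx, weight

# Usage: 8 experts split into k=2 prototypes -> 2 experts per token.
h = torch.randn(4, 16)
gate = torch.randn(16, 8)
idx, w = k_top1_routing(h, gate, k=2)
print(idx)
print(w)
```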