GLM-130B: An Open Bilingual Pre-trained Model
- URL: http://arxiv.org/abs/2210.02414v2
- Date: Wed, 25 Oct 2023 05:22:43 GMT
- Title: GLM-130B: An Open Bilingual Pre-trained Model
- Authors: Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding,
Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei
Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang
- Abstract summary: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.
It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained.
- Score: 56.694470924635624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language
model with 130 billion parameters. It is an attempt to open-source a 100B-scale
model at least as good as GPT-3 (davinci) and unveil how models of such a scale
can be successfully pre-trained. Over the course of this effort, we face
numerous unexpected technical and engineering challenges, particularly
regarding loss spikes and divergence. In this paper, we introduce the training
process of GLM-130B, including its design choices, training strategies for both
efficiency and stability, and engineering efforts. The resulting GLM-130B model
significantly outperforms GPT-3 175B (davinci) on a wide range of popular
English benchmarks, an advantage that is not observed for OPT-175B and
BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN
3.0 260B -- the largest Chinese language model -- across related benchmarks.
Finally, we leverage a unique scaling property of GLM-130B to reach INT4
quantization without post-training and with almost no performance loss, making
it the first 100B-scale model to do so and, more importantly, enabling
effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G)
GPUs, the most affordable GPUs required to run 100B-scale models. The GLM-130B model
weights are publicly accessible and its code, training logs, related toolkit,
and lessons learned are open-sourced at
\url{https://github.com/THUDM/GLM-130B/}.
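As a concrete illustration of the INT4 claim above: the quantization is applied to the weights of the linear layers, with activations kept in FP16 and no additional training required. The abstract does not spell out the exact scheme, so the snippet below is only a minimal sketch of one common choice, symmetric per-row (absmax) weight quantization to the 4-bit range; the function names and the int8 storage are illustrative assumptions, not the GLM-130B toolkit's API.

```python
import torch

def quantize_weight_int4(w: torch.Tensor):
    """Symmetric per-row (absmax) quantization of a weight matrix to the
    4-bit integer range [-8, 7]. Codes are kept in an int8 tensor here for
    simplicity; a real kernel would pack two 4-bit codes per byte."""
    # One scale per output row, so the largest magnitude in the row maps to 7.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Expand the codes back to floating point at matmul time; activations
    are left untouched (weight-only quantization, no post-training)."""
    return q.float() * scale

# Tiny usage example on a random stand-in for one linear-layer weight.
w = torch.randn(1024, 1024)
q, s = quantize_weight_int4(w)
err = (w - dequantize(q, s)).abs().mean()
print(f"mean absolute quantization error: {err.item():.5f}")
```

Back-of-the-envelope check of the hardware claim: 130B parameters take roughly 260 GB in FP16 but only about 65 GB at 4 bits per weight, which is why the model can fit into 4$\times$24 GB or 8$\times$11 GB of GPU memory with room left for activations.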
Related papers
- PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency [4.122864669557465]
We introduce PLaMo-100B, a large-scale language model designed for Japanese proficiency.
The model was trained from scratch using 2 trillion tokens.
Benchmark evaluations suggest that PLaMo-100B performs well, particularly in Japanese-specific tasks.
arXiv Detail & Related papers (2024-10-10T02:59:36Z)
- GEB-1.3B: Open Lightweight Large Language Model [12.083014082506281]
We introduce GEB-1.3B, a lightweight large language model (LLM) trained on 550 billion tokens of Chinese and English text.
We employ modern techniques such as RoPE, grouped-query attention, and FlashAttention-2 to accelerate training while maintaining model performance.
GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparable models such as MindLLM-1.3B and TinyLLaMA-1.1B.
The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs.
arXiv Detail & Related papers (2024-06-14T10:15:49Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model [16.27825182552061]
GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile.
Weights will be made freely and openly available to the public through a permissive license.
arXiv Detail & Related papers (2022-04-14T04:00:27Z)
- E-LANG: Energy-Based Joint Inferencing of Super and Swift Language Models [9.36591003178585]
This paper proposes an effective dynamic inference approach, called E-Lang, which distributes the inference between large accurate Super-models and light-weight Swift models.
E-Lang is easily adoptable and architecture agnostic.
Unlike existing methods that are only applicable to encoder-only backbones and classification tasks, our method also works for encoder-decoder structures and sequence-to-sequence tasks such as translation.
arXiv Detail & Related papers (2022-03-01T21:21:27Z)
- ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [50.036392756981016]
GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential.
A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models.
ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks.
arXiv Detail & Related papers (2021-12-23T17:35:48Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored to both natural language understanding and generation tasks.
We trained a model with 10 billion parameters on a 4 TB corpus consisting of plain text and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an extremely large number of parameters at a constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing; a minimal sketch of this routing appears after this list.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is even more effective when training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
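The expert-prototyping entry above describes splitting the experts into $k$ prototype groups and taking the top-1 expert inside each group, so every token activates $k$ experts without a global top-$k$ search. The sketch below is one illustrative reading of that description; the tensor shapes, the gating matrix, the contiguous grouping of experts, and the function names are assumptions for the example, not the paper's implementation.

```python
import torch

def k_top1_routing(hidden: torch.Tensor, gate_w: torch.Tensor, k: int):
    """Expert-prototyping-style routing: split n_experts into k prototype
    groups and pick the top-1 expert inside each group, so each token is
    sent to k experts via k independent top-1 decisions.

    hidden: (tokens, d_model); gate_w: (d_model, n_experts); n_experts % k == 0.
    Returns global expert indices and their within-group softmax weights.
    """
    tokens, _ = hidden.shape
    n_experts = gate_w.shape[1]
    logits = hidden @ gate_w                            # (tokens, n_experts)
    groups = logits.view(tokens, k, n_experts // k)     # k prototype groups
    probs = torch.softmax(groups, dim=-1)               # normalise within each group
    weight, local_idx = probs.max(dim=-1)               # top-1 per group
    # Experts are assumed to be grouped contiguously: group g owns experts
    # [g * n/k, (g+1) * n/k). Convert group-local indices to global ids.
    offsets = torch.arange(k) * (n_experts // k)
    expert_idx = local_idx + offsets                    # (tokens, k)
    return expert_idx, weight

# Usage: 8 experts split into k=2 prototypes -> 2 experts per token.
h = torch.randn(4, 16)
gate = torch.randn(16, 8)
idx, w = k_top1_routing(h, gate, k=2)
print(idx)
print(w)
```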