Rethinking Optimization and Architecture for Tiny Language Models
- URL: http://arxiv.org/abs/2402.02791v2
- Date: Tue, 6 Feb 2024 05:38:26 GMT
- Title: Rethinking Optimization and Architecture for Tiny Language Models
- Authors: Yehui Tang, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai,
Yi-Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang
- Abstract summary: Deploying language models on mobile devices faces severe computation and memory constraints.
In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component.
Several design formulas are empirically shown to be especially effective for tiny language models.
- Score: 39.892066839422796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The power of large language models (LLMs) has been demonstrated
with vast amounts of data and computing resources. However, deploying language
models on mobile devices faces severe computation and memory constraints, so
tiny language models with high performance are urgently required. Because the
training process is highly complex, many details of language model
optimization are seldom studied carefully. In this study, based on a tiny
language model with 1B parameters, we carefully design a series of empirical
studies to analyze the effect of each component. Three perspectives are mainly
discussed, i.e., neural architecture, parameter initialization, and
optimization strategy. Several design formulas are empirically shown to be
especially effective for tiny language models, including tokenizer
compression, architecture tweaking, parameter inheritance, and multiple-round
training. We then train PanGu-$\pi$-1B Pro and PanGu-$\pi$-1.5B Pro on 1.6T
multilingual corpora, following the established formulas. Experimental results
demonstrate that the improved optimization and architecture yield a notable
average improvement of 8.87 on benchmark evaluation sets for PanGu-$\pi$-1B
Pro. Moreover, PanGu-$\pi$-1.5B Pro surpasses a range of SOTA models with
larger model sizes, validating its superior performance. The code is available
at https://github.com/YuchuanTian/RethinkTinyLM.
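As a rough illustration of one of these design formulas, the sketch below shows what parameter inheritance could look like: a shallow student transformer initialized from evenly spaced layers of a larger teacher. The layer-selection rule, model sizes, and module names are assumptions for illustration, not the exact procedure used for PanGu-$\pi$ Pro.

```python
# Hypothetical sketch of parameter inheritance: initialize a shallower "tiny"
# transformer from a subset of a larger model's layers. The evenly spaced
# layer-selection heuristic and the dimensions are illustrative only.
import torch
import torch.nn as nn


def make_model(num_layers: int, d_model: int = 256, n_heads: int = 4) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)


def inherit_parameters(teacher: nn.TransformerEncoder, student: nn.TransformerEncoder) -> None:
    """Copy weights from evenly spaced teacher layers into the student."""
    t_layers, s_layers = teacher.layers, student.layers
    # Pick len(s_layers) teacher layers spread across the full depth.
    idx = torch.linspace(0, len(t_layers) - 1, steps=len(s_layers)).round().long()
    for s_layer, i in zip(s_layers, idx.tolist()):
        s_layer.load_state_dict(t_layers[i].state_dict())


teacher = make_model(num_layers=12)
student = make_model(num_layers=4)
inherit_parameters(teacher, student)

# Sanity check: the student still maps (batch, seq, d_model) -> (batch, seq, d_model).
x = torch.randn(1, 8, 256)
print(student(x).shape)  # torch.Size([1, 8, 256])
```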
Related papers
- PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity
Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs).
We show that PanGu-$\pi$-7B achieves comparable benchmark performance with about a 10% inference speed-up.
In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
- Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates ($<$7%) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
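The sketch below illustrates the general idea of parameter-efficient tuning: freeze a pretrained backbone and leave only a small, named subset of parameters trainable. The particular subset (normalization weights and biases) and the stand-in encoder are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch of parameter-efficient tuning: freeze everything except
# parameters whose names match a small keep-list, then report the trainable
# fraction. The keep-list is a hypothetical choice for illustration.
import torch.nn as nn


def freeze_all_but(model: nn.Module, keep: tuple = ("norm", "bias")) -> float:
    """Freeze every parameter except those whose name contains a keep-list entry.

    Returns the fraction of parameters left trainable.
    """
    trainable, total = 0, 0
    for name, p in model.named_parameters():
        total += p.numel()
        if any(k in name for k in keep):
            p.requires_grad = True
            trainable += p.numel()
        else:
            p.requires_grad = False
    return trainable / total


# Small stand-in encoder; a real experiment would load pretrained vision and
# text towers instead.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
print(f"trainable fraction: {freeze_all_but(encoder):.1%}")
```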
arXiv Detail & Related papers (2023-03-21T14:12:08Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Analyzing Bagging Methods for Language Models [0.5161531917413708]
We perform an analysis of bagging language models and compare single language models to bagged ensembles that are roughly equivalent in terms of final model size.
Our ensembling methods are at best roughly equivalent to single LM baselines.
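As a minimal illustration of bagging language models, the sketch below averages the next-token distributions of several independently initialized models. The toy model and the plain averaging rule are assumptions for illustration, not the paper's specific ensembling variants.

```python
# Minimal sketch of ensembling language models by averaging their next-token
# distributions at the final position.
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 100, d_model: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(tokens))  # (batch, seq, vocab) logits


def ensemble_next_token_probs(models, tokens):
    # Average the softmax outputs of each ensemble member at the last position.
    probs = [m(tokens)[:, -1].softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)


models = [TinyLM() for _ in range(3)]
tokens = torch.randint(0, 100, (1, 8))
print(ensemble_next_token_probs(models, tokens).shape)  # torch.Size([1, 100])
```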
arXiv Detail & Related papers (2022-07-19T06:30:37Z)
- mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z)
- Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency [62.0887259003594]
This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency.
Experiments on nine downstream tasks show several counter-intuitive phenomena.
We present Dynamic Sparsification, a simple approach that allows training the model once and adapting to different model sizes at inference.
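The sketch below illustrates the train-once, adapt-at-inference idea behind such width-adaptive models: the full feed-forward weights are kept, but only the first k hidden units are used at inference time. The slicing rule and module layout are assumptions for illustration, not the paper's Dynamic Sparsification algorithm.

```python
# Rough sketch of slicing a feed-forward block to a smaller width at inference
# while keeping the full set of trained weights.
from typing import Optional

import torch
import torch.nn as nn


class SliceableFFN(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 1024):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor, width: Optional[int] = None) -> torch.Tensor:
        # Use only the first `width` hidden units (full width if unspecified).
        k = width or self.up.out_features
        h = torch.relu(x @ self.up.weight[:k].T + self.up.bias[:k])
        return h @ self.down.weight[:, :k].T + self.down.bias


ffn = SliceableFFN()
x = torch.randn(2, 16, 256)
print(ffn(x, width=256).shape)  # smaller effective FFN
print(ffn(x).shape)             # full-width FFN
```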
arXiv Detail & Related papers (2022-04-06T06:29:52Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
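As a minimal illustration of a sparsely activated mixture-of-experts layer in the spirit of GLaM, the sketch below routes each token to two experts via top-2 gating and mixes their outputs by the gate weights. The expert sizes, the gating rule, and the absence of load balancing are simplifications, not GLaM's actual configuration.

```python
# Minimal sparsely activated mixture-of-experts layer with top-2 gating.
import torch
import torch.nn as nn


class Top2MoE(nn.Module):
    def __init__(self, d_model: int = 64, d_hidden: int = 256, n_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])            # (n_tokens, d_model)
        scores = self.gate(tokens).softmax(dim=-1)     # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(2, dim=-1)        # route each token to 2 experts
        out = torch.zeros_like(tokens)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e           # tokens assigned to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(tokens[mask])
        return out.reshape(x.shape)


moe = Top2MoE()
x = torch.randn(2, 10, 64)
print(moe(x).shape)  # torch.Size([2, 10, 64])
```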
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.