PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity
Compensation
- URL: http://arxiv.org/abs/2312.17276v1
- Date: Wed, 27 Dec 2023 11:49:24 GMT
- Title: PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity
Compensation
- Authors: Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie,
Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu,
Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao
Xu, Dacheng Tao
- Abstract summary: We present a new efficient model architecture for large language models (LLMs)
We show that PanGu-$pi$-7B can achieve a comparable performance to that of benchmarks with about 10% inference speed-up.
In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
- Score: 97.78045712375047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent trend of large language models (LLMs) is to increase the scale of
both model size (\aka the number of parameters) and dataset to achieve better
generative ability, which is definitely proved by a lot of work such as the
famous GPT and Llama. However, large models often involve massive computational
costs, and practical applications cannot afford such high prices. However, the
method of constructing a strong model architecture for LLMs is rarely
discussed. We first analyze the state-of-the-art language model architectures
and observe the feature collapse problem. Based on the theoretical analysis, we
propose that the nonlinearity is also very important for language models, which
is usually studied in convolutional neural networks for vision tasks. The
series informed activation function is then introduced with tiny calculations
that can be ignored, and an augmented shortcut is further used to enhance the
model nonlinearity. We then demonstrate that the proposed approach is
significantly effective for enhancing the model nonlinearity through carefully
designed ablations; thus, we present a new efficient model architecture for
establishing modern, namely, PanGu-$\pi$. Experiments are then conducted using
the same dataset and training strategy to compare PanGu-$\pi$ with
state-of-the-art LLMs. The results show that PanGu-$\pi$-7B can achieve a
comparable performance to that of benchmarks with about 10\% inference
speed-up, and PanGu-$\pi$-1B can achieve state-of-the-art performance in terms
of accuracy and efficiency. In addition, we have deployed PanGu-$\pi$-7B in the
high-value domains of finance and law, developing an LLM named YunShan for
practical application. The results show that YunShan can surpass other models
with similar scales on benchmarks.
Related papers
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publically available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z) - Rethinking Optimization and Architecture for Tiny Language Models [39.892066839422796]
The application of language models on mobile devices is facing huge challenge on the computation and memory costs.
In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical study to analyze the effect of each component.
Several design formulas are empirically proved especially effective for tiny language models.
arXiv Detail & Related papers (2024-02-05T07:59:38Z) - Modeling Choice via Self-Attention [8.394221523847325]
We show that our attention-based choice model is a low-optimal generalization of the Halo Multinomial Logit (Halo-MNL) model.
We also establish the first realistic-scale benchmark for choice estimation on real data, conducting an evaluation of existing models.
arXiv Detail & Related papers (2023-11-11T11:13:07Z) - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Graph-Regularized Tensor Regression: A Domain-Aware Framework for
Interpretable Multi-Way Financial Modelling [23.030263841031633]
We develop a novel Graph-Regularized Regression (GRTR) framework, whereby knowledge about cross-asset relations is incorporated into the model in the form of a graph Laplacian matrix.
By virtue of tensor algebra, the proposed framework is shown to be fully interpretable, both coefficient-wise and dimension-wise.
The GRTR model is validated in a multi-way financial forecasting setting and is shown to achieve improved performance at reduced computational costs.
arXiv Detail & Related papers (2022-10-26T13:39:08Z) - Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with outrageous large amount of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.