Related papers: PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation

PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation

URL: http://arxiv.org/abs/2312.17276v1
Date: Wed, 27 Dec 2023 11:49:24 GMT
Title: PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation
Authors: Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu, Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao
Abstract summary: We present a new efficient model architecture for large language models (LLMs) We show that PanGu-$pi$-7B can achieve a comparable performance to that of benchmarks with about 10% inference speed-up. In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
Score: 97.78045712375047
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent trend of large language models (LLMs) is to increase the scale of both model size (\aka the number of parameters) and dataset to achieve better generative ability, which is definitely proved by a lot of work such as the famous GPT and Llama. However, large models often involve massive computational costs, and practical applications cannot afford such high prices. However, the method of constructing a strong model architecture for LLMs is rarely discussed. We first analyze the state-of-the-art language model architectures and observe the feature collapse problem. Based on the theoretical analysis, we propose that the nonlinearity is also very important for language models, which is usually studied in convolutional neural networks for vision tasks. The series informed activation function is then introduced with tiny calculations that can be ignored, and an augmented shortcut is further used to enhance the model nonlinearity. We then demonstrate that the proposed approach is significantly effective for enhancing the model nonlinearity through carefully designed ablations; thus, we present a new efficient model architecture for establishing modern, namely, PanGu-$\pi$. Experiments are then conducted using the same dataset and training strategy to compare PanGu-$\pi$ with state-of-the-art LLMs. The results show that PanGu-$\pi$-7B can achieve a comparable performance to that of benchmarks with about 10\% inference speed-up, and PanGu-$\pi$-1B can achieve state-of-the-art performance in terms of accuracy and efficiency. In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application. The results show that YunShan can surpass other models with similar scales on benchmarks.

Related papers

Theoretical Foundations of Scaling Law in Familial Models [46.506708373314375]
We introduce Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D)<n>Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent.<n> Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable.
arXiv Detail & Related papers (2025-12-29T12:01:58Z)
Towards a Comprehensive Scaling Law of Mixture-of-Experts [54.117786590884776]
We propose a comprehensive and precise joint MoE scaling law that considers all essential factors.<n>Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size.<n>Our proposed MoE scaling law could function as an accurate and insightful guidance to facilitate future MoE model design and training.
arXiv Detail & Related papers (2025-09-28T06:35:34Z)
HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling [52.58723853697152]
We propose a Hybrid Architecture Distillation (HAD) approach for DNA sequence modeling.<n>We employ the NTv2-500M as the teacher model and devise a grouping masking strategy.<n>Compared to models with similar parameters, our model achieved excellent performance.
arXiv Detail & Related papers (2025-05-27T07:57:35Z)
An Efficient deep learning model to Predict Stock Price Movement Based on Limit Order Book [11.613073850152873]
In high-frequency trading (HFT), leveraging limit order books (LOB) to model stock price movements is crucial for achieving profitable outcomes.<n>Even recent deep learning models often struggle to capture price movement patterns effectively.<n>We propose a novel approach in which leverages the Siamese architecture to enhance the performance of existing deep learning models.
arXiv Detail & Related papers (2025-05-14T12:46:21Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Scalable Language Models with Posterior Inference of Latent Thought Vectors [52.63299874322121]
Latent-Thought Language Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. LTMs significantly outperform conventional autoregressive models and discrete diffusion models in validation perplexity and zero-shot language modeling.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publically available models. We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models. We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
Rethinking Optimization and Architecture for Tiny Language Models [39.892066839422796]
The application of language models on mobile devices is facing huge challenge on the computation and memory costs. In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical study to analyze the effect of each component. Several design formulas are empirically proved especially effective for tiny language models.
arXiv Detail & Related papers (2024-02-05T07:59:38Z)
Modeling Choice via Self-Attention [8.394221523847325]
We show that our attention-based choice model is a low-optimal generalization of the Halo Multinomial Logit (Halo-MNL) model. We also establish the first realistic-scale benchmark for choice estimation on real data, conducting an evaluation of existing models.
arXiv Detail & Related papers (2023-11-11T11:13:07Z)
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
Graph-Regularized Tensor Regression: A Domain-Aware Framework for Interpretable Multi-Way Financial Modelling [23.030263841031633]
We develop a novel Graph-Regularized Regression (GRTR) framework, whereby knowledge about cross-asset relations is incorporated into the model in the form of a graph Laplacian matrix. By virtue of tensor algebra, the proposed framework is shown to be fully interpretable, both coefficient-wise and dimension-wise. The GRTR model is validated in a multi-way financial forecasting setting and is shown to achieve improved performance at reduced computational costs.
arXiv Detail & Related papers (2022-10-26T13:39:08Z)
Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training. We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark. In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with outrageous large amount of parameters but constant computation cost. We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing. This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.