Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
- URL: http://arxiv.org/abs/2110.06696v2
- Date: Thu, 14 Oct 2021 09:00:20 GMT
- Title: Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
- Authors: Zhuosheng Zhang, Hanqing Zhang, Keming Chen, Yuhang Guo, Jingyun Hua,
Yulong Wang, Ming Zhou
- Abstract summary: Mengzi stands for a family of discriminative, generative, domain-specific, and multimodal pre-trained model variants.
Compared with public Chinese PLMs, Mengzi is simple but more powerful.
Our lightweight model has achieved new state-of-the-art results on the widely-used CLUE benchmark.
- Score: 33.83704598544326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pre-trained models (PLMs) have achieved remarkable improvements in a
wide range of NLP tasks, they are expensive in terms of time and resources.
This calls for the study of training more efficient models with less
computation but still ensures impressive performance. Instead of pursuing a
larger scale, we are committed to developing lightweight yet more powerful
models trained with equal or less computation and friendly to rapid deployment.
This technical report releases our pre-trained model called Mengzi, which
stands for a family of discriminative, generative, domain-specific, and
multimodal pre-trained model variants, capable of a wide range of language and
vision tasks. Compared with public Chinese PLMs, Mengzi is simple but more
powerful. Our lightweight model has achieved new state-of-the-art results on
the widely-used CLUE benchmark with our optimized pre-training and fine-tuning
techniques. Without modifying the model architecture, our model can be easily
employed as an alternative to existing PLMs. Our sources are available at
https://github.com/Langboat/Mengzi.
Related papers
- Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model [20.054342930450055]
This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models.
PLRD allows for significant reductions in computational overhead and energy consumption.
Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs.
arXiv Detail & Related papers (2024-06-28T15:27:57Z) - Model Stock: All we need is just a few fine-tuned models [34.449901046895185]
This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance.
We employ significantly fewer models to achieve final weights yet yield superior accuracy.
We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures.
arXiv Detail & Related papers (2024-03-28T15:57:20Z) - Evolutionary Optimization of Model Merging Recipes [21.41838972039297]
We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models.
We propose an evolutionary approach that overcomes the limitation by automatically discovering effective combinations of diverse open-source models.
This work contributes new state-of-the-art models back to the open-source community, and also introduces a new paradigm for automated model composition.
arXiv Detail & Related papers (2024-03-19T22:56:53Z) - MindLLM: Pre-training Lightweight Large Language Model from Scratch,
Evaluations and Domain Applications [46.337078949637345]
We present MindLLM, a novel series of bilingual lightweight large language models, trained from scratch.
A thorough account of experiences accrued during large model development is given, covering every step of the process.
MindLLM consistently matches or surpasses the performance of other open-source larger models on some public benchmarks.
arXiv Detail & Related papers (2023-10-24T12:22:34Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z) - One-stop Training of Multiple Capacity Models [74.87789190840527]
We propose a novel one-stop training framework to jointly train high-capacity and low-capactiy models.
Unlike knowledge distillation, where multiple capacity models are trained from scratch separately, our approach integrates supervisions from different capacity models simultaneously.
arXiv Detail & Related papers (2023-05-23T13:44:09Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO)
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.