FPM: A Collection of Large-scale Foundation Pre-trained Language Models
- URL: http://arxiv.org/abs/2111.04909v1
- Date: Tue, 9 Nov 2021 02:17:15 GMT
- Title: FPM: A Collection of Large-scale Foundation Pre-trained Language Models
- Authors: Dezhou Shen
- Abstract summary: Using currently effective model structures and the most mainstream training techniques, we release a collection of models that we expect to serve as foundation models in the future.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work in language modeling has shown that training large-scale Transformer models has driven the latest developments in natural language processing applications. However, very little work has been done to unify the currently effective models. In this work, we use currently effective model structures and the most mainstream training techniques to release a collection of models that we expect to serve as foundation models in the future. For Chinese, a GPT-2[9] language model with 10.3 billion parameters was trained on a Chinese dataset, and, in particular, a 2.9 billion parameter language model was trained on dialogue data; a BERT model with 495 million parameters was trained on the Chinese dataset; and a Transformer language model with 5.6 billion parameters was trained on the Chinese dataset. Corresponding training was also carried out for English: a GPT-2 language model with 6.4 billion parameters was trained on an English dataset; a BERT[3] language model with 1.24 billion parameters was trained on the English dataset and, in particular, a 688 million parameter language model was trained using single-card training techniques; and a Transformer language model with 5.6 billion parameters was trained on the English dataset. On the TNEWS classification task evaluated by CLUE[13], the BERT-C model reached 59.99% accuracy, exceeding the 59.46% of ALBERT-xxlarge by 0.53 percentage points. On the QQP classification task evaluated by GLUE[11], an accuracy of 78.95% surpassed the 72.1% of BERT-Large by 6.85 percentage points and the 75.2% of ERNIE, currently first on the GLUE leaderboard, by 3.75 percentage points.
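The reported gains are plain differences in accuracy, stated here in percentage points. A minimal sketch of that arithmetic, using only the figures quoted in the abstract (the `gap` helper and the script itself are illustrative, not part of the paper's code):

```python
# Quick check of the accuracy gaps quoted in the abstract.
# All accuracy values come from the abstract above; the helper
# function and comments are illustrative assumptions, not paper code.

def gap(model_acc: float, baseline_acc: float) -> float:
    """Difference in accuracy, in percentage points."""
    return round(model_acc - baseline_acc, 2)

# TNEWS (CLUE): BERT-C vs. ALBERT-xxlarge
print(gap(59.99, 59.46))   # -> 0.53

# QQP (GLUE): reported model vs. BERT-Large, and vs. ERNIE (GLUE first place)
print(gap(78.95, 72.10))   # -> 6.85
print(gap(78.95, 75.20))   # -> 3.75
```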
Related papers
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z) - Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models that we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z) - Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z) - Training Compute-Optimal Large Language Models [54.00424650998489]
We train language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.
For compute-optimal training, the model size and the number of training tokens should be scaled equally.
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B).
arXiv Detail & Related papers (2022-03-29T13:38:03Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - Zero-Shot Cross-Lingual Transfer in Legal Domain Using Transformer models [0.0]
We study zero-shot cross-lingual transfer from English to French and German under Multi-Label Text Classification.
We extend the EURLEX57K dataset, an English dataset for topic classification of legal documents, with official French and German translations.
We find that language-model fine-tuning of the multilingual pre-trained models (M-DistilBERT, M-BERT) leads to relative improvements of 32.0-34.94% and 76.15-87.54% on the French and German test sets, respectively.
arXiv Detail & Related papers (2021-11-28T16:25:04Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We find that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z) - Scaling End-to-End Models for Large-Scale Multilingual ASR [44.89961662796597]
Building ASR models across many language families is a challenging multi-task learning problem due to large language variations and heavily unbalanced data.
We conduct a capacity study on a 15-language task, with the amount of data per language varying from 7.7K to 54.7K hours.
arXiv Detail & Related papers (2021-04-30T08:24:11Z) - Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya [0.0]
We propose a cost-effective transfer learning method to adopt a strong source language model.
With only 10k examples from the given Tigrinya sentiment analysis dataset, the English XLNet model achieved an F1-score of 78.88%.
Fine-tuning the (English) XLNet model on the CLS dataset yields promising results compared to mBERT, and it even outperformed mBERT on one of the Japanese-language datasets.
arXiv Detail & Related papers (2020-06-13T18:53:22Z) - DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z)