PAGnol: An Extra-Large French Generative Model
- URL: http://arxiv.org/abs/2110.08554v1
- Date: Sat, 16 Oct 2021 11:44:23 GMT
- Title: PAGnol: An Extra-Large French Generative Model
- Authors: Julien Launay, E.L. Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli, Djamé Seddah
- Abstract summary: We introduce PAGnol, a collection of French GPT models.
Using scaling laws, we efficiently train PAGnol-XL with the same computational budget as CamemBERT.
- Score: 53.40189314359048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Access to large pre-trained models of varied architectures, in many different
languages, is central to the democratization of NLP. We introduce PAGnol, a
collection of French GPT models. Using scaling laws, we efficiently train
PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a
model 13 times smaller. PAGnol-XL is the largest model trained to date for the
French language. We plan to train increasingly large and better-performing versions of
PAGnol, exploring the capabilities of French extreme-scale models.
For this first release, we focus on the pre-training and scaling calculations
underlying PAGnol. We fit a scaling law for compute for the French language,
and compare it with its English counterpart. We find the pre-training dataset
significantly conditions the quality of the outputs, with common datasets such
as OSCAR leading to low-quality offensive text. We evaluate our models on
discriminative and generative tasks in French, comparing to other
state-of-the-art French and multilingual models, and reaching the state of the
art in the abstractive summarization task. Our research was conducted on the
public GENCI Jean Zay supercomputer, and our models up to the Large size are made
publicly available.
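The scaling-law fit referenced in the abstract is not reproduced in this listing. As a rough, hypothetical illustration of the kind of calculation involved, the sketch below fits an assumed power law L(C) = a * C^(-b) relating validation loss to training compute and extrapolates it to a larger budget; the functional form, units, and all numbers are placeholders, not PAGnol's published fit.

```python
# Hypothetical illustration of fitting a compute scaling law L(C) = a * C**(-b).
# All values below are placeholders, not measurements from the PAGnol paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b):
    # Predicted validation loss as a function of training compute (assumed unit: PF-days).
    return a * compute ** (-b)

# Placeholder (compute, loss) pairs, as might come from small pilot runs.
compute = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
loss = np.array([4.1, 3.6, 3.2, 2.9, 2.7])

# Fit the two power-law parameters to the pilot measurements.
(a, b), _ = curve_fit(power_law, compute, loss, p0=(3.0, 0.1))

# Extrapolate to a larger, fixed compute budget (placeholder value).
budget = 30.0
print(f"L(C) ~= {a:.2f} * C^(-{b:.3f}); predicted loss at C={budget}: {power_law(budget, a, b):.2f}")
```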
Related papers
- MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French.
We compare 51 carefully selected embedding models on a large scale.
We find that although no single model is best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
arXiv Detail & Related papers (2024-05-30T20:34:37Z)
- PeLLE: Encoder-based language models for Brazilian Portuguese based on open data [0.40485107444088947]
We present PeLLE, a family of large language models based on the RoBERTa architecture, for Brazilian Portuguese, trained on curated, open data from the Carolina corpus.
We evaluate PeLLE models against a set of existing multilingual and PT-BR refined pretrained Transformer-based LLM encoders, contrasting the performance of large versus smaller-but-curated pretrained models on several downstream tasks.
arXiv Detail & Related papers (2024-02-29T14:34:03Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z)
- Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning [99.42850643947439]
We show that going beyond English-centric bitexts, coupled with a novel sampling strategy, substantially boosts performance across model sizes.
Our XY-LENT XL variant outperforms XLM-R XXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller, respectively.
arXiv Detail & Related papers (2022-10-26T17:16:52Z)
- Cedille: A large autoregressive French language model [0.21756081703276003]
We introduce Cedille, a large open source auto-regressive language model, specifically trained for the French language.
Our results show that Cedille outperforms existing French language models and is competitive with GPT-3 on a range of French zero-shot benchmarks.
arXiv Detail & Related papers (2022-02-07T17:40:43Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE by reusing models of almost half their sizes; a simplified sketch of this parameter-reuse idea follows this entry.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
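The following is an illustrative Net2Net-style width expansion, related in spirit to the parameter reuse described above but not bert2BERT's actual initialization scheme (the paper uses function-preserving and advanced knowledge initialization); the function name, layer sizes, and values are hypothetical.

```python
# Illustrative Net2Net-style width expansion: reuse a smaller layer's weights to
# initialize a wider layer. NOT bert2BERT's exact method; a simplified sketch only.
import numpy as np

def expand_linear_width(W, b, new_out, new_in, rng):
    """Expand a (out, in) weight matrix to (new_out, new_in).

    New input columns copy randomly chosen existing columns, and the copied
    weights are split by their replication count, so the layer's output is
    preserved when the new inputs duplicate the original activations. New
    output rows duplicate existing rows (and biases); the next layer would
    compensate for those duplicates in the same way."""
    old_out, old_in = W.shape
    col_map = np.concatenate([np.arange(old_in),
                              rng.integers(0, old_in, new_in - old_in)])
    counts = np.bincount(col_map, minlength=old_in)
    W_wide = W[:, col_map] / counts[col_map]          # (old_out, new_in)
    row_map = np.concatenate([np.arange(old_out),
                              rng.integers(0, old_out, new_out - old_out)])
    return W_wide[row_map, :], b[row_map]

rng = np.random.default_rng(0)
W_small, b_small = rng.normal(size=(4, 3)), rng.normal(size=4)
W_big, b_big = expand_linear_width(W_small, b_small, new_out=6, new_in=5, rng=rng)
print(W_big.shape, b_big.shape)  # (6, 5) (6,)
```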
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.