BERTIN: Efficient Pre-Training of a Spanish Language Model using
Perplexity Sampling
- URL: http://arxiv.org/abs/2207.06814v1
- Date: Thu, 14 Jul 2022 10:48:42 GMT
- Title: BERTIN: Efficient Pre-Training of a Spanish Language Model using
Perplexity Sampling
- Authors: Javier de la Rosa, Eduardo G. Ponferrada, Paulo Villegas, Pablo
Gonzalez de Prado Salas, Manu Romero, María Grandury
- Abstract summary: Common Crawl might contain enough noise to make pre-training sub-optimal.
We present a novel data-centric technique, perplexity sampling, which enables the pre-training of language models in roughly half the number of steps.
Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The pre-training of large language models usually requires massive amounts of
resources, both in terms of computation and data. Frequently used web sources
such as Common Crawl might contain enough noise to make this pre-training
sub-optimal. In this work, we experiment with different sampling methods from
the Spanish version of mC4, and present a novel data-centric technique which we
name perplexity sampling that enables the pre-training of language
models in roughly half the number of steps and using one fifth of the data. The
resulting models are comparable to the current state-of-the-art, and even
achieve better results for certain tasks. Our work is proof of the versatility
of Transformers, and paves the way for small teams to train their models on a
limited budget. Our models are available at
https://huggingface.co/bertin-project.
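The abstract does not spell out the sampling procedure, but the core idea can be illustrated with a short sketch: score each mC4-es document with a separate language model (for example an n-gram model such as KenLM) and keep it with a probability that favors mid-range perplexities, discarding both near-duplicate boilerplate (very low perplexity) and noisy web text (very high perplexity). The Gaussian-shaped weighting, the `center`/`width` values, and the `perplexity` callable below are illustrative assumptions, not the paper's exact configuration.

```python
import math
import random
from typing import Callable, Iterable, Iterator

def gaussian_weight(ppl: float, center: float, width: float) -> float:
    # Illustrative weighting: documents whose perplexity lies near `center`
    # are kept with high probability; very clean (low-ppl) boilerplate and
    # very noisy (high-ppl) text are mostly discarded.
    return math.exp(-((ppl - center) ** 2) / (2.0 * width ** 2))

def perplexity_sample(
    docs: Iterable[str],
    perplexity: Callable[[str], float],  # e.g. a per-document perplexity from an n-gram LM
    center: float = 2000.0,              # hypothetical values; tune on the corpus at hand
    width: float = 1000.0,
    seed: int = 0,
) -> Iterator[str]:
    """Stream documents and keep each one with a probability given by its
    perplexity weight, so only the retained subset ever reaches tokenization."""
    rng = random.Random(seed)
    for text in docs:
        if rng.random() < gaussian_weight(perplexity(text), center, width):
            yield text
```

Because the filter is applied on the fly, only the retained fraction of the corpus needs to be tokenized and seen during training, which is what makes the approach attractive for small budgets.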
Related papers
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing [9.338266891598973]
CLASP generates synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters).
We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English.
arXiv Detail & Related papers (2022-10-13T15:01:03Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing smaller models of roughly half their size (a generic sketch of the idea follows below).
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
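The bert2BERT entry above only names the idea of reusing a smaller model's weights. The sketch below shows the classic Net2Net-style, function-preserving width expansion that this line of work builds on; it is a generic illustration, not the paper's actual FPI/AKI initialization, and `widen_layer`, the bias-free linear layers, and the chosen widths are assumptions for illustration.

```python
import numpy as np

def widen_layer(W_in: np.ndarray, W_out: np.ndarray, new_width: int, seed: int = 0):
    """Function-preserving width expansion in the Net2Net style.

    W_in maps inputs to `old_width` hidden units (shape: d_in x old_width);
    W_out consumes those units (shape: old_width x d_out). Extra units are
    copies of randomly chosen existing units, and each copied unit's outgoing
    weights are split among its replicas so the two-layer block computes the
    same function as before (biases are omitted for brevity).
    """
    old_width = W_in.shape[1]
    assert new_width >= old_width
    rng = np.random.default_rng(seed)
    # First old_width columns map to themselves; the new ones copy random originals.
    mapping = np.concatenate([np.arange(old_width),
                              rng.integers(0, old_width, new_width - old_width)])
    counts = np.bincount(mapping, minlength=old_width)      # replicas per original unit
    W_in_new = W_in[:, mapping]                              # duplicate incoming weights
    W_out_new = W_out[mapping, :] / counts[mapping, None]    # split outgoing weights
    return W_in_new, W_out_new
```

A quick check that the expansion is function-preserving: for random `x`, `x @ W_in @ W_out` and `x @ W_in_new @ W_out_new` agree to numerical precision, and the same holds with an elementwise nonlinearity between the layers, since duplicated units receive identical pre-activations. Pre-training then continues from the widened model instead of from scratch.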
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
- Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets [19.855120632909124]
We introduce different semantic models for Amharic.
Models are built using word2Vec embeddings, distributional thesaurus (DT), contextual embeddings, and DT embeddings.
We find that newly trained models perform better than pre-trained multilingual models.
arXiv Detail & Related papers (2020-11-02T17:48:25Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking (sketched below).
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
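Dynamic Blocking is mentioned only by name in the paraphrasing entry above. The sketch below shows one plausible reading of the idea: discourage the decoder from copying the source verbatim by blocking the source token that immediately follows the token it has just emitted. The token-level interface, the `block_prob` value, and the exact blocking rule are assumptions; the cited paper's algorithm may differ in detail.

```python
import random

def dynamic_block_set(source_tokens, generated_tokens, block_prob=0.5, seed=0):
    """Return the set of tokens to exclude at the next decoding step.

    If the most recently generated token also occurs in the source sequence,
    the token that immediately follows it in the source is blocked with
    probability `block_prob`, so the model must paraphrase rather than copy.
    """
    rng = random.Random(seed)
    blocked = set()
    if not generated_tokens:
        return blocked
    last = generated_tokens[-1]
    for i, tok in enumerate(source_tokens[:-1]):
        if tok == last and rng.random() < block_prob:
            blocked.add(source_tokens[i + 1])
    return blocked
```

At each decoding step, the returned set would be applied by setting the corresponding logits to minus infinity before sampling the next token.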
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction starting from nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning [70.81910984985683]
We propose an effective way to fine-tune multiple downstream generation tasks simultaneously using a single, large pre-trained model.
The experiments on five diverse language generation tasks show that by using only an additional 2-3% of parameters for each task, our model can maintain or even improve the performance of fine-tuning the whole model (a generic adapter-style sketch follows below).
arXiv Detail & Related papers (2020-04-08T06:18:44Z)
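The last entry reports matching full fine-tuning while training only an extra 2-3% of parameters per task, which is characteristic of adapter-style modules. The class below is a generic residual bottleneck adapter sketch (PyTorch), not necessarily the exact module used in that paper; the `bottleneck=64` width and the GELU activation are illustrative choices.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: a small residual MLP inserted after a frozen
    Transformer sub-layer. Only the adapter (a few percent of the model's
    parameters) is trained per task; the pre-trained weights stay fixed."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pre-trained representation intact.
        return hidden + self.up(self.act(self.down(hidden)))
```

In use, one such module would follow each frozen Transformer sub-layer, and only the adapter (and typically layer-norm) parameters would receive gradients for each task.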
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.