MonoByte: A Pool of Monolingual Byte-level Language Models
- URL: http://arxiv.org/abs/2209.11035v1
- Date: Thu, 22 Sep 2022 14:32:48 GMT
- Title: MonoByte: A Pool of Monolingual Byte-level Language Models
- Authors: Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo
Nogueira
- Abstract summary: We release 10 monolingual byte-level models rigorously pretrained under the same configuration.
Because they are tokenizer-free, the problem of unseen token embeddings is eliminated.
Experiments on QA and NLI tasks show that our monolingual models achieve performance competitive with the multilingual one.
- Score: 4.491765479948667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The zero-shot cross-lingual ability of models pretrained on multilingual and
even monolingual corpora has spurred many hypotheses to explain this intriguing
empirical result. However, due to the costs of pretraining, most research uses
public models whose pretraining methodology, such as the choice of
tokenization, corpus size, and computational budget, might differ drastically.
When researchers pretrain their own models, they often do so under a
constrained budget, and the resulting models might underperform significantly
compared to SOTA models. These experimental differences led to various
inconsistent conclusions about the nature of the cross-lingual ability of these
models. To help further research on the topic, we released 10 monolingual
byte-level models rigorously pretrained under the same configuration with a
large compute budget (equivalent to 420 days on a V100) and corpora that are 4
times larger than the original BERT's. Because they are tokenizer-free, the
problem of unseen token embeddings is eliminated, thus allowing researchers to
try a wider range of cross-lingual experiments in languages with different
scripts. Additionally, we release two models pretrained on non-natural language
texts that can be used in sanity-check experiments. Experiments on QA and NLI
tasks show that our monolingual models achieve performance competitive with the
multilingual one, and hence can serve to strengthen our understanding of
cross-lingual transferability in language models.
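The tokenizer-free property claimed above is easy to see concretely: a byte-level model consumes raw UTF-8 bytes, so text in any script maps onto the same fixed set of 256 byte embeddings plus a handful of special tokens, and no input can ever fall outside the embedding table. The sketch below is a minimal illustration of that property, assuming a Hugging Face ByT5-style byte-level tokenizer; the google/byt5-small checkpoint is used only as a stand-in for the released models, whose exact identifiers are not given in this abstract.

```python
# Minimal sketch of byte-level encoding, assuming a ByT5-style tokenizer.
# google/byt5-small is a stand-in checkpoint for illustration only; the
# MonoByte checkpoints are released separately by the authors.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

samples = [
    "cross-lingual transfer",   # Latin script
    "перенос между языками",    # Cyrillic script
    "言語間転移",               # Japanese script
]

for text in samples:
    ids = tokenizer(text).input_ids
    # Each id is either a special token or one of the 256 byte embeddings,
    # so the embedding table covers any script by construction.
    print(f"{len(text.encode('utf-8'))} bytes -> {len(ids)} ids, max id {max(ids)}")
```

A subword vocabulary trained on a single language, by contrast, typically has no embeddings for characters of other scripts, which is exactly the unseen-token-embedding problem the abstract says byte-level models avoid.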
Related papers
- Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size and sampling ratios.
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
- Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques.
Our results offer practical suggestions; for example, calibrating in the target language can efficiently retain language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z)
- Do Multilingual Large Language Models Mitigate Stereotype Bias? [9.31741279000585]
This study systematically trains six LLMs of identical size and architecture in English, German, French, Italian, and Spanish.
We observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models.
arXiv Detail & Related papers (2024-07-08T08:46:50Z)
- Language Versatilists vs. Specialists: An Empirical Revisiting on Multilingual Transfer Ability [11.000499414131324]
We conduct experiments across four types of reasoning tasks.
We find that the multilingual pretrained model does not always outperform an English-centric model.
English appears to be a less suitable source language, and the choice of source language becomes less important when the English-centric model scales up.
arXiv Detail & Related papers (2023-06-11T14:03:09Z)
- Causal Analysis of Syntactic Agreement Neurons in Multilingual Language Models [28.036233760742125]
We causally probe multilingual language models (XGLM and multilingual BERT) across various languages.
We find significant neuron overlap across languages in autoregressive multilingual language models, but not masked language models.
arXiv Detail & Related papers (2022-10-25T20:43:36Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency [62.0887259003594]
This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency.
Experiments on nine downstream tasks show several counter-intuitive phenomena.
We present Dynamic Sparsification, a simple approach that allows training the model once and adapting to different model sizes at inference.
arXiv Detail & Related papers (2022-04-06T06:29:52Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
- Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.