How Good is Your Tokenizer? On the Monolingual Performance of
Multilingual Language Models
- URL: http://arxiv.org/abs/2012.15613v1
- Date: Thu, 31 Dec 2020 14:11:00 GMT
- Title: How Good is Your Tokenizer? On the Monolingual Performance of
Multilingual Language Models
- Authors: Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
- Abstract summary: We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
- Score: 96.32118305166412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we provide a systematic empirical comparison of
pretrained multilingual language models versus their monolingual counterparts
with regard to their monolingual task performance. We study a set of nine
typologically diverse languages with readily available pretrained monolingual
models on a set of five diverse monolingual downstream tasks. We first
establish whether a gap exists between the multilingual and the corresponding
monolingual representation of each language, and subsequently investigate the
reasons for any performance difference. To disentangle the impacting variables,
we train new monolingual models on the same data, but with different
tokenizers: the monolingual and the multilingual one. We find that while the
pretraining data size is an important factor, the designated tokenizer of the
monolingual model plays an equally important role in the downstream
performance. Our results show that languages which are adequately represented
in the multilingual model's vocabulary exhibit negligible performance decreases
over their monolingual counterparts. We further find that replacing the
original multilingual tokenizer with the specialized monolingual tokenizer
improves the downstream performance of the multilingual model for almost every
task and language.
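As a rough illustration of the kind of tokenizer comparison the abstract describes, the sketch below contrasts how finely a multilingual tokenizer and a monolingual tokenizer split the same sentence, using average subwords per word as a simple proxy for how well a language is covered by the vocabulary. This is a hedged, minimal sketch, not the paper's evaluation code: the model names, the example sentence, and the single-sentence shortcut are illustrative assumptions.

```python
# Minimal, illustrative sketch (not the paper's released code): compare how
# finely a multilingual and a monolingual tokenizer split the same sentence.
# Requires: pip install transformers
from transformers import AutoTokenizer

# Model names and the example sentence are assumptions for illustration only.
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

sentence = "Die Katze sitzt auf der Fensterbank."

for name, tok in [("multilingual", multilingual), ("monolingual", monolingual)]:
    pieces = tok.tokenize(sentence)
    words = sentence.split()
    # "Fertility" = average number of subword pieces per whitespace word;
    # lower values suggest the language is better covered by the vocabulary.
    fertility = len(pieces) / len(words)
    print(f"{name}: {pieces} -> fertility {fertility:.2f}")
```

In practice such a measure would be averaged over a large corpus; a markedly higher value for the multilingual tokenizer indicates the language is less well represented in its shared vocabulary.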
Related papers
- Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval [5.446052898856584]
This paper proposes a novel hybrid batch training strategy to improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings.
The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size.
arXiv Detail & Related papers (2024-08-20T04:30:26Z) - Multilingual BERT has an accent: Evaluating English influences on
fluency in multilingual models [23.62852626011989]
We show that grammatical structures in higher-resource languages bleed into lower-resource languages.
We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models.
arXiv Detail & Related papers (2022-10-11T17:06:38Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study multilingual language models to understand their capability and adaptability in the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z) - Multilingual Translation with Extensible Multilingual Pretraining and
Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z) - Mono vs Multilingual Transformer-based Models: a Comparison across
Several Language Tasks [1.2691047660244335]
BERT (Bidirectional Encoder Representations from Transformers) and ALBERT (A Lite BERT) are methods for pre-training language models.
We make available our trained BERT and ALBERT models for Portuguese.
arXiv Detail & Related papers (2020-07-19T19:13:20Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z) - XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.