Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages
- URL: http://arxiv.org/abs/2305.17179v1
- Date: Fri, 26 May 2023 18:06:49 GMT
- Title: Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages
- Authors: Tomasz Limisiewicz, Jiří Balhar, and David Mareček
- Abstract summary: We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream tasks.
- Score: 3.716965622352967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual language models have recently gained attention as a promising
solution for representing multiple languages in a single model. In this paper,
we propose new criteria to evaluate the quality of lexical representation and
vocabulary overlap observed in sub-word tokenizers. Our findings show that the
overlap of vocabulary across languages can actually be detrimental to certain
downstream tasks (POS, dependency tree labeling). In contrast, NER and
sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing
vocabulary. We also observe that the coverage of the language-specific tokens
in the multilingual vocabulary significantly impacts the word-level tasks. Our
study offers a deeper understanding of the role of tokenizers in multilingual
language models and guidelines for future model developers to choose the most
suitable tokenizer for their specific application before undertaking costly
model pre-training.
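To make the two notions concrete, the snippet below is a minimal sketch of how one might measure vocabulary allocation (how much of the shared vocabulary a language actually uses) and vocabulary overlap (how much two languages draw on the same tokens) with an off-the-shelf multilingual tokenizer. The XLM-R tokenizer and the tiny per-language samples are illustrative assumptions, and the metrics are simple proxies rather than the exact criteria defined in the paper.

```python
# Illustrative proxies for "vocabulary allocation" and "vocabulary overlap".
# Assumes the Hugging Face `transformers` package; the model name and the tiny
# per-language samples below are placeholders, not the paper's setup.
from transformers import AutoTokenizer

def used_vocab(tokenizer, sentences):
    """Set of vocabulary ids the tokenizer actually emits for these sentences."""
    ids = set()
    for sentence in sentences:
        ids.update(tokenizer(sentence, add_special_tokens=False)["input_ids"])
    return ids

def vocab_allocation(tokenizer, sentences):
    """Fraction of the full vocabulary exercised by one language's sample."""
    return len(used_vocab(tokenizer, sentences)) / tokenizer.vocab_size

def vocab_overlap(tokenizer, sentences_a, sentences_b):
    """Jaccard overlap between the token inventories two languages use."""
    a = used_vocab(tokenizer, sentences_a)
    b = used_vocab(tokenizer, sentences_b)
    return len(a & b) / len(a | b)

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
english = ["The cat sat on the mat.", "Tokenization matters for modeling."]
czech = ["Kočka seděla na rohožce.", "Na tokenizaci při modelování záleží."]
print(vocab_allocation(tok, english), vocab_allocation(tok, czech))
print(vocab_overlap(tok, english, czech))
```

On realistic corpora the samples would be thousands of sentences per language, but the structure of the measurement stays the same.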
Related papers
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training [59.571632468137075]
We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity.
We propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language (a rough sketch of the general idea follows this entry).
Since enlarging the vocabulary slows down pre-training, we also propose k-NN-based target sampling to accelerate the expensive softmax.
arXiv Detail & Related papers (2021-09-15T14:04:16Z)
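The sketch below illustrates the general idea of giving each language its own share of a fixed subword budget. It allocates pieces in proportion to smoothed corpus sizes, a common heuristic rather than the VoCap criterion itself, and the corpus counts are invented for illustration.

```python
# A simple per-language vocabulary budgeting heuristic: share a total subword
# budget in proportion to corpus sizes raised to a smoothing exponent, so that
# low-resource languages are not starved. This illustrates the general idea
# only; it is not the paper's VoCap algorithm, and the corpus counts are made up.

def allocate_vocab(corpus_tokens, total_budget=250_000, alpha=0.5):
    """Return a per-language vocabulary size summing to roughly `total_budget`.
    alpha < 1 flattens the distribution toward low-resource languages."""
    smoothed = {lang: count ** alpha for lang, count in corpus_tokens.items()}
    normalizer = sum(smoothed.values())
    return {lang: round(total_budget * weight / normalizer)
            for lang, weight in smoothed.items()}

corpus_tokens = {"en": 3_000_000_000, "cs": 80_000_000, "sw": 5_000_000}
print(allocate_vocab(corpus_tokens))
```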
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund (a minimal clustering sketch follows this entry).
Experiments on cross-lingual benchmarks show significant improvements over strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
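As a rough illustration of that clustering step, the sketch below mean-pools encoder states into one vector per language and groups the vectors with k-means. The encoder, the sample sentences, and the cluster count are placeholder choices, not the paper's actual setup.

```python
# Group languages by the similarity of their representations in a multilingual
# encoder: embed a few sentences per language, mean-pool them into one vector
# per language, and cluster the vectors. Illustrative only.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def language_vector(sentences):
    """Mean-pooled hidden states averaged over a language's sample sentences."""
    with torch.no_grad():
        batch = tok(sentences, padding=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state       # (batch, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding positions
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # (batch, dim)
    return pooled.mean(0).numpy()

samples = {
    "en": ["Where is the train station?", "I would like a coffee."],
    "de": ["Wo ist der Bahnhof?", "Ich hätte gern einen Kaffee."],
    "ja": ["駅はどこですか。", "コーヒーをください。"],
}
vectors = np.stack([language_vector(sents) for sents in samples.values()])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(dict(zip(samples, labels)))
```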
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo (Adversarial and Multilingual Meaning in Context), a cross-lingual and multilingual evaluation set.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts (a generic adaptation sketch follows this entry).
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
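One common, generic way to realize this kind of adaptation with standard tooling is to add tokens for the unseen script to the tokenizer, grow the embedding matrix, and train only the new embeddings on target-language text. The sketch below uses that generic Hugging Face recipe with made-up tokens; it is not the paper's specific method.

```python
# Generic adaptation recipe: add new-script tokens, resize the embedding matrix,
# and fine-tune only the (randomly initialized) embedding table. Illustrative
# sketch with invented tokens, not the paper's exact method.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical subwords in a script the pretrained vocabulary does not cover.
new_tokens = ["ᚠᚢᚦ", "ᚨᚱᚲ"]
num_added = tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(tok))  # new rows are randomly initialized

# Freeze everything except the word-embedding table before fine-tuning on
# target-language text (the fine-tuning loop itself is omitted here).
for name, param in model.named_parameters():
    param.requires_grad = "word_embeddings" in name

print(f"added {num_added} tokens; vocabulary now has {len(tok)} entries")
```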
- Improving Multilingual Models with Language-Clustered Vocabularies [8.587129426070979]
We introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters (a sketch of the combination step follows this entry).
Our experiments show improvements across languages on key multilingual benchmark tasks.
arXiv Detail & Related papers (2020-10-24T04:49:15Z)
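A minimal sketch of that combination step, under generic assumptions: train one SentencePiece vocabulary per language cluster and merge the resulting pieces into a single multilingual vocabulary. The corpus file names, cluster names, and sizes are placeholders, and the merge is a plain union rather than the paper's exact procedure.

```python
# Train one subword vocabulary per language cluster, then merge the pieces into
# a single multilingual vocabulary. Illustrative only: file names, cluster names,
# and vocabulary sizes are placeholders, and the merge is a simple union.
import sentencepiece as spm

clusters = {"latin": "latin.txt", "cyrillic": "cyrillic.txt", "cjk": "cjk.txt"}
for name, corpus in clusters.items():
    spm.SentencePieceTrainer.train(
        input=corpus, model_prefix=f"spm_{name}", vocab_size=32_000
    )

merged = {}
for name in clusters:
    # SentencePiece writes a tab-separated "<piece>\t<score>" vocabulary file.
    with open(f"spm_{name}.vocab", encoding="utf-8") as vocab_file:
        for line in vocab_file:
            piece, score = line.rstrip("\n").split("\t")
            # Keep the best score seen for pieces shared across clusters.
            merged[piece] = max(float(score), merged.get(piece, float("-inf")))

print(f"merged vocabulary has {len(merged)} pieces")
```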
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend joint training of acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR (Scalable Multilingual Aligned Language Representation) is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages whose word order differs.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.