Language Model Tokenizers Introduce Unfairness Between Languages
- URL: http://arxiv.org/abs/2305.15425v2
- Date: Fri, 20 Oct 2023 10:09:55 GMT
- Title: Language Model Tokenizers Introduce Unfairness Between Languages
- Authors: Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, Adel Bibi
- Abstract summary: We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs.
We make the case that we should train future language models using multilingually fair subword tokenizers.
- Score: 98.92630681729518
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent language models have shown impressive multilingual performance, even
when not explicitly trained for it. Despite this, there are concerns about the
quality of their outputs across different languages. In this paper, we show how
disparity in the treatment of different languages arises at the tokenization
stage, well before a model is even invoked. The same text translated into
different languages can have drastically different tokenization lengths, with
differences up to 15 times in some cases. These disparities persist even for
tokenizers that are intentionally trained for multilingual support.
Character-level and byte-level models also exhibit over 4 times the difference
in the encoding length for some language pairs. This induces unfair treatment
for some language communities in regard to the cost of accessing commercial
language services, the processing time and latency, as well as the amount of
content that can be provided as context to the models. Therefore, we make the
case that we should train future language models using multilingually fair
subword tokenizers.
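The disparity the abstract describes can be observed directly with off-the-shelf tooling. Below is a minimal sketch, not the authors' exact methodology: it encodes rough translations of a single sentence with one multilingual subword tokenizer and reports each language's token count relative to English. The model name (xlm-roberta-base) and the example translations are illustrative choices.

```python
# Minimal sketch: compare how many tokens the "same" sentence needs in
# different languages under one multilingual subword tokenizer.
# Requires the `transformers` package; the model and translations are
# illustrative, not the paper's exact experimental setup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Rough translations of one short sentence (for illustration only).
samples = {
    "English": "Language models should treat all languages fairly.",
    "German": "Sprachmodelle sollten alle Sprachen fair behandeln.",
    "Greek": "Τα γλωσσικά μοντέλα πρέπει να αντιμετωπίζουν όλες τις γλώσσες δίκαια.",
    "Chinese": "语言模型应当公平地对待所有语言。",
}

# Token count per language, excluding special tokens such as <s> and </s>.
lengths = {
    lang: len(tokenizer.encode(text, add_special_tokens=False))
    for lang, text in samples.items()
}

baseline = lengths["English"]
for lang, n in sorted(lengths.items(), key=lambda kv: kv[1]):
    # Ratio > 1 means this language pays a longer-encoding "premium" vs. English.
    print(f"{lang:<8} {n:>3} tokens  ({n / baseline:.2f}x English)")
```

The same loop can be repeated over several tokenizers, or over a byte-level encoding such as raw UTF-8 length, to see how the ratios shift with the choice of encoding.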
Related papers
- The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
Our experiments show that community-centered models are better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
- Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Are Multilingual Models Effective in Code-Switching? [57.78477547424949]
We study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting.
Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching.
arXiv Detail & Related papers (2021-03-24T16:20:02Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models [2.457872341625575]
Transfer learning based on pretraining language models on large amounts of raw data has become the new norm for reaching state-of-the-art performance in NLP.
We show that the behavior of such models varies widely across unseen languages.
arXiv Detail & Related papers (2020-10-24T10:15:03Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)