Related papers: Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5

Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5

URL: http://arxiv.org/abs/2410.11627v2
Date: Mon, 21 Oct 2024 21:49:33 GMT
Title: Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
Authors: Thao Anh Dang, Limor Raviv, Lukas Galke,
Abstract summary: We capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5. Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others.
Score: 4.779196219827507
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Morphology is a crucial factor for multilingual language modeling as it poses direct challenges for tokenization. Here, we seek to understand how tokenization influences the morphological knowledge encoded in multilingual language models. Specifically, we capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5. The two models share the same architecture, training objective, and training data and only differ in their tokenization strategies: subword tokenization vs.\@ character-level tokenization. Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others and that morphological information is encoded in the middle and late layers. Finally, we show that languages with more irregularities benefit more from having a higher share of the pre-training data.

Related papers

Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages [15.203789021094982]
In large language models (LLMs), how are multiple languages learned and encoded? We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages.
arXiv Detail & Related papers (2025-01-10T21:18:21Z)
Why do language models perform worse for morphologically complex languages? [0.913127392774573]
We find new evidence for a performance gap between agglutinative and fusional languages. We propose three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement. Results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology.
arXiv Detail & Related papers (2024-11-21T15:06:51Z)
Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. We make the case that we should train future language models using multilingually fair subword tokenizers.
arXiv Detail & Related papers (2023-05-17T14:17:57Z)
Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer [4.554080966463776]
Multi-lingual language models (LM) have been remarkably successful in enabling natural language tasks in low-resource languages. We try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages. A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer.
arXiv Detail & Related papers (2022-12-04T07:22:21Z)
Causal Analysis of Syntactic Agreement Neurons in Multilingual Language Models [28.036233760742125]
We causally probe multilingual language models (XGLM and multilingual BERT) across various languages. We find significant neuron overlap across languages in autoregressive multilingual language models, but not masked language models.
arXiv Detail & Related papers (2022-10-25T20:43:36Z)
Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process. We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks. Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models [84.86942006830772]
We conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar. We conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe.
arXiv Detail & Related papers (2022-05-04T12:22:31Z)
Impact of Tokenization on Language Models: An Analysis for Turkish [2.4660652494309936]
We train tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers. We find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers.
arXiv Detail & Related papers (2022-04-19T12:01:46Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and name each group as a representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties [17.404220737977738]
We propose methods for probing sentence representations from state-of-the-art multilingual encoders. Our results show interesting differences in encoding linguistic variation associated with different pretraining strategies.
arXiv Detail & Related papers (2020-09-27T15:00:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.