Impact of Tokenization on Language Models: An Analysis for Turkish
- URL: http://arxiv.org/abs/2204.08832v1
- Date: Tue, 19 Apr 2022 12:01:46 GMT
- Title: Impact of Tokenization on Language Models: An Analysis for Turkish
- Authors: Cagri Toraman, Eyup Halit Yilmaz, Furkan \c{S}ahinu\c{c}, Oguzhan
Ozcelik
- Abstract summary: We train tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus.
Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers.
We find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers.
- Score: 2.4660652494309936
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tokenization is an important text preprocessing step to prepare input tokens
for deep language models. WordPiece and BPE are de facto methods employed by
important models, such as BERT and GPT. However, the impact of tokenization can
be different for morphologically rich languages, such as Turkic languages,
where many words can be generated by adding prefixes and suffixes. We compare
five tokenizers at different granularity levels, i.e. their outputs vary from
smallest pieces of characters to the surface form of words, including a
Morphological-level tokenizer. We train these tokenizers and pretrain
medium-sized language models using RoBERTa pretraining procedure on the Turkish
split of the OSCAR corpus. We then fine-tune our models on six downstream
tasks. Our experiments, supported by statistical tests, reveal that
Morphological-level tokenizer has challenging performance with de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves
the performance of Morphological and Word-level tokenizers more than that of de
facto tokenizers. The ratio of the number of vocabulary parameters to the total
number of model parameters can be empirically chosen as 20% for de facto
tokenizers and 40% for other tokenizers to obtain a reasonable trade-off
between model size and performance.
Related papers
- Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 [4.779196219827507]
We capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5.
Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others.
arXiv Detail & Related papers (2024-10-15T14:14:19Z) - Rethinking Tokenization: Crafting Better Tokenizers for Large Language
Models [0.0]
Tokenization significantly influences language models(LMs)' performance.
This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types.
The Less-is-Better (LiB) model could be a new approach for LLM tokenizer.
arXiv Detail & Related papers (2024-03-01T10:03:07Z) - Tokenizer Choice For LLM Training: Negligible or Crucial? [30.33170936148845]
We study the influence of tokenizer choice on Large Language Models (LLMs) downstream performance by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English.
arXiv Detail & Related papers (2023-10-12T22:44:19Z) - CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z) - Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5
for Machine Translation [9.736284584478032]
We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited.
While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5.
We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.
arXiv Detail & Related papers (2023-02-28T00:50:19Z) - A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task
Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary$-$typically selected before training and permanently fixed later$-$affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.