Tokenizer Choice For LLM Training: Negligible or Crucial?
- URL: http://arxiv.org/abs/2310.08754v4
- Date: Sun, 17 Mar 2024 14:17:22 GMT
- Title: Tokenizer Choice For LLM Training: Negligible or Crucial?
- Authors: Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Schulze Buschhoff, Charvi Jain, Alexander Arno Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, Stefan Kesselheim, Nicolas Flores-Herr,
- Abstract summary: We study the influence of tokenizer choice on Large Language Models (LLMs) downstream performance by training 24 mono- and multilingual LLMs.
We find that the tokenizer choice can significantly impact the model's downstream performance and training costs.
We show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English.
- Score: 30.33170936148845
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for the model's downstream performance. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases of factor three in comparison to English. While English-centric tokenizers have been applied to the training of multi-lingual LLMs in the past, we find that this approach results in a severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
Related papers
- Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models [0.0]
The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively.
We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts.
Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts.
arXiv Detail & Related papers (2024-10-16T19:34:34Z) - Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Analyzing and Adapting Large Language Models for Few-Shot Multilingual
NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z) - An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference [38.1823640848362]
State-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data.
Recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English.
arXiv Detail & Related papers (2024-02-16T14:15:15Z) - Scaling Laws for Multilingual Neural Machine Translation [45.620062316968976]
We study how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior.
We find that changing the weightings of the individual language pairs in the training mixture only affect the multiplicative factor of the scaling law.
We leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale.
arXiv Detail & Related papers (2023-02-19T18:43:24Z) - Bridging the Gap between Language Models and Cross-Lingual Sequence
Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel
arXiv Detail & Related papers (2022-04-11T15:55:20Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading
Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets at recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.