Retrofitting Large Language Models with Dynamic Tokenization
- URL: http://arxiv.org/abs/2411.18553v2
- Date: Sat, 14 Dec 2024 23:43:54 GMT
- Title: Retrofitting Large Language Models with Dynamic Tokenization
- Authors: Darius Feher, Ivan Vulić, Benjamin Minixhofer,
- Abstract summary: We propose retrofitting current language models with dynamic tokenization.
We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly.
We find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages.
- Score: 3.608780819053423
- License:
- Abstract: Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the static design and propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text via a subword-merging algorithm inspired by byte-pair encoding. We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. For encoder-style models (e.g., XLM-R), this on average reduces token sequence lengths by >20% across 14 languages while degrading performance by less than 2%. The same method applied to pre-filling and scoring in decoder-style models (e.g., Mistral-7B; evaluated on English) results in minimal performance degradation at up to 6% reduction in sequence length. Overall, we find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages, enabling more equitable and adaptable LMs.
Related papers
- Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process.
We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing.
We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z) - Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair.
Match (BPE) and MaxPiece (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based tokenization to reduce over-segmentation via adaptive gradient-based subword tokenization.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two languages to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework: $textbfO$ne $textbfF$or $textbfA$ll, which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z) - VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via the thesaurus dictionary from the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Efficient Transformers with Dynamic Token Pooling [11.28381882347617]
We equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion.
Results demonstrate that dynamic pooling, which jointly segments and models language, is both faster and more accurate than vanilla Transformers.
arXiv Detail & Related papers (2022-11-17T18:39:23Z) - Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word
Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.