Related papers: Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

URL: http://arxiv.org/abs/2404.13292v1
Date: Sat, 20 Apr 2024 06:49:15 GMT
Title: Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Authors: Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella
Abstract summary: We propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that alien tokenization leads to poorer generalizations.
Score: 10.721272718226848
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.
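The morphological-versus-alien distinction described in the abstract can be illustrated with a toy sketch. This is not the UniMorph Labeller itself: the boundary-subset criterion, the function names `boundaries` and `label_tokenization`, and the hand-written morpheme segmentation of "unhappiness" are all assumptions made for illustration only.

```python
def boundaries(segments):
    """Return the set of internal split offsets of a segmented word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def label_tokenization(subwords, morphemes):
    """Toy criterion (an assumption, not the paper's tool): a tokenization is
    'morphological' if every subword boundary coincides with a morpheme
    boundary, and 'alien' otherwise."""
    if "".join(subwords) != "".join(morphemes):
        raise ValueError("both segmentations must spell the same word")
    if boundaries(subwords) <= boundaries(morphemes):
        return "morphological"
    return "alien"


# Hypothetical gold segmentation, written by hand for this example.
gold = ["un", "happi", "ness"]

print(label_tokenization(["un", "happiness"], gold))
print(label_tokenization(["unh", "appiness"], gold))
```

Under this simplified criterion the first call is labelled morphological, because its only split falls on a morpheme boundary, while the second is labelled alien, because its split cuts through a morpheme. A word kept as a single token has no internal boundaries and would also count as morphological here, which is a simplification of this sketch.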

Related papers

Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay [4.061135251278187]
Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages. We present the first comprehensive, principled study of Turkish subword tokenization.
arXiv Detail & Related papers (2026-02-06T18:41:14Z)
Tokens with Meaning: A Hybrid Tokenization Approach for NLP [0.2826977330147589]
Tokenization plays a pivotal role in natural language processing (NLP). We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phono normalization, root-affix, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
arXiv Detail & Related papers (2025-08-19T22:17:42Z)
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
Comparative analysis of subword tokenization approaches for Indian languages [5.012314384895538]
Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. Subword tokenization enhances this process by breaking down words into smaller subword units. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair, and WordPiece Tokenization, affect ILs.
arXiv Detail & Related papers (2025-05-22T16:24:37Z)
Morphological evaluation of subwords vocabulary used by BETO language model [0.1638581561083717]
Subword tokenization algorithms are more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality. This evaluation helps clarify the algorithm used by the tokenizer, that is, Wordpiece, given the inconsistencies between the authors' claims
arXiv Detail & Related papers (2024-10-03T08:07:14Z)
Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present a method for obtaining subword embeddings grounded in a word embedding space. Third, we introduce an efficient segmentation algorithm based on a subword bigram model.
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
Greed is All You Need: An Evaluation of Tokenizer Inference Methods [4.300681074103876]
We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
arXiv Detail & Related papers (2024-03-02T19:01:40Z)
Analyzing Cognitive Plausibility of Subword Tokenization [9.510439539246846]
Subword tokenization has become the de-facto standard for tokenization. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization.
arXiv Detail & Related papers (2023-10-20T08:25:37Z)
Tokenization with Factorized Subword Encoding [2.538209532048867]
We propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model. Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
arXiv Detail & Related papers (2023-06-13T13:27:34Z)
Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task. We study three unsupervised approaches that rely on a masked language model. Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order. We propose Forced Invalidation to help preserve the importance of word order. Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z)
CCPrefix: Counterfactual Contrastive Prefix-Tuning for Many-Class Classification [57.62886091828512]
We propose a brand-new prefix-tuning method, Counterfactual Contrastive Prefix-tuning (CCPrefix) for many-class classification. Basically, an instance-dependent soft prefix, derived from fact-counterfactual pairs in the label space, is leveraged to complement the language verbalizers in many-class classification.
arXiv Detail & Related papers (2022-11-11T03:45:59Z)
Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context. Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ. We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.