Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
- URL: http://arxiv.org/abs/2403.13754v1
- Date: Wed, 20 Mar 2024 17:01:56 GMT
- Title: Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
- Authors: Catherine Arnett, Pamela D. Rivière, Tyler A. Chang, Sean Trott
- Abstract summary: We investigate how different tokenization schemes impact number agreement in Spanish plurals.
We find that morphologically-aligned tokenization performs similarly to other tokenization schemes.
Our results indicate that morphological tokenization is not strictly required for performance.
- Score: 5.223020867766102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The relationship between language model tokenization and performance is an open area of research. Here, we investigate how different tokenization schemes impact number agreement in Spanish plurals. We find that morphologically-aligned tokenization performs similarly to other tokenization schemes, even when induced artificially for words that would not be tokenized that way during training. We then present exploratory analyses demonstrating that language model embeddings for different plural tokenizations have similar distributions along the embedding space axis that maximally distinguishes singular and plural nouns. Our results suggest that morphologically-aligned tokenization is a viable tokenization approach, and existing models already generalize some morphological patterns to new items. However, our results indicate that morphological tokenization is not strictly required for performance.
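As a concrete illustration of the behavioral probe described in the abstract, the sketch below scores number agreement by comparing the log-probability a causal LM assigns to a plural versus a singular verb after a plural subject. This is a minimal sketch, not the authors' released code; the Spanish checkpoint name and the example sentence are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Spanish causal LM checkpoint; any comparable model can be swapped in.
MODEL = "PlanTL-GOB-ES/gpt2-base-bne"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx = tokenizer(context, return_tensors="pt").input_ids
    cont = tokenizer(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(torch.cat([ctx, cont], dim=1)).logits
    log_probs = logits.log_softmax(dim=-1)
    total = 0.0
    for i in range(cont.shape[1]):
        # The token at absolute position ctx_len + i is predicted from position ctx_len + i - 1.
        total += log_probs[0, ctx.shape[1] + i - 1, cont[0, i]].item()
    return total

context = "Las casas de mi abuela"                         # plural subject
plural = continuation_logprob(context, " son antiguas")    # agreeing continuation
singular = continuation_logprob(context, " es antigua")    # number-violating continuation
print("prefers plural agreement:", plural > singular)
```

The paper's embedding analysis could be approximated in the same setup by taking the difference between the mean singular-noun and mean plural-noun embeddings as a number axis and projecting differently tokenized plurals onto it.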
Related papers
- Why do language models perform worse for morphologically complex languages? [0.913127392774573]
We find new evidence for a performance gap between agglutinative and fusional languages.
We propose three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement.
Results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology.
arXiv Detail & Related papers (2024-11-21T15:06:51Z) - Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 [4.779196219827507]
We capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5.
Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that the models learn the morphological systems of some languages better than others (a segmentation-contrast sketch appears after this list).
arXiv Detail & Related papers (2024-10-15T14:14:19Z) - Tokenization with Factorized Subword Encoding [2.538209532048867]
We propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model.
Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
arXiv Detail & Related papers (2023-06-13T13:27:34Z) - Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Even character-level and byte-level models exhibit more than a fourfold difference in encoding length for some language pairs (see the token-count sketch after this list).
We make the case that we should train future language models using multilingually fair subword tokenizers.
arXiv Detail & Related papers (2023-05-17T14:17:57Z) - VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose VECO 2.0, a cross-lingual pre-trained model based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance (a minimal alignment-loss sketch appears after this list).
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - Impact of Tokenization on Language Models: An Analysis for Turkish [2.4660652494309936]
We train tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus.
Our experiments, supported by statistical tests, reveal that the morphological-level tokenizer performs competitively with the de facto tokenizers.
We find that increasing the vocabulary size improves the performance of morphological- and word-level tokenizers more than that of de facto tokenizers (see the fertility sketch after this list).
arXiv Detail & Related papers (2022-04-19T12:01:46Z) - Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations for verifying major grammatical phenomena (a gradient-based saliency sketch appears after this list).
arXiv Detail & Related papers (2022-02-21T18:32:24Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens yield topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models (see the collocation-merging sketch after this list).
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment [17.995905582226463]
We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish.
English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences.
arXiv Detail & Related papers (2020-05-01T01:21:47Z)
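The sketches below, referenced from the entries above, are minimal illustrations under stated assumptions, not the papers' released code.

Segmentation-contrast sketch (mT5/ByT5 entry): contrasting how a subword tokenizer and a byte-level tokenizer segment the same morphologically complex words makes the tokenization difference concrete. The checkpoints are the public mT5/ByT5 releases; the example words are arbitrary.

```python
from transformers import AutoTokenizer

mt5 = AutoTokenizer.from_pretrained("google/mt5-small")    # SentencePiece subwords
byt5 = AutoTokenizer.from_pretrained("google/byt5-small")  # raw UTF-8 bytes

# Arbitrary morphologically complex examples (Spanish, Turkish, German).
for word in ["casas", "evlerimizden", "Unabhängigkeitserklärung"]:
    subwords = mt5.tokenize(word)
    n_bytes = len(byt5(word, add_special_tokens=False).input_ids)
    print(f"{word}: mT5 -> {subwords}, ByT5 -> {n_bytes} byte tokens")
```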
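Token-count sketch (tokenizer-unfairness entry): the encoding-length disparity can be estimated by counting tokens for translations of the same sentence under one shared multilingual tokenizer. The sentences here are illustrative stand-ins for a parallel corpus.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # one shared multilingual vocabulary

# Illustrative stand-ins for a parallel corpus (same meaning in each language).
parallel = {
    "en": "The weather is very nice today.",
    "es": "El tiempo está muy agradable hoy.",
    "el": "Ο καιρός είναι πολύ ωραίος σήμερα.",
}
base = len(tok(parallel["en"], add_special_tokens=False).input_ids)
for lang, sentence in parallel.items():
    n = len(tok(sentence, add_special_tokens=False).input_ids)
    print(f"{lang}: {n} tokens, premium vs. English = {n / base:.2f}x")
```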
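Alignment-loss sketch (VECO 2.0 entry): a generic InfoNCE-style loss in the spirit of the sequence-to-sequence alignment described above, which pulls parallel pairs together and pushes non-parallel pairs apart. This sketches only the sequence-level term, not the token-to-token alignment or the full model.

```python
import torch
import torch.nn.functional as F

def sequence_alignment_loss(src: torch.Tensor, tgt: torch.Tensor, temperature: float = 0.05):
    """src, tgt: (batch, dim) pooled encodings of parallel sentence pairs.
    Row i of src matches row i of tgt; every other row serves as a negative."""
    src = F.normalize(src, dim=-1)
    tgt = F.normalize(tgt, dim=-1)
    logits = src @ tgt.T / temperature        # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0))        # the diagonal holds the parallel pairs
    return F.cross_entropy(logits, labels)

# Toy usage with random encodings in place of a real encoder's output.
loss = sequence_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```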
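Fertility sketch (Turkish tokenization entry): the vocabulary-size effect can be explored by training BPE tokenizers at several sizes and comparing fertility (tokens per whitespace word). The toy corpus stands in for the Turkish OSCAR split, and plain BPE stands in for the paper's full tokenizer suite.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy stand-in for the Turkish OSCAR split used in the paper.
corpus = [
    "evlerimizden geliyorum",
    "kitaplarımı okudum",
    "arkadaşlarımızla konuştuk",
] * 100

def fertility(vocab_size: int) -> float:
    """Average number of BPE tokens per whitespace word at a given vocabulary size."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    n_tokens = sum(len(tok.encode(s).tokens) for s in corpus)
    n_words = sum(len(s.split()) for s in corpus)
    return n_tokens / n_words

for size in (60, 120, 300):
    print(size, round(fertility(size), 2))
```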
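Saliency sketch (contrastive-explanations entry): one simple contrastive attribution differentiates the logit gap between a target token and a foil with respect to the input embeddings (input-times-gradient). This illustrates the contrastive idea and is not necessarily the paper's exact estimator; the agreement sentence is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The keys to the cabinet"
target, foil = " are", " is"  # grammatical vs. number-violating continuation
ids = tok(text, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds).logits[0, -1]
t_id = tok(target, add_special_tokens=False).input_ids[0]
f_id = tok(foil, add_special_tokens=False).input_ids[0]
(logits[t_id] - logits[f_id]).backward()  # differentiate the contrastive logit gap

saliency = (embeds.grad[0] * embeds[0].detach()).sum(-1)  # input-times-gradient per token
for token, score in zip(tok.convert_ids_to_tokens(ids[0].tolist()), saliency.tolist()):
    print(f"{token:>10s} {score:+.3f}")
```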
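Collocation-merging sketch (collocation-LDA entry): merging frequent collocations into single tokens before topic modeling can be approximated with gensim's Phrases, used here as a stand-in for the paper's merging strategies; the corpus is a toy.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases

# Toy corpus; real use would feed full tokenized documents.
docs = [
    "new york city subway delays".split(),
    "machine learning models need data".split(),
    "new york weather machine learning".split(),
] * 20

bigram = Phrases(docs, min_count=2, threshold=1.0)  # learn frequent bigrams
merged = [bigram[d] for d in docs]                  # e.g. ["new", "york"] -> ["new_york"]

dictionary = Dictionary(merged)
corpus = [dictionary.doc2bow(d) for d in merged]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])
```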