SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
- URL: http://arxiv.org/abs/2601.04469v1
- Date: Thu, 08 Jan 2026 01:05:51 GMT
- Title: SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
- Authors: Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov
- Abstract summary: We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers. We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
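As a rough illustration of the evaluation described in the abstract (not the authors' implementation; the exact IPS formula is not reproduced here), the Python sketch below combines a hypothetical morpheme-coverage figure and a mean fragments-per-word figure into a single score per vocabulary size, then locates an "elbow point" of diminishing returns. All function names, the combination formula, and the toy numbers are our own assumptions.

```python
# Hedged sketch, NOT the official SampoNLP/IPS implementation.
# coverage  = fraction of lexicon morphemes kept as single tokens (higher is better)
# fragments = mean number of tokens per word (higher means more over-splitting)

def ips(coverage: float, fragments: float, alpha: float = 1.0) -> float:
    """Toy integrated score: reward coverage, penalise extra fragments per word."""
    return coverage / (1.0 + alpha * max(fragments - 1.0, 0.0))

def elbow_index(xs, ys):
    """Index with the largest gain over the straight line from first to last point."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord = lambda x: y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return max(range(len(xs)), key=lambda i: ys[i] - chord(xs[i]))

if __name__ == "__main__":
    # Toy numbers standing in for measurements at each vocabulary size.
    vocab_sizes = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000]
    coverage    = [0.41,  0.55,   0.66,   0.72,   0.75,    0.76]
    fragments   = [2.6,   2.2,    1.9,    1.7,    1.6,     1.55]

    scores = [ips(c, f) for c, f in zip(coverage, fragments)]
    k = elbow_index(vocab_sizes, scores)
    print(f"elbow at vocabulary size {vocab_sizes[k]}, score {scores[k]:.3f}")
```

In this toy setup the score keeps rising with vocabulary size but flattens, and the chord-distance rule picks the point where further growth stops paying off; the paper's actual IPS and elbow-detection procedure may differ.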
Related papers
- Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan [6.367163817135528]
We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction. We show that retrieval-augmented prompting provides substantial gains over random example selection. We also find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases.
arXiv Detail & Related papers (2026-03-01T05:03:11Z) - What Language is This? Ask Your Tokenizer [32.28976119949841]
Language Identification (LID) is an important component of many multilingual natural language processing pipelines. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm. Our formulation is data- and compute-efficient and supports incremental addition of new languages without retraining existing models.
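The snippet above only names the idea, so the following is a speculative toy illustration of a unigram-based LID scheme rather than the actual UniLID method: one token log-probability table per language, text labeled by the table that explains it best, and a new language added by simply adding a table. The class, data, and token lists are our own.

```python
# Hedged toy sketch, not the UniLID implementation described in the paper.
import math
from collections import Counter

class UnigramLID:
    def __init__(self):
        self.models = {}  # language -> {token: log probability}

    def add_language(self, lang, corpus_tokens):
        """Adding a language only adds a table; other languages are untouched."""
        counts = Counter(corpus_tokens)
        total = sum(counts.values())
        self.models[lang] = {t: math.log(c / total) for t, c in counts.items()}

    def score(self, lang, tokens, unk=-15.0):
        model = self.models[lang]
        return sum(model.get(t, unk) for t in tokens)

    def identify(self, tokens):
        return max(self.models, key=lambda lang: self.score(lang, tokens))

lid = UnigramLID()
lid.add_language("fi", "talo ssa on kissa ja koira".split())
lid.add_language("et", "maja s on kass ja koer".split())
print(lid.identify("kissa on talo ssa".split()))  # expected: "fi"
```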
arXiv Detail & Related papers (2026-02-19T18:58:39Z) - Corpus-Based Approaches to Igbo Diacritic Restoration [0.23552726065717702]
The capacity of computers to process natural language is increasing as NLP researchers push the field's boundaries. Over 95% of the world's 7,000 languages are low-resourced for NLP, i.e., they have little or no data, tools, or techniques for NLP work. We present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches for other languages.
arXiv Detail & Related papers (2026-01-26T11:30:36Z) - Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language [0.0]
Shona spaCy is an open-source computational morphological analysis tool for Shona, a Bantu language. It combines a lexicon with rules to model noun-class prefixes, verbal subjects, tense-aspect markers, ideophones, and clitics. It achieves 90% POS-tagging accuracy and 88% morphological-feature accuracy.
arXiv Detail & Related papers (2025-11-12T09:19:49Z) - MorphTok: Morphologically Grounded Tokenization for Indian Languages [18.594241501479747]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs). We propose morphology-aware segmentation as a pre-tokenization step before applying classical Byte-Pair Encoding (BPE). To handle the dependent vowels common in syllable-based writing systems, we propose Constrained BPE (CBPE). CBPE ensures that dependent vowels form a cohesive unit with other characters instead of occurring as standalone tokens.
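The constraint above is only summarized in one sentence, so the snippet below is a speculative toy reading rather than the authors' CBPE: dependent vowel signs (here a few Devanagari matras) are glued to the preceding character during pre-tokenization so they can never surface as standalone units. The character set and helper name are our own assumptions.

```python
# Hedged toy sketch of a "dependent vowels never stand alone" constraint,
# not the CBPE implementation from the paper.
DEPENDENT_VOWELS = set("\u093e\u093f\u0940\u0941\u0942\u0947\u0948\u094b\u094c")  # some Devanagari matras

def glue_dependent_vowels(text):
    """Group each dependent vowel sign with the character that precedes it."""
    units, buffer = [], ""
    for ch in text:
        if ch in DEPENDENT_VOWELS and buffer:
            buffer += ch          # matra joins the preceding consonant
        else:
            if buffer:
                units.append(buffer)
            buffer = ch
    if buffer:
        units.append(buffer)
    return units

# "दी" is consonant द plus dependent vowel sign ी; the matra never becomes its own unit.
print(glue_dependent_vowels("नदी"))  # ['न', 'दी'] under this toy constraint
```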
arXiv Detail & Related papers (2025-04-14T15:44:45Z) - Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language [1.788784870849724]
This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami.
The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami.
Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification.
arXiv Detail & Related papers (2024-11-04T19:41:16Z) - Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques.
Our results offer practical suggestions; for example, calibrating in the target language efficiently retains language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z) - DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on language varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z) - A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.