The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
- URL: http://arxiv.org/abs/2511.09197v2
- Date: Wed, 19 Nov 2025 09:30:00 GMT
- Title: The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
- Authors: Francois Meyer, Jan Buys
- Abstract summary: We extend the subword segmental language model (SSLM) to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability.
- Score: 11.09360259927697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
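One of the metrics the abstract tracks, fertility, is the average number of subword tokens a segmenter emits per word. The sketch below computes it with a toy greedy longest-match segmenter; the segmenter and the example vocabulary are hypothetical stand-ins for illustration, not the paper's SSLM.

```python
# Sketch: subword fertility = average number of subword tokens
# emitted per whitespace-delimited word. The greedy longest-match
# segmenter is a toy stand-in for a learned segmenter such as an SSLM.

def greedy_segment(word, vocab, max_len=8):
    """Greedy longest-match segmentation; characters not covered by
    the vocabulary fall back to single-character subwords."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def fertility(corpus, vocab):
    """Average subwords per word over a corpus of sentences."""
    n_words = n_subwords = 0
    for sentence in corpus:
        for word in sentence.split():
            n_words += 1
            n_subwords += len(greedy_segment(word, vocab))
    return n_subwords / n_words

# Hypothetical morpheme-like vocabulary for an isiXhosa-style example.
vocab = {"ndi", "ya", "thanda", "uku", "funda"}
corpus = ["ndiyathanda ukufunda"]
print(fertility(corpus, vocab))  # 5 subwords over 2 words -> 2.5
```

A conjunctive language like isi-Xhosa tends to yield high fertility under a fixed tokenizer, which is one reason a segmenter learned jointly with the language model can help.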
Related papers
- False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models [53.01170039144264]
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? We find that models with overlap outperform models with disjoint vocabularies.
arXiv Detail & Related papers (2025-09-23T07:47:54Z) - BabyLM's First Words: Word Segmentation as a Phonological Probing Task [2.335764524038488]
We show how word segmentation can be used as a phonological probing task. We study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages.
arXiv Detail & Related papers (2025-04-04T10:42:56Z) - Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
Subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z) - Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Morphological Disambiguation from Stemming Data [1.2183405753834562]
Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis.
We learn to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing.
Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.
arXiv Detail & Related papers (2020-11-11T01:44:09Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z) - Comparison of Turkish Word Representations Trained on Different Morphological Forms [0.0]
This study prepares texts in morphologically different forms of a morphologically rich language, Turkish.
We trained word2vec models on texts in which lemmas and suffixes are treated differently.
We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modelling tasks.
arXiv Detail & Related papers (2020-02-13T10:09:31Z) - Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation [8.87546236839959]
We propose a morphological word segmentation method on the source side for neural machine translation (NMT).
It incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
arXiv Detail & Related papers (2020-01-02T10:05:02Z) - A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.