MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
- URL: http://arxiv.org/abs/2502.00894v1
- Date: Sun, 02 Feb 2025 20:06:39 GMT
- Title: MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
- Authors: Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri,
- Abstract summary: Tokenization is fundamental to Natural Language Processing (NLP).
MorphBPE is a morphology-aware extension of Byte Pair Encoding (BPE).
It integrates linguistic structure into subword tokenization while preserving statistical efficiency.
- Score: 0.8029049649310211
- Abstract: Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, contributing to LLM training convergence, and (ii) Morphological Edit Distance, which measures alignment between morphemes and tokens concerning interpretability. Experiments on English, Russian, Hungarian, and Arabic across 300M and 1B parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: https://github.com/llm-lab-org/MorphBPE and https://tokenizer.llm-lab.org
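The abstract names the core mechanism (morpheme-aware merging) and two metrics but gives no pseudocode. Below is a minimal, hypothetical Python sketch of one plausible reading: standard BPE whose candidate pairs are counted only inside gold morphemes (so no merge crosses a morpheme boundary), together with a Levenshtein reading of the Morphological Edit Distance between a tokenizer's output and a gold morpheme segmentation. Function names, the corpus format, and the exact distance definition are assumptions, not the authors' implementation.

```python
from collections import Counter

def train_morph_bpe(corpus, num_merges):
    """Hypothetical MorphBPE-style training: ordinary BPE pair counting,
    except pairs are only counted *inside* morphemes, so no learned merge
    can ever cross a morpheme boundary.

    corpus: list of words, each word a list of gold morphemes,
            e.g. [["un", "happi", "ness"], ["happi", "er"]].
    """
    # Each word is a tuple of morphemes; each morpheme a tuple of symbols.
    words = Counter(tuple(tuple(m) for m in w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for morpheme in word:                         # boundary-aware:
                for pair in zip(morpheme, morpheme[1:]):  # pairs never span morphemes
                    pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            new_words[tuple(_apply(m, best) for m in word)] += freq
        words = new_words
    return merges

def _apply(morpheme, pair):
    """Merge every adjacent occurrence of `pair` within one morpheme."""
    out, i = [], 0
    while i < len(morpheme):
        if i + 1 < len(morpheme) and (morpheme[i], morpheme[i + 1]) == pair:
            out.append(morpheme[i] + morpheme[i + 1])
            i += 2
        else:
            out.append(morpheme[i])
            i += 1
    return tuple(out)

def morphological_edit_distance(tokens, morphemes):
    """One assumed reading of the paper's Morphological Edit Distance:
    Levenshtein distance between a word's token sequence and its gold
    morpheme sequence (0 = tokens and morphemes align exactly)."""
    m, n = len(tokens), len(morphemes)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if tokens[i - 1] == morphemes[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # delete a token
                             dist[i][j - 1] + 1,          # insert a morpheme
                             dist[i - 1][j - 1] + cost)   # substitute
    return dist[m][n]

print(morphological_edit_distance(["un", "happiness"], ["un", "happi", "ness"]))  # 2
```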
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose adaptive gradient-based subword tokenization to reduce over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- A Truly Joint Neural Architecture for Segmentation and Parsing [15.866519123942457]
The performance of NLP models on Morphologically Rich Languages (MRLs) is lower than on other languages.
Due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance.
We introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological and syntactic parsing tasks at once.
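To make "lattice-based representation" concrete, here is a hypothetical Python sketch (not the paper's data structure): each edge covers a span of the space-delimited token and carries one candidate morphological analysis, so all ambiguous segmentations coexist until the parser commits to one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatticeEdge:
    start: int     # offset where this analysis begins
    end: int       # offset where it ends
    form: str      # surface slice covered by the edge
    analysis: str  # one candidate morphological reading

# Invented example: a 5-character token whose first character may be a
# separate definite-article morpheme, giving two coexisting paths.
lattice = [
    LatticeEdge(0, 5, "hbyit", "NOUN"),   # read as a single word
    LatticeEdge(0, 1, "h", "DET"),        # or: article ...
    LatticeEdge(1, 5, "byit", "NOUN"),    # ... plus noun
]

def paths(lattice, start, goal):
    """Enumerate every complete segmentation (path from start to goal)."""
    if start == goal:
        yield []
        return
    for edge in lattice:
        if edge.start == start:
            for rest in paths(lattice, edge.end, goal):
                yield [edge] + rest

for p in paths(lattice, 0, 5):
    print(" + ".join(f"{e.form}/{e.analysis}" for e in p))
# hbyit/NOUN
# h/DET + byit/NOUN
```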
arXiv Detail & Related papers (2024-02-04T16:56:08Z)
- MorphPiece: A Linguistic Tokenizer for Large Language Models [3.8073142980733]
I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text.
A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks.
arXiv Detail & Related papers (2023-07-14T10:35:04Z)
- Effects of sub-word segmentation on performance of transformer language models [0.628122931748758]
We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
arXiv Detail & Related papers (2023-05-09T14:30:29Z)
- UniMorph 4.0: Universal Morphology [104.69846084893298]
This paper presents the expansions and improvements made on several fronts over the last couple of years.
Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages.
In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages.
arXiv Detail & Related papers (2022-05-07T09:19:02Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
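As a rough illustration of the layer pattern described above, here is a hedged PyTorch sketch: a grouped convolution handles local interactions and self-attention handles global ones, each on its own residual branch. Dimensions, module ordering, and kernel size are assumptions, not the paper's exact GroupBERT layer.

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    """Hypothetical GroupBERT-style block: grouped convolution for local
    interactions, complemented by self-attention for global ones."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=7, groups=16):
        super().__init__()
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=groups)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                     # x: (batch, seq, d_model)
        h = self.conv_norm(x)                 # local mixing along the sequence
        x = x + self.conv(h.transpose(1, 2)).transpose(1, 2)
        h = self.attn_norm(x)                 # global mixing via attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x

out = ConvAttentionBlock()(torch.randn(2, 10, 256))  # smoke test: (2, 10, 256)
```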
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference [71.57324258813674]
The ability of Transformers to perform a variety of tasks with precision, such as question answering, Natural Language Inference (NLI), or summarization, has enabled them to be ranked among the best paradigms for addressing such tasks at present.
NLI is one of the best scenarios for testing these architectures, due to the knowledge required to understand complex sentences and to establish a relation between a hypothesis and a premise.
In this paper, we propose a new architecture, a siamese multilingual transformer, to efficiently align multilingual embeddings for Natural Language Inference.
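The summary names the architecture but not its wiring. One common siamese recipe, sketched below under explicit assumptions (the shared toy encoder and the [u; v; |u - v|] pairing are illustrative choices, not necessarily SML's design), encodes premise and hypothesis with a single shared encoder and classifies a combination of the two embeddings.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for a shared multilingual sentence encoder (illustrative)."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, ids):                  # ids: (batch, seq)
        return self.emb(ids).mean(dim=1)     # mean-pooled: (batch, dim)

class SiameseNLI(nn.Module):
    """Hypothetical siamese setup: the SAME encoder weights embed both
    sentences; a linear head classifies their combined representation."""
    def __init__(self, encoder, dim, num_labels=3):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(3 * dim, num_labels)
    def forward(self, premise, hypothesis):
        u = self.encoder(premise)
        v = self.encoder(hypothesis)
        feats = torch.cat([u, v, (u - v).abs()], dim=-1)
        return self.head(feats)              # entailment / neutral / contradiction

model = SiameseNLI(ToyEncoder(), dim=64)
logits = model(torch.randint(0, 1000, (2, 8)), torch.randint(0, 1000, (2, 8)))
```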
arXiv Detail & Related papers (2021-03-17T13:23:53Z)
- Enhancing deep neural networks with morphological information [0.0]
We analyse the effect of adding morphological features to LSTM and BERT models.
Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task.
arXiv Detail & Related papers (2020-11-24T22:35:44Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
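For contrast with BPE's greedy merge list, here is a minimal sketch of unigram LM tokenization as commonly described: a Viterbi search for the segmentation with the highest total log-probability under per-piece scores. The toy vocabulary and its scores are invented for illustration.

```python
import math

def unigram_tokenize(word, logprob):
    """best[i] = max over j < i of best[j] + logprob(word[j:i])."""
    n = len(word)
    best = [0.0] + [-math.inf] * n   # best score of a segmentation of word[:i]
    back = [0] * (n + 1)             # split point that achieved best[i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logprob and best[j] + logprob[piece] > best[i]:
                best[i] = best[j] + logprob[piece]
                back[i] = j
    if best[n] == -math.inf:
        raise ValueError("word cannot be covered by the vocabulary")
    pieces, i = [], n                # walk back through the split points
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Invented vocabulary with made-up log-probabilities:
vocab = {"un": -2.0, "happi": -3.0, "ness": -2.5, "u": -6.0, "n": -5.0,
         "h": -6.0, "a": -5.0, "p": -6.0, "i": -6.0, "e": -5.0, "s": -5.0}
print(unigram_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```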
arXiv Detail & Related papers (2020-04-07T21:21:06Z)