Subword Segmental Language Modelling for Nguni Languages
- URL: http://arxiv.org/abs/2210.06525v1
- Date: Wed, 12 Oct 2022 18:41:00 GMT
- Title: Subword Segmental Language Modelling for Nguni Languages
- Authors: Francois Meyer and Jan Buys
- Abstract summary: The subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
- Score: 7.252933737829635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subwords have become the standard units of text in NLP, enabling efficient
open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword
segmentation is viewed as a preprocessing step applied to the corpus before
training. This can lead to sub-optimal segmentations for low-resource languages
with complex morphologies. We propose a subword segmental language model (SSLM)
that learns how to segment words while being trained for autoregressive
language modelling. By unifying subword segmentation and language modelling,
our model learns subwords that optimise LM performance. We train our model on
the 4 Nguni languages of South Africa. These are low-resource agglutinative
languages, so subword information is critical. As an LM, SSLM outperforms
existing approaches such as BPE-based models on average across the 4 languages.
Furthermore, it outperforms standard subword segmenters on unsupervised
morphological segmentation. We also train our model as a word-level sequence
model, resulting in an unsupervised morphological segmenter that outperforms
existing methods by a large margin for all 4 languages. Our results show that
learning subword segmentation is an effective alternative to existing subword
segmenters, enabling the model to discover morpheme-like subwords that improve
its LM capabilities.
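To make the idea of unifying segmentation and language modelling concrete, below is a minimal sketch (an assumption-laden illustration, not the authors' released code) of the forward dynamic program a segmental language model uses to marginalise over all possible subword segmentations of a character string. The callback `seg_logprob(i, j)` is a hypothetical stand-in for the neural model's log-probability of the segment x[i:j] conditioned on the prefix x[:i].

```python
# Minimal sketch of the segmental-LM forward dynamic program (assumption,
# not the paper's implementation): log p(x) is obtained by summing, in log
# space, over every way of cutting x into subword segments.
import math
from typing import Callable

def sequence_logprob(n_chars: int,
                     seg_logprob: Callable[[int, int], float],
                     max_seg_len: int = 10) -> float:
    """Return log p(x) marginalised over all segmentations of x[0:n_chars].

    seg_logprob(i, j) is a hypothetical callback giving the model's
    log-probability of segment x[i:j] given the prefix x[:i]; in an
    SSLM-style model this would come from the neural LM.
    """
    # alpha[j] = log-probability mass of all segmentations of the prefix x[:j]
    alpha = [-math.inf] * (n_chars + 1)
    alpha[0] = 0.0
    for j in range(1, n_chars + 1):
        # every segment x[i:j] of length <= max_seg_len can end at position j
        scores = [alpha[i] + seg_logprob(i, j)
                  for i in range(max(0, j - max_seg_len), j)]
        m = max(scores)
        alpha[j] = m + math.log(sum(math.exp(s - m) for s in scores))  # logsumexp
    return alpha[n_chars]
```

Replacing the log-sum with a max over the same lattice (Viterbi decoding) recovers the single highest-scoring segmentation, which is how a model trained this way can double as the unsupervised morphological segmenter described in the abstract.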
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization, which reduces the over-segmentation these languages and scripts suffer from.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z)
- Effects of sub-word segmentation on performance of transformer language models [0.628122931748758]
We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
arXiv Detail & Related papers (2023-05-09T14:30:29Z)
- Breaking Character: Are Subwords Good Enough for MRLs After All? [36.11778282905458]
We pretrain a BERT-style language model over character sequences instead of word-pieces.
We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs.
Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks, subword-based PLMs achieve significantly higher performance on semantic tasks.
arXiv Detail & Related papers (2022-04-10T18:54:43Z)
- A Masked Segmental Language Model for Unsupervised Natural Language Segmentation [12.6839867674222]
We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture.
In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese.
We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.
arXiv Detail & Related papers (2021-04-16T00:00:05Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
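As a hedged illustration of the BPE vs. unigram-LM comparison in the last entry above: SentencePiece implements both segmenters, so they can be trained on the same corpus and contrasted on a single word. The corpus path, vocabulary size, and example word below are placeholders, not the paper's actual experimental setup.

```python
# Hedged illustration: SentencePiece supports both BPE and unigram-LM
# tokenization, so the two segmenters can be compared on the same data.
# "corpus.txt", the vocab size, and the example word are placeholders.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",              # one sentence per line
        model_prefix=f"nguni_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="nguni_bpe.model")
uni = spm.SentencePieceProcessor(model_file="nguni_unigram.model")

word = "ngiyakuthanda"  # isiZulu: "I love you"
print("BPE:    ", bpe.encode(word, out_type=str))
print("Unigram:", uni.encode(word, out_type=str))
```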
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.