Canonical and Surface Morphological Segmentation for Nguni Languages
- URL: http://arxiv.org/abs/2104.00767v1
- Date: Thu, 1 Apr 2021 21:06:51 GMT
- Title: Canonical and Surface Morphological Segmentation for Nguni Languages
- Authors: Tumi Moeng, Sheldon Reay, Aaron Daniels, Jan Buys
- Abstract summary: This paper investigates supervised and unsupervised models for morphological segmentation.
We train sequence-to-sequence models for canonical segmentation and Conditional Random Fields (CRF) for surface segmentation.
Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages.
We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
- Score: 6.805575417034369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Morphological Segmentation involves decomposing words into morphemes, the
smallest meaning-bearing units of language. This is an important NLP task for
morphologically-rich agglutinative languages such as the Southern African Nguni
language group. In this paper, we investigate supervised and unsupervised
models for two variants of morphological segmentation: canonical and surface
segmentation. We train sequence-to-sequence models for canonical segmentation,
where the underlying morphemes may not be equal to the surface form of the
word, and Conditional Random Fields (CRF) for surface segmentation.
Transformers outperform LSTMs with attention on canonical segmentation,
obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs
outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface
segmentation. In the unsupervised setting, an entropy-based approach using a
character-level LSTM language model fails to outperform a Morfessor baseline,
while on some of the languages neither approach performs much better than a
random baseline. We hope that the high performance of the supervised
segmentation models will help to facilitate the development of better NLP tools
for Nguni languages.
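To make the two framings in the abstract concrete: surface segmentation is typically cast as character-level sequence labeling (the task a feature-based CRF is trained on), while the unsupervised entropy-based approach proposes boundaries where a character language model becomes uncertain about the next character. The sketch below illustrates both. It is a minimal illustration, not the paper's implementation: the BMES label scheme, the toy feature template, the entropy threshold, and the isiZulu example are assumptions, and `next_char_dist` is a stand-in for a trained character-level LSTM language model.

```python
import math

def surface_labels(morphemes):
    """Turn a surface segmentation, e.g. ["ngi", "ya", "gijima"],
    into per-character labels: B(egin), M(iddle), E(nd), S(ingle).
    Surface segmentation then becomes a sequence labeling task."""
    labels = []
    for m in morphemes:
        if len(m) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return labels

def char_features(word, i):
    """Toy per-character feature dict of the kind a feature-based
    CRF consumes; real feature templates would be richer."""
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i < len(word) - 1 else "</s>",
    }

def entropy(dist):
    """Shannon entropy (bits) of a next-character distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_boundaries(word, next_char_dist, threshold=1.5):
    """Unsupervised criterion: propose a morpheme boundary after
    position i when the language model's predictive entropy given
    the prefix is high. `next_char_dist(prefix)` stands in for a
    trained character-level LSTM; the threshold is illustrative."""
    return [i + 1 for i in range(len(word) - 1)
            if entropy(next_char_dist(word[: i + 1])) > threshold]

if __name__ == "__main__":
    # Illustrative isiZulu example: ngi-ya-gijima ("I am running").
    morphemes = ["ngi", "ya", "gijima"]
    word = "".join(morphemes)
    print(list(zip(word, surface_labels(morphemes))))
    print([char_features(word, i) for i in range(len(word))][:2])

    # A uniform distribution over 4 characters has entropy 2.0 bits,
    # so this toy LM proposes a boundary at every position.
    toy_lm = lambda prefix: {c: 0.25 for c in "abcd"}
    print(entropy_boundaries("abcd", toy_lm))
```

Per-character feature dicts and label sequences in this shape are the inputs a feature-based CRF toolkit such as sklearn-crfsuite expects. Canonical segmentation, by contrast, is framed as character-level sequence-to-sequence transduction, since canonical morphemes can differ from the surface string.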
Related papers
- Labeled Morphological Segmentation with Semi-Markov Models [127.69031138022534]
We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks.
We additionally introduce a new hierarchy of morphotactic tagsets.
We develop a discriminative morphological segmentation system that explicitly models morphotactics.
arXiv Detail & Related papers (2024-04-13T12:51:53Z)
- A Truly Joint Neural Architecture for Segmentation and Parsing [15.866519123942457]
Parsing performance for Morphologically Rich Languages (MRLs) is lower than for other languages.
Due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance.
We introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological and syntactic parsing tasks at once.
arXiv Detail & Related papers (2024-02-04T16:56:08Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text [29.95141309131595]
We study the effectiveness of different segmentation approaches on machine translation (MT) performance.
We experiment on MT from code-switched Arabic-English to English.
We find that the choice of the segmentation setup to use for MT is highly dependent on the data size.
arXiv Detail & Related papers (2022-10-11T23:20:12Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected surface word forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages [38.5427201289742]
We investigate a variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages.
We compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation.
We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently.
arXiv Detail & Related papers (2022-03-16T21:27:20Z)
- A Masked Segmental Language Model for Unsupervised Natural Language Segmentation [12.6839867674222]
We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture.
In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese.
We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.
arXiv Detail & Related papers (2021-04-16T00:00:05Z)