Canonical and Surface Morphological Segmentation for Nguni Languages
- URL: http://arxiv.org/abs/2104.00767v1
- Date: Thu, 1 Apr 2021 21:06:51 GMT
- Title: Canonical and Surface Morphological Segmentation for Nguni Languages
- Authors: Tumi Moeng, Sheldon Reay, Aaron Daniels, Jan Buys
- Abstract summary: This paper investigates supervised and unsupervised models for morphological segmentation.
We train sequence-to-sequence models for canonical segmentation and Conditional Random Fields (CRF) for surface segmentation.
Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages.
We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
- Score: 6.805575417034369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Morphological Segmentation involves decomposing words into morphemes, the
smallest meaning-bearing units of language. This is an important NLP task for
morphologically-rich agglutinative languages such as the Southern African Nguni
language group. In this paper, we investigate supervised and unsupervised
models for two variants of morphological segmentation: canonical and surface
segmentation. We train sequence-to-sequence models for canonical segmentation,
where the underlying morphemes may not be equal to the surface form of the
word, and Conditional Random Fields (CRF) for surface segmentation.
Transformers outperform LSTMs with attention on canonical segmentation,
obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs
outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface
segmentation. In the unsupervised setting, an entropy-based approach using a
character-level LSTM language model fails to outperform a Morfessor baseline,
while on some of the languages neither approach performs much better than a
random baseline. We hope that the high performance of the supervised
segmentation models will help to facilitate the development of better NLP tools
for Nguni languages.
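To make the two framings in the abstract concrete: surface segmentation is typically cast as character-level sequence labeling (the task a feature-based CRF is trained on), while the unsupervised entropy-based approach proposes boundaries where a character language model becomes uncertain about the next character. The sketch below illustrates both. It is a minimal illustration, not the paper's implementation: the BMES label scheme, the toy feature template, the entropy threshold, and the isiZulu example are assumptions, and `next_char_dist` is a stand-in for a trained character-level LSTM language model.

```python
import math

def surface_labels(morphemes):
    """Turn a surface segmentation, e.g. ["ngi", "ya", "gijima"],
    into per-character labels: B(egin), M(iddle), E(nd), S(ingle).
    Surface segmentation then becomes a sequence labeling task."""
    labels = []
    for m in morphemes:
        if len(m) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(m) - 2) + ["E"])
    return labels

def char_features(word, i):
    """Toy per-character feature dict of the kind a feature-based
    CRF consumes; real feature templates would be richer."""
    return {
        "char": word[i],
        "prev": word[i - 1] if i > 0 else "<s>",
        "next": word[i + 1] if i < len(word) - 1 else "</s>",
    }

def entropy(dist):
    """Shannon entropy (bits) of a next-character distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_boundaries(word, next_char_dist, threshold=1.5):
    """Unsupervised criterion: propose a morpheme boundary after
    position i when the language model's predictive entropy given
    the prefix is high. `next_char_dist(prefix)` stands in for a
    trained character-level LSTM; the threshold is illustrative."""
    return [i + 1 for i in range(len(word) - 1)
            if entropy(next_char_dist(word[: i + 1])) > threshold]

if __name__ == "__main__":
    # Illustrative isiZulu example: ngi-ya-gijima ("I am running").
    morphemes = ["ngi", "ya", "gijima"]
    word = "".join(morphemes)
    print(list(zip(word, surface_labels(morphemes))))
    print([char_features(word, i) for i in range(len(word))][:2])

    # A uniform distribution over 4 characters has entropy 2.0 bits,
    # so this toy LM proposes a boundary at every position.
    toy_lm = lambda prefix: {c: 0.25 for c in "abcd"}
    print(entropy_boundaries("abcd", toy_lm))
```

Per-character feature dicts and label sequences in this shape are the inputs a feature-based CRF toolkit such as sklearn-crfsuite expects. Canonical segmentation, by contrast, is framed as character-level sequence-to-sequence transduction, since canonical morphemes can differ from the surface string.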
Related papers
- Labeled Morphological Segmentation with Semi-Markov Models [127.69031138022534]
We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks.
We additionally introduce a new hierarchy of morphotactic tagsets.
We develop a discriminative morphological segmentation system that explicitly models morphotactics.
arXiv Detail & Related papers (2024-04-13T12:51:53Z)
- A Truly Joint Neural Architecture for Segmentation and Parsing [15.866519123942457]
Parsing performance for Morphologically Rich Languages (MRLs) is lower than for other languages.
Due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance.
We introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological and syntactic parsing tasks at once.
arXiv Detail & Related papers (2024-02-04T16:56:08Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text [29.95141309131595]
We study the effectiveness of different segmentation approaches on machine translation (MT) performance.
We experiment on MT from code-switched Arabic-English to English.
We find that the choice of the segmentation setup to use for MT is highly dependent on the data size.
arXiv Detail & Related papers (2022-10-11T23:20:12Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected surface word forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages [38.5427201289742]
We investigate a variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages.
We compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation.
We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently.
arXiv Detail & Related papers (2022-03-16T21:27:20Z)
- A Masked Segmental Language Model for Unsupervised Natural Language Segmentation [12.6839867674222]
We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture.
In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese.
We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.
arXiv Detail & Related papers (2021-04-16T00:00:05Z)