Neural Morphological Tagging for Nguni Languages
- URL: http://arxiv.org/abs/2505.12949v1
- Date: Mon, 19 May 2025 10:41:47 GMT
- Title: Neural Morphological Tagging for Nguni Languages
- Authors: Cael Marquard, Simbarashe Mawere, Francois Meyer
- Abstract summary: A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages.
- Score: 2.812898346527047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.
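To make the pipeline framing concrete, below is a minimal sketch of the tagging stage in Python, assuming an upstream segmenter has already split each word into morphemes. The BiLSTM labeller, tag-set size, and example segmentation are illustrative assumptions, not the paper's exact models or data.

```python
# Sketch only: a BiLSTM morphological tagger consuming segmenter output.
# Model sizes, the tag set, and the vocabulary are invented for illustration.
import torch
import torch.nn as nn

class MorphemeTagger(nn.Module):
    """BiLSTM sequence labeller: predicts one grammatical tag per morpheme."""
    def __init__(self, vocab_size, tagset_size, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, tagset_size)

    def forward(self, morpheme_ids):              # (batch, seq_len)
        h, _ = self.lstm(self.emb(morpheme_ids))
        return self.out(h)                        # (batch, seq_len, n_tags)

# Assumed upstream segmenter output: isiZulu "ngiyahamba" ("I am going"),
# segmented as ngi (subject concord) + ya (present tense) + hamba (verb root).
morphemes = ["ngi", "ya", "hamba"]
vocab = {m: i for i, m in enumerate(morphemes)}
tagger = MorphemeTagger(vocab_size=len(vocab), tagset_size=5)
ids = torch.tensor([[vocab[m] for m in morphemes]])
pred_tags = tagger(ids).argmax(-1)   # one predicted tag id per morpheme
```

A neural CRF variant would replace the per-morpheme softmax with a CRF layer that scores whole tag sequences; the pipeline shape stays the same.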
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
We train and evaluate neural networks directly as binary classifiers of strings.
We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures.
Our contributions will facilitate theoretically sound empirical testing of language recognition claims in future work.
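As a rough, hedged sketch of this setup: train a small recurrent network as a binary recognizer of a formal language, here membership in the regular language (ab)*. The architecture and toy data below are assumptions; the paper itself evaluates three architectures on languages across the Chomsky hierarchy.

```python
# Sketch only: an RNN trained as a binary recognizer of the language (ab)*.
import torch
import torch.nn as nn

class StringRecognizer(nn.Module):
    def __init__(self, n_symbols=2, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(n_symbols, 16)
        self.rnn = nn.RNN(16, hidden, batch_first=True)
        self.clf = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (1, seq_len) symbol ids
        _, h = self.rnn(self.emb(x))
        return self.clf(h[-1]).squeeze(-1)    # logit: string in language?

def label(s):                                 # gold membership in (ab)*
    return float(s == "ab" * (len(s) // 2) and len(s) % 2 == 0)

data = ["ab", "abab", "ababab", "aab", "ba", "bb"]
sym = {"a": 0, "b": 1}
xs = [torch.tensor([[sym[c] for c in s]]) for s in data]
ys = torch.tensor([label(s) for s in data])

model = StringRecognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):                          # tiny training loop
    loss = loss_fn(torch.cat([model(x) for x in xs]), ys)
    opt.zero_grad(); loss.backward(); opt.step()
```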
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- A Morphology-Based Investigation of Positional Encodings [46.667985003225496]
Morphology and word order are closely linked, with the latter incorporated into transformer-based models through positional encodings.
This prompts a fundamental inquiry: Is there a correlation between the morphological complexity of a language and the utilization of positional encoding in pre-trained language models?
In pursuit of an answer, we present the first study addressing this question, encompassing 22 languages and 5 downstream tasks.
arXiv Detail & Related papers (2024-04-06T07:10:47Z)
- A Truly Joint Neural Architecture for Segmentation and Parsing [15.866519123942457]
Parsing performance on Morphologically Rich Languages (MRLs) is lower than on other languages.
Due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance.
We introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological and syntactic parsing tasks at once.
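As a rough illustration of the lattice idea (not the paper's implementation), each ambiguous token expands into alternative analyses represented as arcs between positions, and the parser scores complete paths. The Hebrew-style ambiguity below, bcl as "onion" versus b+cl as "in (the) shadow", is a standard example from the MRL parsing literature.

```python
# Sketch only: a morphological lattice as a list of arcs
# (start_state, end_state, surface_form, tag); invented toy analyses.
lattice = [
    (0, 2, "bcl", "NOUN"),   # one analysis: a single noun
    (0, 1, "b",   "ADP"),    # alternative: preposition ...
    (1, 2, "cl",  "NOUN"),   # ... followed by a noun
]

def paths(lattice, state=0, goal=2):
    """Enumerate complete analyses (paths from start state to goal)."""
    if state == goal:
        yield []
        return
    for (s, e, form, tag) in lattice:
        if s == state:
            for rest in paths(lattice, e, goal):
                yield [(form, tag)] + rest

for analysis in paths(lattice):
    print(analysis)   # the joint model scores these jointly with the syntax
```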
arXiv Detail & Related papers (2024-02-04T16:56:08Z)
- Is neural language acquisition similar to natural? A chronological probing study [0.0515648410037406]
We present a chronological probing study of English transformer models such as MultiBERT and T5.
We compare the linguistic information the models acquire over the course of training on their corpora.
The results show that 1) linguistic information is acquired in the early stages of training, and 2) both language models are able to capture features from various levels of language.
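A minimal sketch of chronological probing, under assumptions: the same linear probe is fit on frozen hidden states from checkpoints saved at increasing training steps, and its accuracy is tracked over time. The extract_states function below is a synthetic stand-in for running a saved checkpoint over a probing dataset.

```python
# Sketch only: probe accuracy as a function of training step.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def extract_states(step, n=500, dim=64):
    """Synthetic stand-in: hidden states grow more label-correlated as
    training progresses, saturating early (as the findings suggest)."""
    y = rng.integers(0, 2, n)
    signal = min(step / 1000, 1.0)
    x = rng.normal(size=(n, dim))
    x[:, 0] += signal * (2 * y - 1)       # inject label information
    return x, y

for step in [100, 500, 1000, 5000, 20000]:
    x, y = extract_states(step)
    xtr, xte, ytr, yte = train_test_split(x, y, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(xtr, ytr).score(xte, yte)
    print(f"checkpoint @ step {step}: probe accuracy = {acc:.2f}")
```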
arXiv Detail & Related papers (2022-07-01T17:24:11Z)
- Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models [84.86942006830772]
We conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar.
We conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe.
arXiv Detail & Related papers (2022-05-04T12:22:31Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Dependency-based Mixture Language Models [53.152011258252315]
We introduce Dependency-based Mixture Language Models.
We first train neural language models with a novel dependency modeling objective.
We then formulate the next-token probability by mixing the previously learned dependency modeling distribution with self-attention.
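Conceptually, the mixing step is a convex combination of two next-token distributions. A toy numerical sketch, where the weight and both component distributions are placeholders rather than the paper's learned parameterization:

```python
# Sketch only: next-token probability as a mixture of a dependency-based
# distribution and the self-attention LM distribution.
import numpy as np

vocab = ["the", "dog", "barks", "loudly"]
p_attention  = np.array([0.10, 0.20, 0.60, 0.10])   # base LM distribution
p_dependency = np.array([0.05, 0.05, 0.80, 0.10])   # dependency model
lam = 0.3                                           # mixture weight (assumed)

p_next = lam * p_dependency + (1 - lam) * p_attention
assert np.isclose(p_next.sum(), 1.0)                # still a distribution
print(dict(zip(vocab, p_next.round(3))))
```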
arXiv Detail & Related papers (2022-03-19T06:28:30Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
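A minimal sketch of bitext retrieval as an alignment measure, assuming sentence embeddings for a parallel corpus are already available: each source-language embedding retrieves its nearest target-language neighbour by cosine similarity, and the score is the fraction of correct matches.

```python
# Sketch only: bitext retrieval accuracy from precomputed embeddings.
import numpy as np

def retrieval_accuracy(src_emb, tgt_emb):
    # cosine similarity = dot product of L2-normalised rows
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = (s @ t.T).argmax(axis=1)     # retrieved index per source
    return (nearest == np.arange(len(s))).mean()

rng = np.random.default_rng(0)
tgt = rng.normal(size=(100, 32))
src = tgt + 0.1 * rng.normal(size=tgt.shape)   # noisy stand-in "translations"
print(f"retrieval accuracy: {retrieval_accuracy(src, tgt):.2f}")
```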
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Comparison of Turkish Word Representations Trained on Different Morphological Forms [0.0]
This study prepared texts in morphologically different forms of Turkish, a morphologically rich language.
We trained word2vec models on texts in which lemmas and suffixes were treated differently.
We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modelling tasks.
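A hedged sketch of this comparison using gensim: train word2vec on surface forms and on a lemma+suffix tokenisation of the same sentences, plus a subword-aware fastText model. The tiny Turkish corpus and its segmentation below are illustrative only.

```python
# Sketch only: word2vec on two morphological preparations of a corpus,
# plus fastText; the toy sentences and segmentation are invented.
from gensim.models import Word2Vec, FastText

surface = [["evlerimizde", "oturuyoruz"], ["evde", "kaldık"]]
# the same sentences with lemmas and suffixes as separate tokens
split = [["ev", "+ler", "+imiz", "+de", "otur", "+uyor", "+uz"],
         ["ev", "+de", "kal", "+dık"]]

w2v_surface = Word2Vec(surface, vector_size=50, min_count=1, epochs=50)
w2v_split   = Word2Vec(split,   vector_size=50, min_count=1, epochs=50)
ft          = FastText(surface, vector_size=50, min_count=1, epochs=50)

# fastText builds vectors from character n-grams, so it can embed unseen
# inflected forms; plain word2vec cannot.
print("evde" in w2v_surface.wv)    # True: surface form seen in training
print(ft.wv["evlerde"][:3])        # out-of-vocabulary form still embeddable
```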
arXiv Detail & Related papers (2020-02-13T10:09:31Z)
- Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation [8.87546236839959]
We propose a morphological word segmentation method on the source side for neural machine translation (NMT).
It incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
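As a toy illustration of the vocabulary-reduction effect (the suffix list and peeling rule below are invented stand-ins for the paper's morphology-aware segmenter):

```python
# Sketch only: morpheme segmentation yields fewer unique units than
# raw surface forms once stems and suffixes are shared across words.
corpus = ["ev", "evde", "evler", "evlerde",
          "okul", "okulda", "okullar", "okullarda"]

SUFFIXES = ["ler", "lar", "de", "da"]   # toy Turkish-style suffixes

def segment(word):
    """Repeatedly peel known suffixes off the end of the word."""
    morphs = []
    done = False
    while not done:
        done = True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                morphs.insert(0, "@@" + s)    # mark bound morphemes
                word = word[: -len(s)]
                done = False
    return [word] + morphs

surface_vocab = set(corpus)
morph_vocab = {m for w in corpus for m in segment(w)}
print(len(surface_vocab), "surface types ->", len(morph_vocab), "morphemes")
# 8 surface types -> 6 morphemes
```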
arXiv Detail & Related papers (2020-01-02T10:05:02Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.