Lexicon and Rule-based Word Lemmatization Approach for the Somali
Language
- URL: http://arxiv.org/abs/2308.01785v1
- Date: Thu, 3 Aug 2023 14:31:57 GMT
- Authors: Shafie Abdi Mohamed, Muhidin Abdullahi Mohamed
- Abstract summary: Lemmatization is a technique used to normalize text by changing morphological derivations of words to their root forms.
This paper pioneers the development of text lemmatization for the Somali language.
We have developed an initial lexicon of 1247 root words and 7173 derivationally related terms enriched with rules for lemmatizing words not present in the lexicon.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lemmatization is a Natural Language Processing (NLP) technique used to
normalize text by changing morphological derivations of words to their root
forms. It is used as a core pre-processing step in many NLP tasks including
text indexing, information retrieval, and machine learning for NLP, among
others. This paper pioneers the development of text lemmatization for the
Somali language, a low-resource language with very limited or no prior
effective adoption of NLP methods and datasets. We especially develop a lexicon
and rule-based lemmatizer for Somali text, which is a starting point for a
full-fledged Somali lemmatization system for various NLP tasks. With
consideration of the language morphological rules, we have developed an initial
lexicon of 1247 root words and 7173 derivationally related terms enriched with
rules for lemmatizing words not present in the lexicon. We have tested the
algorithm on 120 documents of various lengths including news articles, social
media posts, and text messages. Our initial results demonstrate that the
algorithm achieves an accuracy of 57% for relatively long documents (e.g. full
news articles), 60.57% for news article extracts, and 95.87% for short texts
such as social media messages.
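The lexicon-plus-rules design described in the abstract can be sketched as follows. This is a minimal illustrative sketch only: the lexicon entries, the Somali example words, and the suffix-stripping rules are hypothetical placeholders, not the paper's actual 1247-root lexicon or its rule set.

```python
# Sketch of a lexicon and rule-based lemmatizer: look the word up in a
# lexicon of derived-form -> root mappings first, and fall back to
# suffix-stripping rules when the word is not in the lexicon.
# All entries and rules below are hypothetical illustrations.

LEXICON = {
    # derived form -> root word (hypothetical examples)
    "buugaag": "buug",
    "qoraallo": "qoraal",
}

# Ordered suffix-stripping rules tried when a word is absent from the
# lexicon; longer suffixes are listed first so they match before shorter ones.
SUFFIX_RULES = [
    ("yaal", ""),
    ("aag", ""),
    ("lo", "l"),
]

def lemmatize(word: str) -> str:
    """Return the lemma: lexicon lookup first, then rule-based fallback."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require a remaining stem of at least two characters.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[: -len(suffix)] + replacement
    return w  # unknown word with no matching rule: leave unchanged
```

The two-stage order matters: the lexicon captures irregular derivations exactly, while the rules generalize to unseen regular forms, which is consistent with the paper's report of higher accuracy on short texts where most tokens are common lexicon words.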
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Text Categorization Can Enhance Domain-Agnostic Stopword Extraction [3.6048839315645442]
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP).
By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages.
arXiv Detail & Related papers (2024-01-24T11:52:05Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Automatic Textual Normalization for Hate Speech Detection [0.8990550886501417]
Social media data contains a wide range of non-standard words (NSW).
Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization.
Our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model.
arXiv Detail & Related papers (2023-11-12T14:01:38Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text [4.4904382374090765]
Code-mixed text comprises text written in more than one language.
People naturally tend to combine local language with global languages like English.
In this work, we focus on language identification in code-mixed sentences for Hindi-English mixed text.
arXiv Detail & Related papers (2020-11-23T08:08:09Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
- LSCP: Enhanced Large Scale Colloquial Persian Language Understanding [2.7249643773851724]
"Large Scale Colloquial Persian dataset" aims to describe the colloquial language of low-resourced languages.
The proposed corpus consists of 120M sentences resulted from 27M tweets annotated with parsing tree, part-of-speech tags, sentiment polarity and translation in five different languages.
arXiv Detail & Related papers (2020-03-13T22:24:14Z)
- Combining Pretrained High-Resource Embeddings and Subword Representations for Low-Resource Languages [24.775371434410328]
We explore techniques exploiting the qualities of morphologically rich languages (MRLs).
We show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.
arXiv Detail & Related papers (2020-03-09T21:30:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.