Related papers: UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language

URL: http://arxiv.org/abs/2210.16011v1
Date: Fri, 28 Oct 2022 09:29:22 GMT
Title: UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language
Authors: Maksud Sharipov, Ollabergan Yuldashov
Abstract summary: We present a rule-based stemming algorithm for the Uzbek language. The proposed methodology stems Uzbek words with an affix-stripping approach. A lexicon of affixes was created in XML format, and a stemming application for Uzbek words was developed based on finite state machines (FSMs).
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language: many words are formed by adding suffixes, and the number of suffixes is large, which makes it difficult to find the stem of a word. The proposed methodology stems Uzbek words with an affix-stripping approach and does not rely on any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes, and a finite state machine (FSM) is designed for each class according to morphological rules. The fifteen FSMs are linked together to create the Basic FSM. A lexicon of affixes was created in XML format, and a stemming application for Uzbek words was developed based on the FSMs.
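
The affix-stripping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the XML layout, the three affix classes (the paper uses fifteen), the handful of noun suffixes, and the names `LEXICON_XML`, `load_lexicon`, `stem`, and `min_stem_len` are all assumptions made for illustration. The linked FSMs are reduced here to a single chain of states visited from the end of the word toward the stem.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon layout; the paper's actual XML schema is not shown here.
LEXICON_XML = """
<affixes>
  <class name="case">
    <affix>ning</affix><affix>dan</affix><affix>ni</affix>
    <affix>ga</affix><affix>da</affix>
  </class>
  <class name="possessive">
    <affix>imiz</affix><affix>ingiz</affix><affix>im</affix><affix>ing</affix>
  </class>
  <class name="plural">
    <affix>lar</affix>
  </class>
</affixes>
"""


def load_lexicon(xml_text):
    """Parse the XML into {class name: affixes sorted longest first}."""
    root = ET.fromstring(xml_text)
    return {
        cls.get("name"): sorted(
            (affix.text for affix in cls.findall("affix")), key=len, reverse=True
        )
        for cls in root.findall("class")
    }


# Assumed state order when scanning a noun from right to left:
# case suffix, then possessive suffix, then plural suffix.
STATE_ORDER = ["case", "possessive", "plural"]


def stem(word, lexicon, min_stem_len=2):
    """Strip at most one affix per state; no dictionary of stems is consulted."""
    for state in STATE_ORDER:
        for affix in lexicon[state]:
            # Keep at least min_stem_len characters so short words survive.
            if word.endswith(affix) and len(word) - len(affix) >= min_stem_len:
                word = word[: -len(affix)]
                break  # move on to the next state
    return word


LEXICON = load_lexicon(LEXICON_XML)
print(stem("kitoblarimizdan", LEXICON))  # kitob ("from our books" -> "book")
print(stem("maktabda", LEXICON))         # maktab ("at school" -> "school")
print(stem("uylar", LEXICON))            # uy ("houses" -> "house")
```

Because no list of valid stems is consulted, a sketch like this can strip a word-final sequence that merely looks like an affix; the `min_stem_len` guard is only a crude safeguard against that.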

Related papers

Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by vocabulary learning during human Second Language (Arabic) Acquisition, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
Homonym Sense Disambiguation in the Georgian Language [49.1574468325115]
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language. It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawls corpus.
arXiv Detail & Related papers (2024-04-24T21:48:43Z)
A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models. We conduct comprehensive experiments on the English datasets QQP and MRPC and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
UzbekTagger: The rule-based POS tagger for Uzbek language [0.0]
This research paper presents a part-of-speech annotated dataset and tagger tool for the low-resource Uzbek language. The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool. The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool can also serve as a base for other closely related Turkic languages.
arXiv Detail & Related papers (2023-01-30T07:40:45Z)
Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language [0.0]
This paper discusses the construction of a lemmatization algorithm for the Uzbek language. The main purpose of the work is to remove the affixes of words by means of a finite state machine.
arXiv Detail & Related papers (2022-10-28T09:21:06Z)
Creating a morphological and syntactic tagged corpus for the Uzbek language [0.0]
We develop a novel Part Of Speech (POS) and syntactic tagset for creating a syntactically and morphologically tagged corpus of the Uzbek language. Based on the developed annotation tool and software, we share our experience and results from the first stage of tagged corpus creation.
arXiv Detail & Related papers (2022-10-27T07:44:12Z)
MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script [0.05833117322405446]
We exploit the power of word embedding models generated from a corpus of YouTube comments. We have built a normalization dictionary that we refer to as MANorm.
arXiv Detail & Related papers (2022-06-18T10:17:46Z)
The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps Corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
Uzbek affix finite state machine for stemming [0.0]
The proposed methodology performs morphological analysis of Uzbek words by using affixes to find the root, without including any lexicon. This method makes it possible to analyze words from a large amount of text at high speed, and it requires no memory for storing a vocabulary.
arXiv Detail & Related papers (2022-05-20T10:46:53Z)
Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size. We propose a fully compositional output embedding layer for language models. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system. One of the popular approaches to covering OOVs is to use subword units rather than words. In this paper we explore different existing methods for this solution at both the graph construction and search method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches. We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.