Uzbek affix finite state machine for stemming
- URL: http://arxiv.org/abs/2205.10078v1
- Date: Fri, 20 May 2022 10:46:53 GMT
- Title: Uzbek affix finite state machine for stemming
- Authors: Maksud Sharipov, Ulugbek Salaev
- Abstract summary: The proposed methodology performs morphological analysis of Uzbek words by stripping affixes to find the root, without relying on any lexicon.
This method enables high-speed morphological analysis of words in large amounts of text and requires no memory for storing a vocabulary.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents a morphological analyzer for the Uzbek language based on a
finite state machine. The proposed methodology performs morphological analysis of
Uzbek words by stripping affixes to find the root, without relying on any
lexicon. This method enables high-speed morphological analysis of words in
large amounts of text and requires no memory for storing a vocabulary.
Because Uzbek is an agglutinative language, its morphology can be modeled
with finite state machines (FSMs). In contrast to previous
works, this study models complete FSMs for all word classes, applying the
Uzbek language's morphotactic rules in right-to-left order. This paper describes
the stages of the methodology: the classification of the affixes, the
generation of an FSM for each affix class, and their combination into a head
machine that analyzes a word.
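The right-to-left, lexicon-free idea described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the three affix classes, the handful of noun affixes in each, the `MIN_ROOT` threshold, and the function name `strip_affixes` are not taken from the paper, whose affix classification and FSMs cover all word classes. The sketch assumes the common noun order root + plural + possessive + case, so scanning from the right visits case first.

```python
# Illustrative affix classes for Uzbek nouns (an assumed, tiny subset;
# not the paper's affix inventory).
CASE = ("ning", "dan", "ni", "ga", "da")
POSSESSIVE = ("imiz", "ingiz", "im", "ing", "i")
PLURAL = ("lar",)

# Assumed morphotactic order: root + plural + possessive + case.
# A right-to-left scan therefore visits the classes in reverse order,
# and each class acts as one state that consumes at most one affix.
STATES = (("case", CASE), ("possessive", POSSESSIVE), ("plural", PLURAL))

# Hypothetical guard against stripping a word down to nothing.
MIN_ROOT = 2


def strip_affixes(word):
    """Strip affixes right to left; return (root, [(class, affix), ...])."""
    rest = word.lower()
    found = []
    for name, affixes in STATES:
        # Try longer affixes first so "ning" wins over "ni", etc.
        for affix in sorted(affixes, key=len, reverse=True):
            if rest.endswith(affix) and len(rest) - len(affix) >= MIN_ROOT:
                rest = rest[: -len(affix)]
                found.append((name, affix))
                break  # move on to the next state
    return rest, found


if __name__ == "__main__":
    # kitob + lar + imiz + dan ("from our books")
    print(strip_affixes("kitoblarimizdan"))
```

Because no lexicon is consulted, a sketch like this cannot tell a real affix from a root that merely ends in the same letters, so some words will be over-stemmed; the fixed class order is what keeps the search fast and memory-free.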
Related papers
- UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings [0.0]
Affixes play an important role in the morphological analysis of words, by adding additional meanings and grammatical functions to words.
This paper presents a model for the morphological analysis of Uzbek words, including stemming, lemmatizing, and the extraction of morphological information.
The developed tool based on the proposed model is available as a web-based application and an open-source Python library.
arXiv Detail & Related papers (2024-05-23T05:06:55Z) - Homonym Sense Disambiguation in the Georgian Language [49.1574468325115]
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawls corpus.
arXiv Detail & Related papers (2024-04-24T21:48:43Z) - An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - Open-Vocabulary Segmentation with Semantic-Assisted Calibration [73.39366775301382]
We study open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the contextual prior of CLIP.
We present a Semantic-assisted CAlibration Network (SCAN) to achieve state-of-the-art performance on open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2023-12-07T07:00:09Z) - UzbekTagger: The rule-based POS tagger for Uzbek language [0.0]
This research paper presents a part-of-speech annotated dataset and tagger tool for the low-resource Uzbek language.
The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool.
The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool can also serve as a base for other closely related Turkic languages.
arXiv Detail & Related papers (2023-01-30T07:40:45Z) - UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language [0.0]
We present a rule-based stemming algorithm for the Uzbek language.
The proposed methodology stems Uzbek words with an affix-stripping approach.
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.
arXiv Detail & Related papers (2022-10-28T09:29:22Z) - Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language [0.0]
This paper discusses the construction of a lemmatization algorithm for the Uzbek language.
The main purpose of the work is to remove affixes from words by means of a finite state machine.
arXiv Detail & Related papers (2022-10-28T09:21:06Z) - Accuracy of the Uzbek stop words detection: a case study on "School corpus" [0.0]
We present a method to evaluate the quality of a list of stop words produced by automatic creation techniques.
The method was tested on an automatically-generated list of stop words for the Uzbek language.
arXiv Detail & Related papers (2022-09-15T05:14:31Z) - Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) were devised long ago to address context sparsity in n-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.