Development of a rule-based lemmatization algorithm through Finite State
Machine for Uzbek language
- URL: http://arxiv.org/abs/2210.16006v1
- Date: Fri, 28 Oct 2022 09:21:06 GMT
- Authors: Maksud Sharipov, Ogabek Sobirov
- Abstract summary: This paper discusses the construction of a lemmatization algorithm for the Uzbek language.
The main purpose of the work is to remove affixes from words by means of a finite state machine.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lemmatization is one of the core concepts in natural language processing,
so creating a lemmatization tool is an important task. This paper discusses
the construction of a lemmatization algorithm for the Uzbek language. The main
purpose of the work is to remove affixes of words in the Uzbek language by
means of the finite state machine and to identify a lemma (a word that can be
found in the dictionary) of the word. The process of removing affixes uses a
database of affixes and part-of-speech knowledge. The lemmatization method
comprises general rules and part-of-speech data for the Uzbek language, a set of
affixes and their classification, affix removal by a finite state machine
for each class, and identification of the word's lemma.
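The affix-removal idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix classes, the tiny lexicon, and the function name `lemmatize` are assumptions chosen for illustration, and only a few noun suffixes are covered. Each state records which suffix classes may still be stripped, and the word is accepted as a lemma as soon as it appears in the dictionary.

```python
# Illustrative noun suffix classes, listed in the order they attach to a stem.
# This is a tiny hand-picked sample, NOT the affix database from the paper.
SUFFIX_CLASSES = [
    ("plural", ["lar"]),
    ("possessive", ["imiz", "ingiz", "im", "ing"]),
    ("case", ["ning", "dan", "ni", "ga", "da"]),
]

# Hypothetical mini-dictionary used to decide when a candidate is a lemma.
LEXICON = {"kitob", "maktab", "bola"}


def lemmatize(word, lexicon=LEXICON):
    """Strip suffixes right to left; the state limits which classes remain."""
    state = len(SUFFIX_CLASSES)  # start state: every class may still be stripped
    current = word.lower()
    removed = []
    while True:
        if current in lexicon:
            # Accepting state: the remaining string is a dictionary word.
            return current, removed[::-1]
        # Try the outermost class still allowed, then move inward.
        for idx in range(state - 1, -1, -1):
            name, suffixes = SUFFIX_CLASSES[idx]
            match = next(
                (s for s in suffixes
                 if current.endswith(s) and len(current) > len(s)),
                None,
            )
            if match:
                current = current[: -len(match)]
                removed.append((name, match))
                state = idx  # classes at or outside this one are now closed
                break
        else:
            # No transition applies: return the input unchanged (an assumption
            # of this sketch; a fuller system might backtrack instead).
            return word.lower(), []
```

For instance, `lemmatize("kitoblarimdan")` strips `dan`, `im`, and `lar` in turn and returns `kitob` together with the removed suffixes. The sketch is greedy and does not backtrack, so a fuller system would need the part-of-speech knowledge and complete affix classification that the abstract mentions.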
Related papers
- Homonym Sense Disambiguation in the Georgian Language [49.1574468325115]
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawls corpus.
arXiv Detail & Related papers (2024-04-24T21:48:43Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language [0.0]
We present a rule-based stemming algorithm for the Uzbek language.
The proposed methodology stems Uzbek words using an affix-stripping approach.
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.
arXiv Detail & Related papers (2022-10-28T09:29:22Z)
- Accuracy of the Uzbek stop words detection: a case study on "School corpus" [0.0]
We present a method to evaluate the quality of stop word lists produced by automatic creation techniques.
The method was tested on an automatically-generated list of stop words for the Uzbek language.
arXiv Detail & Related papers (2022-09-15T05:14:31Z)
- Context based lemmatizer for Polish language [0.0]
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item.
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
The model achieves the best results for the Polish-language lemmatisation process.
arXiv Detail & Related papers (2022-07-23T18:02:16Z)
- Latent Topology Induction for Understanding Contextualized Representations [84.7918739062235]
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
arXiv Detail & Related papers (2022-06-03T11:22:48Z)
- Uzbek affix finite state machine for stemming [0.0]
The proposed methodology performs morphological analysis of Uzbek words, using affixes to find the root without relying on any lexicon.
This method can analyse words from large amounts of text at high speed, and it requires no memory for storing a vocabulary.
arXiv Detail & Related papers (2022-05-20T10:46:53Z)
- AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z)
- Generalized Optimal Linear Orders [9.010643838773477]
The sequential structure of language, and the order of words in a sentence specifically, plays a central role in human language processing.
In designing computational models of language, the de facto approach is to present sentences to machines with the words ordered in the same order as in the original human-authored sentence.
The very essence of this work is to question the implicit assumption that this is desirable and inject theoretical soundness into the consideration of word order in natural language processing.
arXiv Detail & Related papers (2021-08-13T13:10:15Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.