UzbekTagger: The rule-based POS tagger for Uzbek language
- URL: http://arxiv.org/abs/2301.12711v1
- Date: Mon, 30 Jan 2023 07:40:45 GMT
- Title: UzbekTagger: The rule-based POS tagger for Uzbek language
- Authors: Maksud Sharipov, Elmurod Kuriyozov, Ollabergan Yuldashev, Ogabek Sobirov
- Abstract summary: This research paper presents a part-of-speech annotated dataset and tagger tool for the low-resource Uzbek language.
The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool.
The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool can also serve as a base for other closely related Turkic languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This research paper presents a part-of-speech (POS) annotated dataset and
tagger tool for the low-resource Uzbek language. The dataset includes 12 tags,
which were used to develop a rule-based POS-tagger tool. The corpus text used
in the annotation process was balanced over 20 different fields
in order to ensure its representativeness. Uzbek is an agglutinative
language, so most of the words in an Uzbek sentence are formed by adding
suffixes. This makes POS tagging difficult, because the tagger must find the
stem of each word and the right part of speech it belongs to. The methodology
proposed in this research is to stem words with an affix/suffix
stripping approach, supported by a database of the stem forms of words in the
Uzbek language. The tagger tool was tested on the annotated dataset and showed
high accuracy in identifying and tagging parts of speech in Uzbek text. This
newly presented dataset and tagger tool can be used for a variety of natural
language processing tasks such as language modeling, machine translation, and
text-to-speech synthesis. The presented dataset is the first of its kind to be
made publicly available for Uzbek, and the POS-tagger tool can also
serve as a base for other closely related Turkic languages.
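The affix-stripping idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the stem lexicon, the suffix inventory, the tag names, and the function names `find_stem` and `tag_sentence` are hypothetical and are not taken from the paper's database or tool. It strips suffixes from the end of a word until a known stem is found, then returns the tag stored for that stem.

```python
from typing import List, Optional, Tuple

# Toy stem lexicon (hypothetical entries and tag names, not the paper's database).
STEM_LEXICON = {
    "kitob": "NOUN",   # book
    "maktab": "NOUN",  # school
    "yoz": "VERB",     # write
    "kel": "VERB",     # come
    "katta": "ADJ",    # big
    "men": "PRON",     # I
}

# Toy suffix inventory (a small illustrative subset of Uzbek suffixes).
SUFFIXES = ["ning", "lar", "dan", "gan", "ga", "da", "ni", "im", "di"]


def find_stem(word: str) -> Optional[Tuple[str, str]]:
    """Return (stem, tag) by stripping suffixes until a lexicon stem is found."""
    if word in STEM_LEXICON:
        return word, STEM_LEXICON[word]
    # Try longer suffixes first; backtrack if a strip leads to a dead end.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            result = find_stem(word[: -len(suffix)])
            if result is not None:
                return result
    return None


def tag_sentence(sentence: str, unknown_tag: str = "UNK") -> List[Tuple[str, str, str]]:
    """Tag whitespace-separated tokens as (token, stem, tag) triples."""
    tagged = []
    for token in sentence.lower().split():
        result = find_stem(token)
        stem, tag = result if result is not None else (token, unknown_tag)
        tagged.append((token, stem, tag))
    return tagged


if __name__ == "__main__":
    for token, stem, tag in tag_sentence("kitoblarni yozdi"):
        print(token, stem, tag)
```

In this sketch the lexicon lookup is what stops the stripping: `kitoblarni` loses `ni`, then `lar`, and the remaining `kitob` is found in the lexicon. A real tagger would need a much larger lexicon, ordering constraints between suffixes, and a strategy for ambiguous stems, none of which are modeled here.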
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially for the Swahili language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language [0.0]
We present a rule-based stemming algorithm for the Uzbek language.
The proposed methodology stems Uzbek words with an affix stripping approach.
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on FSMs.
arXiv Detail & Related papers (2022-10-28T09:29:22Z)
- Creating a morphological and syntactic tagged corpus for the Uzbek language [0.0]
We develop a novel Part Of Speech (POS) and syntactic tagset for creating the syntactic and morphologically tagged corpus of the Uzbek language.
Based on the developed annotation tool and the software, we share the results of the first stage of tagged corpus creation.
arXiv Detail & Related papers (2022-10-27T07:44:12Z)
- Accuracy of the Uzbek stop words detection: a case study on "School corpus" [0.0]
We present a method to evaluate the quality of a list of stop words aimed at automatically creating techniques.
The method was tested on an automatically-generated list of stop words for the Uzbek language.
arXiv Detail & Related papers (2022-09-15T05:14:31Z)
- Part-of-Speech Tagging of Odia Language Using statistical and Deep Learning-Based Approaches [0.0]
This work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) for developing an Odia part-of-speech tagger.
It has been observed that the Bi-LSTM model with character sequence features and pre-trained word vectors achieved a significant state-of-the-art result.
arXiv Detail & Related papers (2022-07-07T12:15:23Z)
- Uzbek affix finite state machine for stemming [0.0]
The proposed methodology is a morphological analysis of Uzbek words that uses affixes to find the root without including any lexicon.
This method performs morphological analysis of words from a large amount of text at high speed, and it requires no memory for storing a vocabulary.
arXiv Detail & Related papers (2022-05-20T10:46:53Z)
- Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- ESPnet-ST: All-in-One Speech Translation Toolkit [57.76342114226599]
ESPnet-ST is a new project inside the end-to-end speech processing toolkit ESPnet.
It implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation.
We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines.
arXiv Detail & Related papers (2020-04-21T18:38:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.