Creating a morphological and syntactic tagged corpus for the Uzbek
language
- URL: http://arxiv.org/abs/2210.15234v1
- Date: Thu, 27 Oct 2022 07:44:12 GMT
- Authors: Maksud Sharipov, Jamolbek Mattiev, Jasur Sobirov, Rustam Baltayev
- Abstract summary: We develop a novel Part Of Speech (POS) and syntactic tagset for creating the syntactic and morphologically tagged corpus of the Uzbek language.
Based on the developed annotation tool and software, we report the results of the first stage of tagged corpus creation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, the creation of tagged corpora is becoming one of the most
important tasks of Natural Language Processing (NLP). There are not enough
tagged corpora to build machine learning models for the low-resource Uzbek
language. In this paper, we fill that gap by developing a novel Part Of Speech
(POS) and syntactic tagset for creating a syntactically and morphologically
tagged corpus of the Uzbek language. This work also includes a detailed
description and presentation of a web-based annotation application. Based on
the developed annotation tool and software, we report the results of the
first stage of tagged corpus creation.
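The POS annotation workflow the abstract describes can be illustrated with a minimal rule-based sketch. Note that the suffix rules, tag names, and example words below are hypothetical simplifications for illustration only; they are not the paper's actual tagset or tool:

```python
# Minimal illustrative suffix-based POS tagger sketch.
# The tagset and rules are hypothetical, not the paper's real annotation scheme.
SUFFIX_RULES = [
    ("moq", "VERB"),  # infinitive marker, e.g. "o'qimoq" (to read)
    ("lar", "NOUN"),  # plural marker, e.g. "kitoblar" (books)
    ("li", "ADJ"),    # adjectival suffix, e.g. "kuchli" (strong)
]

def tag_word(word: str) -> str:
    """Return the first tag whose suffix matches, with a noun fallback."""
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            return tag
    return "NOUN"  # default fallback for unmatched words

def tag_sentence(sentence: str) -> list[tuple[str, str]]:
    """Tag each whitespace-separated token of a sentence."""
    return [(w, tag_word(w)) for w in sentence.split()]

print(tag_sentence("kitoblar kuchli"))
```

A real tagger for an agglutinative language like Uzbek would need full morphological analysis rather than surface suffix matching, which is why the paper's corpus pairs a tagset with a dedicated annotation tool.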
Related papers
- FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis
The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
arXiv Detail & Related papers (2024-11-07T10:39:10Z)
- One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks
ByT5-Sanskrit is designed for NLP applications involving the morphologically rich language Sanskrit.
It is easier to deploy and more robust to data not covered by external linguistic resources.
We show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages.
arXiv Detail & Related papers (2024-09-20T22:02:26Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space of signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- UzbekTagger: The rule-based POS tagger for Uzbek language
This research paper presents a part-of-speech annotated dataset and tagger tool for the low-resource Uzbek language.
The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool.
The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool can also serve as a base for other closely related Turkic languages.
arXiv Detail & Related papers (2023-01-30T07:40:45Z)
- TArC: Tunisian Arabish Corpus First complete release
We present the final result of a project on Tunisian Arabic encoded in Arabizi.
The project led to the creation of two integrated and independent resources.
arXiv Detail & Related papers (2022-07-11T11:46:59Z)
- Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches
This research work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
It has been observed that the Bi-LSTM model with character sequence features and pre-trained word vectors achieved a significant state-of-the-art result.
arXiv Detail & Related papers (2022-07-07T12:15:23Z)
- The Open corpus of the Veps and Karelian languages: overview and applications
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation
Existing sign language datasets contain only about 10K-20K pairs of sign videos, gloss annotations and texts.
Data is thus a bottleneck for training effective sign language translation models.
This simple baseline surpasses the previous state-of-the-art results on two sign language translation benchmarks.
arXiv Detail & Related papers (2022-03-08T18:59:56Z)
- Grounded Compositional Outputs for Adaptive Language Modeling
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.