Part of Speech Tagging (POST) of a Low-resource Language using another
Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged
Persian (Farsi) Corpus)
- URL: http://arxiv.org/abs/2201.12793v1
- Date: Sun, 30 Jan 2022 11:49:43 GMT
- Authors: Hossein Hassani
- Abstract summary: Part of Speech Tagging (POST) is essential in developing tagged corpora.
The Kurdish language currently lacks publicly available tagged corpora of proper sizes.
We use a tagged corpus (Bijankhan corpus) in Persian (Farsi) as a close language to Kurdish to develop a POS-tagged lexicon.
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tagged corpora play a crucial role in a wide range of Natural Language
Processing (NLP) tasks. Part of Speech Tagging (POST) is essential in developing
tagged corpora. It is time-consuming, labor-intensive, and costly, so automating
it could make it more affordable. The Kurdish language currently lacks publicly
available tagged corpora of adequate size. Tagging the publicly available
Kurdish corpora would make those resources more useful than raw or segmented
corpora alone. Developing POS-tagged lexicons can assist with this task. We use
a tagged Persian (Farsi) corpus, the Bijankhan corpus, to develop a POS-tagged
lexicon, since Persian is a close language to Kurdish. This paper presents an
approach that leverages the resources of a language close to Kurdish to enrich
Kurdish resources. A partial dataset of the results is publicly available for
non-commercial use under the CC BY-NC-SA 4.0 license at
https://kurdishblark.github.io/. We plan to make the whole tagged corpus
available after further investigation of the outcome. The dataset can help in
developing POS-tagged lexicons for other Kurdish dialects and in automated
tagging of Kurdish corpora.
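The abstract does not say how tags are carried from the Persian corpus to Kurdish words, so the following is only a minimal sketch of one plausible way to do it, not the paper's method. It assumes a hypothetical Persian-to-Kurdish word mapping (`persian_to_kurdish`), toy word pairs, and illustrative coarse tags (`N`, `ADJ`, `V`); the function names are invented for this example.

```python
from collections import Counter, defaultdict


def build_tag_lexicon(tagged_tokens):
    """Map each word to its most frequent tag in a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    # most_common(1) returns [(tag, count)] for the highest-count tag.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def project_lexicon(source_lexicon, word_map):
    """Carry tags over to target-language words through a source->target word map.

    Source words missing from the lexicon are skipped rather than guessed.
    """
    projected = {}
    for source_word, target_word in word_map.items():
        tag = source_lexicon.get(source_word)
        if tag is not None:
            projected[target_word] = tag
    return projected


# Toy stand-in for a tagged Persian corpus (assumption: not real Bijankhan data).
persian_tagged = [("کتاب", "N"), ("خوب", "ADJ"), ("کتاب", "N"), ("رفت", "V")]

# Hypothetical Persian -> Kurdish (Sorani) word mapping, for illustration only.
persian_to_kurdish = {"کتاب": "کتێب", "خوب": "باش"}

persian_lexicon = build_tag_lexicon(persian_tagged)
kurdish_lexicon = project_lexicon(persian_lexicon, persian_to_kurdish)
```

Taking the majority tag per word discards ambiguity (a word that is both a noun and a verb keeps only one tag), which is one reason a projected lexicon of this kind would need the further investigation the abstract mentions.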
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Language and Speech Technology for Central Kurdish Varieties [27.751434601712]
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum.
Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language.
In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish.
arXiv Detail & Related papers (2024-03-04T12:27:32Z)
- Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial Persian Part of Speech Tagging [0.9843385481559193]
This paper introduces a novel corpus, "Colloquial Persian POS" (CPPOS), specifically designed to support colloquial Persian text.
The corpus includes formal and informal text collected from various domains such as political, social, and commercial on Telegram, Twitter, and Instagram.
arXiv Detail & Related papers (2023-10-01T05:06:33Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities [0.76146285961466]
We present a dataset that covers several categories of named entities (NEs) in Kurdish (Sorani).
The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit).
arXiv Detail & Related papers (2023-01-12T12:13:44Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution, from training data collection and modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts [0.76146285961466]
Punkt is an unsupervised machine learning method.
We used Punkt to segment a Kurdish corpus of Sorani dialect, written in Persian-Arabic script.
In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%.
arXiv Detail & Related papers (2020-04-09T06:44:08Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)