Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments
on Kurdish (Sorani) Texts
- URL: http://arxiv.org/abs/2004.14134v2
- Date: Thu, 30 Apr 2020 08:09:11 GMT
- Title: Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments
on Kurdish (Sorani) Texts
- Authors: Roshna Omer Abdulrahman, Hossein Hassani
- Abstract summary: Punkt is an unsupervised machine learning method.
We used Punkt to segment a Kurdish corpus of Sorani dialect, written in Persian-Arabic script.
In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Segmentation is a fundamental step for most Natural Language Processing
tasks. The Kurdish language is a multi-dialect, under-resourced language which
is written in different scripts. The lack of various segmented corpora is one
of the major bottlenecks in Kurdish language processing. We used Punkt, an
unsupervised machine learning method, to segment a Kurdish corpus of Sorani
dialect, written in Persian-Arabic script. According to the literature, studies
that apply Punkt to non-Latin data are scarce. In our experiment, we achieved an
F1 score of 91.10% and had an Error Rate of 16.32%. The high Error Rate is
mainly due to the situation of abbreviations in Kurdish and partly because of
ordinal numerals. The data is publicly available at
https://github.com/KurdishBLARK/KTC-Segmented for non-commercial use under the
CC BY-NC-SA 4.0 licence.
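The abstract reports an F1 score and an Error Rate but does not state how either was computed. The sketch below shows one common way to score sentence-boundary decisions, assuming boundaries are represented as character offsets and assuming the error rate is the share of misclassified boundary candidates. The function name `boundary_scores`, the offset representation, and the toy data are all hypothetical and are not taken from the paper.

```python
def boundary_scores(gold, predicted, candidates):
    """Score predicted sentence boundaries against gold boundaries.

    gold, predicted: sets of character offsets marked as sentence ends.
    candidates: set of offsets of every potential boundary
                (for example, every period in the text).
    All three are assumptions about the data layout, not the paper's format.
    """
    tp = len(gold & predicted)   # boundaries found correctly
    fp = len(predicted - gold)   # boundaries predicted but not in gold
    fn = len(gold - predicted)   # gold boundaries that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Assumed definition: misclassified candidates over all candidates.
    error_rate = (fp + fn) / len(candidates) if candidates else 0.0

    return {"precision": precision, "recall": recall,
            "f1": f1, "error_rate": error_rate}


# Toy example: five candidate positions, three true sentence ends.
scores = boundary_scores(
    gold={10, 40, 70},
    predicted={10, 25, 40},
    candidates={10, 25, 40, 58, 70},
)
print(scores)
```

Under these assumptions, a period after an abbreviation or an ordinal numeral that is wrongly treated as a sentence end counts as a false positive. That is the kind of error the abstract names as the main cause of the high Error Rate.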
Related papers
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic
Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z)
- On the Off-Target Problem of Zero-Shot Multilingual Neural Machine
Translation [104.85258654917297]
We find that failing to encode a discriminative target-language signal leads to off-target translation and a closer lexical distance.
We propose Language Aware Vocabulary Sharing (LAVS) to construct the multilingual vocabulary.
We conduct experiments on a multilingual machine translation benchmark in 11 languages.
arXiv Detail & Related papers (2023-05-18T12:43:31Z)
- The First Parallel Corpora for Kurdish Sign Language [0.76146285961466]
Kurdish Sign Language (KuSL) is the natural language of the Kurdish Deaf people.
We propose an avatar-based automatic translation of Kurdish texts in the Sorani (Central Kurdish) dialect into the Kurdish Sign language.
We tested the understandability of the output and evaluated it using the Bilingual Evaluation Understudy (BLEU).
arXiv Detail & Related papers (2023-05-11T12:10:20Z)
- A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to
Kurdish-BLARK Named Entities [0.76146285961466]
We present a dataset that covers several categories of NEs in Kurdish (Sorani).
The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit).
arXiv Detail & Related papers (2023-01-12T12:13:44Z)
- A Benchmark and Dataset for Post-OCR text correction in Sanskrit [23.45279030301887]
Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation.
We release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books.
arXiv Detail & Related papers (2022-11-15T08:32:18Z)
- Part of Speech Tagging (POST) of a Low-resource Language using another
Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged
Persian (Farsi) Corpus) [0.76146285961466]
Part of Speech Tagging (POST) is essential in developing tagged corpora.
The Kurdish language currently lacks publicly available tagged corpora of proper sizes.
We use a tagged corpus (Bijankhan corpus) in Persian (Farsi) as a close language to Kurdish to develop a POS-tagged lexicon.
arXiv Detail & Related papers (2022-01-30T11:49:43Z)
- Central Kurdish machine translation: First large scale parallel corpus
and experiments [2.099922236065961]
We present the first large scale parallel corpus of Central Kurdish-English, Awta, containing 229,222 pairs of manually aligned translations.
Our best performing systems achieve 22.72 and 16.81 in BLEU score for Ku$\rightarrow$EN and En$\rightarrow$Ku, respectively.
arXiv Detail & Related papers (2021-06-17T08:41:53Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)