Approaches to Corpus Creation for Low-Resource Language Technology: the
Case of Southern Kurdish and Laki
- URL: http://arxiv.org/abs/2304.01319v1
- Date: Mon, 3 Apr 2023 19:36:32 GMT
- Title: Approaches to Corpus Creation for Low-Resource Language Technology: the
Case of Southern Kurdish and Laki
- Authors: Sina Ahmadi, Zahra Azin, Sara Belelli, Antonios Anastasopoulos
- Abstract summary: We describe some of the challenges of such under-represented languages, particularly in writing and standardization.
We also study the task of language identification in light of the other variants of Kurdish and Zaza-Gorani languages.
- Score: 29.27024733066261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major challenges that under-represented and endangered language
communities face in language technology is the lack or paucity of language
data. This is also the case of the Southern varieties of the Kurdish and Laki
languages for which very limited resources are available with insubstantial
progress in tools. To tackle this, we provide a few approaches that rely on the
content of local news websites, a local radio station that broadcasts content
in Southern Kurdish and fieldwork for Laki. In this paper, we describe some of
the challenges of such under-represented languages, particularly in writing and
standardization, and also, in retrieving sources of data and retro-digitizing
handwritten content to create a corpus for Southern Kurdish and Laki. In
addition, we study the task of language identification in light of the other
variants of Kurdish and Zaza-Gorani languages.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts [50.44270798959864]
Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages.
We study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language.
arXiv Detail & Related papers (2024-04-19T04:02:50Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- Language and Speech Technology for Central Kurdish Varieties [27.751434601712]
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum.
Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language.
In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish.
arXiv Detail & Related papers (2024-03-04T12:27:32Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi [2.4737119633827174]
MOLD is the first dataset of its kind compiled for Marathi, opening a new domain for research in low-resource Indo-Aryan languages.
We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers.
arXiv Detail & Related papers (2021-09-08T11:29:44Z)
- Towards Machine Translation for the Kurdish Language [0.0]
Machine translation is the task of translating texts from one language to another using computers.
Kurdish, an Indo-European language, has received little attention in this realm due to the language being less-resourced.
We describe the available scarce parallel data suitable for training a neural machine translation model for Sorani Kurdish-English translation.
arXiv Detail & Related papers (2020-10-12T21:28:57Z)
- Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)