Language and Speech Technology for Central Kurdish Varieties
- URL: http://arxiv.org/abs/2403.01983v1
- Date: Mon, 4 Mar 2024 12:27:32 GMT
- Title: Language and Speech Technology for Central Kurdish Varieties
- Authors: Sina Ahmadi, Daban Q. Jaff, Md Mahfuz Ibn Alam, Antonios
Anastasopoulos
- Abstract summary: Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum.
Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language.
In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish.
- Score: 27.751434601712
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kurdish, an Indo-European language spoken by over 30 million speakers, is
considered a dialect continuum and known for its diversity in language
varieties. Previous studies addressing language and speech technology for
Kurdish handle it in a monolithic way as a macro-language, resulting in
disparities for dialects and varieties for which there are few resources and
tools available. In this paper, we take a step towards developing resources for
language and speech technology for varieties of Central Kurdish, creating a
corpus by transcribing movies and TV series as an alternative to fieldwork.
Additionally, we report the performance of machine translation, automatic
speech recognition, and language identification as downstream tasks evaluated
on Central Kurdish varieties. Data and models are publicly available under an
open license at https://github.com/sinaahmadi/CORDI.
Related papers
- Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach [0.9217021281095906]
We improve the Kurdish TTS system based on Tacotron by training the Kurdish WaveGlow vocoder on a 21-hour central Kurdish speech corpus.
Our adaptive WaveGlow model achieves an impressive MOS of 4.91, which sets a new benchmark for Kurdish speech synthesis.
arXiv Detail & Related papers (2024-09-10T06:23:52Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Multimodal Modeling For Spoken Language Identification [57.94119986116947]
Spoken language identification refers to the task of automatically predicting the spoken language in a given utterance.
We propose MuSeLI, a Multimodal Spoken Language Identification method, which delves into the use of various metadata sources to enhance language identification.
arXiv Detail & Related papers (2023-09-19T12:21:39Z) - Approaches to Corpus Creation for Low-Resource Language Technology: the
Case of Southern Kurdish and Laki [29.27024733066261]
We describe some of the challenges of such under-represented languages, particularly in writing and standardization.
We also study the task of language identification in light of the other variants of Kurdish and Zaza-Gorani languages.
arXiv Detail & Related papers (2023-04-03T19:36:32Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - Cross-lingual Transfer for Speech Processing using Acoustic Language
Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z) - Jira: a Kurdish Speech Recognition System Designing and Building Speech
Corpus and Pronunciation Lexicon [4.226093500082746]
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
Regarding speech corpus, we designed a sentence collection in which the ratio of di-phones in the collection resembles the real data of the Central Kurdish language.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
arXiv Detail & Related papers (2021-02-15T09:27:54Z) - Leveraging Multilingual News Websites for Building a Kurdish Parallel
Corpus [0.6445605125467573]
We present a corpus containing 12,327 translation pairs in the two major dialects of Kurdish, Sorani and Kurmanji.
We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani.
arXiv Detail & Related papers (2020-10-04T11:52:50Z) - Can Multilingual Language Models Transfer to an Unseen Dialect? A Case
Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user generated North-African Arabic as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.