Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for
Natural Language Processing Tasks
- URL: http://arxiv.org/abs/2208.12081v2
- Date: Sat, 8 Jul 2023 20:37:28 GMT
- Title: Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for
Natural Language Processing Tasks
- Authors: Barack Wanjawa, Lilian Wanzare, Florence Indede, Owen McOnyango,
Edward Ombui, Lawrence Muchemi
- Abstract summary: The Kencorpus project intends to bridge the gap by collecting and storing text and speech data.
The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya.
The datasets are useful for downstream machine learning tasks such as model training and translation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Indigenous African languages are categorized as under-served in Natural
Language Processing. They therefore experience poor digital inclusivity and
information access. The processing challenge with such languages has been how
to use machine learning and deep learning models without the requisite data.
The Kencorpus project intends to bridge this gap by collecting and storing text
and speech data that is good enough for data-driven solutions in applications
such as machine translation, question answering and transcription in
multilingual communities. The Kencorpus dataset is a text and speech corpus for
three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data
collection was done by researchers from communities, schools, media, and
publishers. The Kencorpus dataset has a collection of 5,594 items: 4,442
texts (5.6M words) and 1,152 speech files (177 hours). Based on this data,
part-of-speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words
respectively) were developed. We developed 7,537 Question-Answer pairs for Swahili and
created a text translation set of 13,400 sentences from Dholuo and Luhya into
Swahili. The datasets are useful for downstream machine learning tasks such as
model training and translation. We also developed two proof-of-concept systems:
a Kiswahili speech-to-text system and a machine learning system for the Question
Answering task, with results of 18.87% word error rate and 80% Exact Match (EM)
respectively. These initial results are promising for the usability of
Kencorpus by the machine learning community. Kencorpus is one of the few public
domain corpora for these three low-resource languages and forms a basis for
learning and sharing experiences for similar works, especially for low-resource
languages.
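The two metrics quoted in the abstract, word error rate (WER) for speech-to-text and Exact Match (EM) for question answering, can be sketched as follows. This is a minimal illustration of the standard definitions, not the authors' evaluation code: the function names `word_error_rate` and `exact_match`, the whitespace tokenization, and the lowercase/strip normalization are all assumptions, and the example strings are made up for illustration rather than taken from Kencorpus.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are split on whitespace, with no other normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference answer.

    Assumption: answers are compared after lowercasing and trimming
    surrounding whitespace; real evaluations may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    # Illustrative strings only; not drawn from the corpus.
    print(word_error_rate("habari za asubuhi", "habari ya asubuhi"))
    print(exact_match(["Nairobi", "Kisumu"], ["nairobi", "Mombasa"]))
```

Under these definitions, a WER of 18.87% means roughly 19 word-level edits per 100 reference words, and an EM of 80% means four out of five predicted answers matched the reference answer exactly.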
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z)
- Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi [1.03590082373586]
This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi.
Despite Hindi being the 3rd most spoken language worldwide, and Marathi being the 11th most spoken language globally, both languages face limited resources for building efficient Question Answering systems.
We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples.
arXiv Detail & Related papers (2023-08-19T00:39:21Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language [0.0]
This dataset is annotated from raw story texts in Swahili, a low-resource language.
QA datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems.
The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project.
arXiv Detail & Related papers (2022-05-04T23:53:23Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- The first large scale collection of diverse Hausa language datasets [0.0]
Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
arXiv Detail & Related papers (2021-02-13T19:34:20Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)