The first large scale collection of diverse Hausa language datasets
- URL: http://arxiv.org/abs/2102.06991v2
- Date: Tue, 16 Feb 2021 20:13:34 GMT
- Title: The first large scale collection of diverse Hausa language datasets
- Authors: Isa Inuwa-Dutse
- Abstract summary: Hausa is considered a well-studied and documented language among the sub-Saharan African languages.
It is estimated that over 100 million people speak the language.
We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Hausa language belongs to the Afroasiatic phylum and has more
first-language speakers than any other sub-Saharan African language. With a
majority of its speakers residing in the Northern and Southern areas of Nigeria
and the Republic of Niger, respectively, it is estimated that over 100 million
people speak the language, making it one of the most widely spoken Chadic
languages. While Hausa is considered a well-studied and documented language among
the sub-Saharan African languages, it is viewed as a low-resource language from
the perspective of natural language processing (NLP) due to the limited resources
available for NLP-related tasks. This is common to most languages in Africa;
thus, it is crucial to enrich such languages with resources that will support
and speed up various downstream tasks to meet the demands of
modern society. While useful datasets exist, notably from news sites
and religious texts, more diversity is needed in the corpus.
We provide an expansive collection of curated datasets consisting of both
formal and informal forms of the language from reputable websites and online
social media networks, respectively. The collection is larger and more diverse
than existing corpora, providing the first and largest set of Hausa
social media posts to capture the peculiarities of the language. The
collection also includes a parallel dataset, which can be used for tasks
such as machine translation, with applications in areas such as the detection of
spurious or inciteful online content. We describe the curation process, from
collection and preprocessing to how to obtain the data, and proffer some
research problems that could be addressed using the data.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions that have resulted from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages [0.9511471519043974]
We investigate the possibility of cross-lingual transfer from a state-of-the-art (SoTA) deep monolingual model to 6 African languages.
The languages are Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá.
The results show that the hypothesis that deep monolingual models learn some abstractions that generalise across languages holds.
arXiv Detail & Related papers (2022-04-17T20:23:04Z)
- NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis [5.048355865260207]
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria.
The dataset consists of around 30,000 annotated tweets per language.
We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
arXiv Detail & Related papers (2022-01-20T16:28:06Z)
- Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi [13.9876704685177]
Gondi is a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India.
At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences.
The larger goal of the project is collecting enough data in Gondi to build and deploy viable language technologies.
arXiv Detail & Related papers (2020-04-21T20:03:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.