Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
- URL: http://arxiv.org/abs/2110.06744v1
- Date: Wed, 13 Oct 2021 14:25:21 GMT
- Title: Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
- Authors: Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb and Maged S. Al-shaibani
- Abstract summary: textitMasader is the largest public catalogue for Arabic NLP datasets.
We develop a metadata annotation strategy that could be extended to other languages.
- Score: 3.345437353879255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The NLP pipeline has evolved dramatically in the last few years. The first
step in the pipeline is to find suitable annotated datasets to evaluate the
tasks we are trying to solve. Unfortunately, most of the published datasets
lack metadata annotations that describe their attributes. Not to mention, the
absence of a public catalogue that indexes all the publicly available datasets
related to specific regions or languages. When we consider low-resource
dialectical languages, for example, this issue becomes more prominent. In this
paper we create \textit{Masader}, the largest public catalogue for Arabic NLP
datasets, which consists of 200 datasets annotated with 25 attributes.
Furthermore, We develop a metadata annotation strategy that could be extended
to other languages. We also make remarks and highlight some issues about the
current status of Arabic NLP datasets and suggest recommendations to address
them.
Related papers
- Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP)
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild"
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - A diverse Multilingual News Headlines Dataset from around the World [57.37355895609648]
Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide.
It serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles.
arXiv Detail & Related papers (2024-03-28T12:08:39Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z) - Harnessing Explanations: LLM-to-LM Interpreter for Enhanced
Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
arXiv Detail & Related papers (2022-04-28T19:14:21Z) - Low resource language dataset creation, curation and classification:
Setswana and Sepedi -- Extended Abstract [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We propose baselines for classification, and investigate an approach on data augmentation better suited to low-resourced languages.
arXiv Detail & Related papers (2020-03-30T18:03:15Z) - Investigating an approach for low resource language dataset creation,
curation and classification: Setswana and Sepedi [2.3801001093799115]
We create datasets that are focused on news headlines for Setswana and Sepedi.
We also create a news topic classification task.
We investigate an approach on data augmentation, better suited to low resource languages.
arXiv Detail & Related papers (2020-02-18T13:58:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.