Moroccan Dialect -Darija- Open Dataset
- URL: http://arxiv.org/abs/2103.09687v1
- Date: Sun, 28 Feb 2021 13:37:59 GMT
- Title: Moroccan Dialect -Darija- Open Dataset
- Authors: Aissam Outchakoucht, Hamza Es-Samaali
- Abstract summary: Darija Open Dataset (DODa) is an open-source project for the Moroccan dialect.
DODa is arguably the largest open-source collaborative project for Darija-English translation built for Natural Language Processing purposes.
This data paper presents a description of DODa, its features, how it was collected, and a first application in Image Classification using ImageNet labels translated to Darija.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Darija Open Dataset (DODa) is an open-source project for the Moroccan
dialect. With more than 10,000 entries, DODa is arguably the largest open-source
collaborative project for Darija-English translation built for Natural Language
Processing purposes. In fact, besides semantic categorization, DODa also adopts
a syntactic one, presents words under different spellings, offers verb-to-noun
and masculine-to-feminine correspondences, contains the conjugation of hundreds
of verbs in different tenses, and many other subsets to help researchers better
understand and study Moroccan dialect. This data paper presents a description
of DODa, its features, how it was collected, as well as a first application in
Image Classification using ImageNet labels translated to Darija. This
collaborative project is hosted on GitHub under the MIT open-source
license and aims to be a standard resource for researchers, students, and
anyone who is interested in the Moroccan dialect.
Related papers
- Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect [45.755756115243486]
We introduce Atlas-Chat, the first-ever collection of large language models specifically developed for dialectal Arabic.
We construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically.
Atlas-Chat-9B and 2B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks.
arXiv Detail & Related papers (2024-09-26T14:56:38Z)
- DarijaBanking: A New Resource for Overcoming Language Barriers in Banking Intent Detection for Moroccan Arabic Speakers [5.274804664403783]
Navigating the complexities of language diversity is a central challenge in developing robust natural language processing systems.
This paper introduces DarijaBanking, a novel Darija dataset aimed at enhancing intent classification in the banking domain.
DarijaBanking comprises over 1,800 parallel high-quality queries in Darija, Modern Standard Arabic (MSA), English, and French, organized into 24 intent classes.
arXiv Detail & Related papers (2024-05-26T08:33:28Z)
- The Evolution of Darija Open Dataset: Introducing Version 2 [0.0]
DODa stands as the largest collaborative project of its kind for Darija-English translation.
This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements.
arXiv Detail & Related papers (2024-05-14T15:08:32Z)
- Language and Speech Technology for Central Kurdish Varieties [27.751434601712]
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum.
Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language.
In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish.
arXiv Detail & Related papers (2024-03-04T12:27:32Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- Sentiment Analysis Dataset in Moroccan Dialect: Bridging the Gap Between Arabic and Latin Scripted dialect [0.0]
This study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity.
By assembling a diverse range of textual data, we were able to construct a dataset of around 20,000 manually labeled texts in Moroccan dialect.
To dive into sentiment analysis, we conducted a comparative study on multiple Machine learning models to assess their compatibility with our dataset.
arXiv Detail & Related papers (2023-03-28T14:02:42Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)