The Evolution of Darija Open Dataset: Introducing Version 2
- URL: http://arxiv.org/abs/2405.13016v1
- Date: Tue, 14 May 2024 15:08:32 GMT
- Title: The Evolution of Darija Open Dataset: Introducing Version 2
- Authors: Aissam Outchakoucht, Hamza Es-Samaali
- Abstract summary: DODa stands as the largest collaborative project of its kind for Darija-English translation.
This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Darija Open Dataset (DODa) represents an open-source project aimed at enhancing Natural Language Processing capabilities for the Moroccan dialect, Darija. With approximately 100,000 entries, DODa stands as the largest collaborative project of its kind for Darija-English translation. The dataset features semantic and syntactic categorizations, variations in spelling, verb conjugations across multiple tenses, as well as tens of thousands of translated sentences. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications. The availability of such a dataset is critical for developing applications that can accurately understand and generate Darija, thus supporting the linguistic needs of the Moroccan community and potentially extending to similar dialects in neighboring regions. This paper explores the strategic importance of DODa, its current achievements, and the envisioned future enhancements that will continue to promote its use and expansion in the global NLP landscape.
Related papers
- BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation [28.456351723077088]
This dataset is handcrafted in non-English languages first.
Each of these source languages is represented among the 23 languages commonly used by half of the world's population.
arXiv Detail & Related papers (2025-02-06T18:56:37Z)
- WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
- Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages [27.273651323572786]
BhasaAnuvaad is the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments.
Our experiments demonstrate improvements in the translation quality, setting a new standard for Indian language speech translation.
We will release all the code, data and model weights in the open-source, with permissive licenses to promote accessibility and collaboration.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
- Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect [45.755756115243486]
We construct our instruction dataset by consolidating existing Darija language resources.
Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions.
arXiv Detail & Related papers (2024-09-26T14:56:38Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Moroccan Dialect -Darija- Open Dataset [0.0]
Darija Open dataset (DODa) is an open-source project for the Moroccan dialect.
DODa is arguably the largest open-source collaborative project for Darija-English translation built for Natural Language Processing purposes.
This data paper presents a description of DODa, its features, how it was collected, and a first application in Image Classification using ImageNet labels translated to Darija.
arXiv Detail & Related papers (2021-02-28T13:37:59Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)