OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal
Conversations on Online Social Media
- URL: http://arxiv.org/abs/2309.12137v1
- Date: Thu, 21 Sep 2023 14:58:50 GMT
- Title: OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal
Conversations on Online Social Media
- Authors: Fatimah Alzamzami, Abdulmotaleb El Saddik
- Abstract summary: We propose an online social network-based multidialect Arabic dataset that is crafted by contextually translating English tweets into four Arabic dialects.
Our results have shown a superior performance of our neural MT models trained using our dataset.
- Score: 5.2957928879391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While resources for English language are fairly sufficient to understand
content on social media, similar resources in Arabic are still immature. The
main reason that the resources in Arabic are insufficient is that Arabic has
many dialects in addition to the standard version (MSA). Arabs do not use MSA
in their daily communications; rather, they use dialectal versions.
Unfortunately, social users transfer this phenomenon into their use of social
media platforms, which in turn has raised an urgent need for building suitable
AI models for language-dependent applications. Existing machine translation
(MT) systems designed for MSA fail to work well with Arabic dialects. In light
of this, it is necessary to adapt to the informal nature of communication on
social networks by developing MT systems that can effectively handle the
various dialects of Arabic. Unlike for MSA that shows advanced progress in MT
systems, little effort has been exerted to utilize Arabic dialects for MT
systems. While few attempts have been made to build translation datasets for
dialectal Arabic, they are domain dependent and are not OSN cultural-language
friendly. In this work, we attempt to alleviate these limitations by proposing
an online social network-based multidialect Arabic dataset that is crafted by
contextually translating English tweets into four Arabic dialects: Gulf,
Yemeni, Iraqi, and Levantine. To perform the translation, we followed our
proposed guideline framework for content translation, which could be
universally applicable for translation between foreign languages and local
dialects. We validated the authenticity of our proposed dataset by developing
neural MT models for four Arabic dialects. Our results have shown a superior
performance of our NMT models trained using our dataset. We believe that our
dataset can reliably serve as an Arabic multidialectal translation dataset for
informal MT tasks.
Related papers
- ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z) - ALLaM: Large Language Models for Arabic and English [9.881560166505452]
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT)
Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English)
We show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment.
arXiv Detail & Related papers (2024-07-22T05:35:17Z) - AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z) - ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs [1.6381055567716192]
This paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems.
We focus on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic.
We present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma.
arXiv Detail & Related papers (2024-06-26T07:19:51Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - Content-Localization based Neural Machine Translation for Informal
Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic [5.2957928879391]
We propose a framework that localizes contents of high-resource languages to a low-resource language/dialects by utilizing AI power.
We are the first work to provide a parallel translation dataset from/to informal Spanish and French to/from informal Arabic dialects.
arXiv Detail & Related papers (2023-12-12T01:42:41Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - The Effect of Normalization for Bi-directional Amharic-English Neural
Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.