SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System
- URL: http://arxiv.org/abs/2508.02268v1
- Date: Mon, 04 Aug 2025 10:21:11 GMT
- Title: SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System
- Authors: Serry Sibaee, Omer Nacar, Yasser Al-Habashi, Adel Ammar, Wadii Boulila,
- Abstract summary: This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between Modern Standard Arabic (MSA) and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.
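The two-model design described in the abstract (one fine-tuned checkpoint per translation direction) can be sketched as a small dispatch layer. This is a minimal illustration, not the authors' released code: the checkpoint IDs below are hypothetical placeholders, and the inference helper assumes the standard Hugging Face transformers seq2seq API that AraT5v2-based models use.

```python
# Hypothetical sketch of serving a bidirectional dialect MT system such as
# SHAMI-MT: one fine-tuned checkpoint per direction, selected at request time.
# The checkpoint IDs below are placeholders, not the authors' published names.

CHECKPOINTS = {
    ("msa", "shami"): "example/shami-mt-msa-to-shami",   # hypothetical ID
    ("shami", "msa"): "example/shami-mt-shami-to-msa",   # hypothetical ID
}

def select_checkpoint(src: str, tgt: str) -> str:
    """Map a translation direction to its direction-specific checkpoint ID."""
    key = (src.lower(), tgt.lower())
    if key not in CHECKPOINTS:
        raise ValueError(f"unsupported direction: {src} -> {tgt}")
    return CHECKPOINTS[key]

def translate(text: str, src: str, tgt: str) -> str:
    """Translate with the direction-specific model (requires transformers)."""
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = select_checkpoint(src, tgt)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Keeping each direction as its own fine-tuned model, rather than one model with a direction tag, matches the paper's description of two specialized models and avoids interference between the dialectal and MSA generation styles.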
Related papers
- Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic [15.807843278492847]
We introduce a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. We train two novel models based on the FastConformer architecture: one designed specifically for Modern Standard Arabic (MSA) and the other the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA.
arXiv Detail & Related papers (2025-07-18T14:42:18Z) - Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. BiMiBench is a benchmark constructed by systematically editing real news images and subtitles. BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z) - AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. It is the first fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z) - ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z) - ALLaM: Large Language Models for Arabic and English [9.881560166505452]
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT).
Our autoregressive decoder-only models demonstrate how second-language acquisition via vocabulary expansion and pretraining can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English).
We show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment.
arXiv Detail & Related papers (2024-07-22T05:35:17Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal
Conversations on Online Social Media [5.2957928879391]
We propose an online social network-based multidialect Arabic dataset that is crafted by contextually translating English tweets into four Arabic dialects.
Our results show the superior performance of neural MT models trained on our dataset.
arXiv Detail & Related papers (2023-09-21T14:58:50Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten
Arabic Varieties [18.73290429469502]
We assess Bard and ChatGPT regarding their machine translation proficiencies across ten varieties of Arabic.
Our evaluation covers diverse Arabic varieties such as Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level dialectal variants.
On CA and MSA, however, instruction-tuned LLMs trail behind commercial systems such as Google Translate.
arXiv Detail & Related papers (2023-08-06T08:29:16Z) - Discourse Centric Evaluation of Machine Translation with a Densely
Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z) - The Effect of Normalization for Bi-directional Amharic-English Neural
Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)