Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
- URL: http://arxiv.org/abs/2404.00565v1
- Date: Sun, 31 Mar 2024 05:14:38 GMT
- Title: Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
- Authors: Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, Jeanna Matthews
- Abstract summary: A few studies have examined the three Arabic Wikipedia editions: Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY).
We aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially for low-resource languages other than English. Yet only a few studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic Wikipedia edition regarding the massive automatic creation of its articles through template-based translation from English to Arabic without human involvement, flooding the Egyptian Arabic Wikipedia with articles that not only have low-quality content but also do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate this template-translation problem in the Egyptian Arabic Wikipedia by identifying the template-translated articles and their characteristics through exploratory analysis and by building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and use the resulting insights to build multivariate machine learning classifiers that leverage articles' metadata to detect template-translated articles automatically. We then publicly deploy and host the best-performing classifier, XGBoost, as an online application called EGYPTIAN WIKIPEDIA SCANNER and release the extracted, filtered, and labeled datasets to the research community so others can benefit from our datasets and the online, web-based detection system.
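The abstract describes multivariate classifiers trained on article metadata, with XGBoost deployed as the final detector. Below is a minimal sketch of that kind of metadata-based classifier; the file name, feature columns, label column, and hyperparameters are illustrative assumptions, not the authors' released pipeline or schema.

```python
# Minimal sketch: train an XGBoost classifier on per-article metadata features
# to flag template-translated articles. File name, feature names, and
# hyperparameters are assumptions for illustration only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Hypothetical dataset: one row per article, metadata features plus a binary
# label (1 = template-translated, 0 = human-generated).
df = pd.read_csv("arz_articles_metadata.csv")          # assumed file name
feature_cols = ["total_edits", "total_editors",        # assumed feature names
                "total_bytes", "total_characters"]
X, y = df[feature_cols], df["template_translated"]     # assumed label column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```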
Related papers
- ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs [1.6381055567716192]
This paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems.
We focus on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic.
We present the methodologies employed in developing these systems, utilizing large language models such as Llama and Gemma.
arXiv Detail & Related papers (2024-06-26T07:19:51Z) - Arabic Text Sentiment Analysis: Reinforcing Human-Performed Surveys with Wider Topic Analysis [49.1574468325115]
The in-depth study manually analyses 133 ASA papers published in the English language between 2002 and 2020.
The main findings show the different approaches used for ASA: machine learning, lexicon-based and hybrid approaches.
There is a need to develop ASA tools for Arabic text sentiment analysis that can be used in both industry and academia.
arXiv Detail & Related papers (2024-03-04T10:37:48Z) - Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z) - MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - Whose Language Counts as High Quality? Measuring Language Ideologies in
Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z) - Sentiment Analysis in Poems in Misurata Sub-dialect -- A Sentiment Detection in an Arabic Sub-dialect [0.0]
This study focuses on detecting sentiment in poems written in the Misurata Arabic sub-dialect spoken in Libya.
The tools used to detect sentiment in the dataset are scikit-learn (Sklearn) and the Mazajak sentiment tool.
arXiv Detail & Related papers (2021-09-15T10:42:39Z) - New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arabic medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of text classification and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver, and Nephrological).
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, AraBERT, which is based on BERT and trained on a large Arabic corpus, and AraBioNER, which is based on AraBERT and trained on an Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z) - Machine Generation and Detection of Arabic Manipulated and Fake News [8.014703200985084]
We present a novel method for automatically generating Arabic manipulated (and potentially fake) news stories.
Our method is simple and depends only on the availability of true stories, which are abundant online, and a part-of-speech (POS) tagger.
We carry out a human annotation study that casts light on the effects of machine manipulation on text veracity.
We develop the first models for detecting manipulated Arabic news and achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-05T20:50:22Z) - AraDIC: Arabic Document Classification using Image-Based Character Embeddings and Class-Balanced Loss [7.734726150561088]
We propose AraDIC (Arabic Document Image-based Classifier), a novel end-to-end Arabic document classification framework.
AraDIC consists of an image-based character encoder and a classifier, trained end-to-end with a class-balanced loss to handle the long-tailed label distribution (a minimal sketch of such a loss appears after this list).
To the best of our knowledge, this is the first image-based character embedding framework addressing the problem of Arabic text classification.
arXiv Detail & Related papers (2020-06-20T14:25:06Z)
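The AraDIC entry above mentions training with a class-balanced loss to cope with a long-tailed label distribution. Below is a minimal sketch of one common formulation, the effective-number-of-samples weighting of Cui et al. (2019), in PyTorch; the class counts, beta value, and dummy tensors are illustrative assumptions, not AraDIC's exact implementation.

```python
# Minimal sketch of a class-balanced cross-entropy loss: each class is weighted
# by the inverse of its "effective number" of samples, so rare classes count more.
import torch
import torch.nn.functional as F

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Return per-class weights based on the effective number of samples."""
    counts = torch.as_tensor(samples_per_class, dtype=torch.float)
    effective_num = (1.0 - beta ** counts) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    return weights * len(counts) / weights.sum()

# Example: a 4-class problem with a long-tailed label distribution (assumed counts).
weights = class_balanced_weights([5000, 800, 120, 30])
logits = torch.randn(8, 4)                 # dummy model outputs
labels = torch.randint(0, 4, (8,))         # dummy targets
loss = F.cross_entropy(logits, labels, weight=weights)
print(loss.item())
```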