Related papers: ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

URL: http://arxiv.org/abs/2407.19835v1
Date: Mon, 29 Jul 2024 09:45:34 GMT
Title: ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
Authors: Mohammed Khalil, Mohammed Sabry,
Abstract summary: Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature. We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics. We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
Score: 1.8109081066789847
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples that cover a wide array of subjects including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub at \url{https://huggingface.co/datasets/mohamed-khalil/ATHAR}.

Related papers

Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines [0.8944616102795021]
This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions.
arXiv Detail & Related papers (2025-04-30T09:56:36Z)
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs [14.874327728051288]
We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations.
arXiv Detail & Related papers (2025-02-28T19:59:13Z)
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
A Survey of Large Language Models for Arabic Language and its Dialects [0.0]
This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks.
arXiv Detail & Related papers (2024-10-26T17:48:20Z)
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions. We will release the dialectal translation models and benchmarks curated in this study.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z)
GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content. We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality. Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z)
101 Billion Arabic Words Dataset [0.0]
This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models. We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset.
arXiv Detail & Related papers (2024-04-29T13:15:03Z)
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arab medical dataset, which includes two thousand medical documents collected from several Arabic medical websites. The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological) Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert that based on BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.