Gazelle: An Instruction Dataset for Arabic Writing Assistance
- URL: http://arxiv.org/abs/2410.18163v2
- Date: Mon, 04 Nov 2024 19:29:40 GMT
- Title: Gazelle: An Instruction Dataset for Arabic Writing Assistance
- Authors: Samar M. Magdy, Fakhraddin Alwajih, Sang Yun Kwon, Reem Abdel-Salam, Muhammad Abdul-Mageed
- Abstract summary: We present Gazelle, a comprehensive dataset for Arabic writing assistance.
We also offer an evaluation framework designed to enhance Arabic writing assistance tools.
Our findings underscore the need for continuous model training and dataset enrichment.
- Abstract: Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Language Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues, we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.
Related papers
- From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords [10.876144855651608]
This project addresses the scarcity of advanced educational tools tailored for the Arabic language.
By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective.
This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies.
arXiv Detail & Related papers (2025-01-19T12:57:34Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
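The decoding speed-up claimed above follows from a simple fact: the more of the target language a tokenizer's vocabulary covers, the fewer tokens (and thus fewer autoregressive decode steps) a given text requires. A toy sketch, not the AraLLaMA tokenizer, using greedy longest-match tokenization over two hypothetical vocabularies:

```python
# Toy illustration (not the paper's actual tokenizer): a character-level
# vocabulary forces Arabic text into many tokens, while a vocabulary
# expanded with Arabic subwords covers the same text in far fewer steps.

def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

text = "اللغة العربية"                       # "the Arabic language"
char_vocab = set(text)                       # character-level vocabulary
expanded_vocab = char_vocab | {"اللغة", "العربية"}  # add whole-word entries

print(len(tokenize(text, char_vocab)))       # 13: one token per character
print(len(tokenize(text, expanded_vocab)))   # 3: two words plus a space
```

Each generated token costs one forward pass at inference time, so cutting the token count per sentence directly cuts decoding latency.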
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - HATFormer: Historic Handwritten Arabic Text Recognition with Transformers [6.3660090769559945]
Arabic handwriting datasets are smaller than English ones, making it difficult to train generalizable Arabic HTR models.
We propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model.
Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges.
arXiv Detail & Related papers (2024-10-03T03:43:29Z) - Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
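The two-stage recipe above can be sketched in a few lines. This is a minimal simulation with plain lists standing in for model tensors (all names are illustrative, not the paper's actual API): stage 1 grows the embedding matrix alongside the vocabulary and marks only that matrix trainable; stage 2 would unfreeze the full model for continual pre-training on mixed Arabic/English text.

```python
import random

DIM = 4  # toy embedding width

def new_row():
    """A freshly initialized embedding row for a new vocabulary entry."""
    return [random.uniform(-0.1, 0.1) for _ in range(DIM)]

# Stage 0: a "monolingual" model with an English-only vocabulary.
vocab = ["the", "language", "model"]
embeddings = [new_row() for _ in vocab]

# Stage 1: expand the vocabulary with Arabic tokens; grow the embedding
# matrix to match, and train ONLY the embeddings (the transformer body
# stays frozen).
new_tokens = ["اللغة", "العربية"]
vocab += new_tokens
embeddings += [new_row() for _ in new_tokens]
trainable_in_stage1 = {"embeddings"}         # body weights excluded

# Stage 2 (continual pre-training on a mix of Arabic and English corpora)
# would then unfreeze everything -- omitted in this sketch.
print(len(vocab), len(embeddings), sorted(trainable_in_stage1))
```

Training only the new and existing embedding rows first lets the model learn where the Arabic tokens live in its representation space before the (much more expensive) full-model training begins.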
arXiv Detail & Related papers (2024-07-13T21:09:38Z) - Transformer Models in Education: Summarizing Science Textbooks with AraBART, MT5, AraT5, and mBART [4.214194481944042]
We have developed an advanced text summarization system targeting Arabic textbooks.
This system evaluates and extracts the most important sentences found in biology textbooks for the 11th and 12th grades in the Palestinian curriculum.
arXiv Detail & Related papers (2024-06-11T20:14:09Z) - 101 Billion Arabic Words Dataset [0.0]
This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models.
We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files.
The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset.
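The paper does not spell out its cleaning techniques here, but a common baseline for web-scale deduplication is exact matching on a hash of the normalized text: normalize each document, fingerprint it, and keep only the first occurrence of each fingerprint. A minimal sketch under that assumption:

```python
import hashlib
import unicodedata

def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text: NFC form, collapsed whitespace."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each distinct normalized document."""
    seen, unique = set(), []
    for doc in docs:
        h = fingerprint(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = [
    "مرحبا بالعالم",
    "مرحبا   بالعالم",    # same text, differing only in whitespace
    "نص مختلف تماما",
]
print(len(deduplicate(docs)))  # 2
```

Real pipelines typically add near-duplicate detection (e.g. MinHash over shingles) on top of exact hashing, since web text often differs by boilerplate rather than whitespace alone.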
arXiv Detail & Related papers (2024-04-29T13:15:03Z) - From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.
We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Arabic Text Sentiment Analysis: Reinforcing Human-Performed Surveys with Wider Topic Analysis [49.1574468325115]
The in-depth study manually analyses 133 ASA papers published in the English language between 2002 and 2020.
The main findings show the different approaches used for ASA: machine learning, lexicon-based and hybrid approaches.
There is a need to develop ASA tools that can be used in both industry and academia.
arXiv Detail & Related papers (2024-03-04T10:37:48Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends [15.302538985992518]
We provide a comprehensive review of the development of Arabic NER.
Traditional Arabic NER systems focus on feature engineering and designing domain-specific rules.
With the growth of pre-trained language models, Arabic NER has achieved better performance.
arXiv Detail & Related papers (2023-02-07T14:56:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.