Gazelle: An Instruction Dataset for Arabic Writing Assistance
- URL: http://arxiv.org/abs/2410.18163v2
- Date: Mon, 04 Nov 2024 19:29:40 GMT
- Title: Gazelle: An Instruction Dataset for Arabic Writing Assistance
- Authors: Samar M. Magdy, Fakhraddin Alwajih, Sang Yun Kwon, Reem Abdel-Salam, Muhammad Abdul-Mageed
- Abstract summary: We present Gazelle, a comprehensive dataset for Arabic writing assistance.
We also offer an evaluation framework designed to enhance Arabic writing assistance tools.
Our findings underscore the need for continuous model training and dataset enrichment.
- Abstract: Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Language Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues, we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.
Related papers
- From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords [10.876144855651608]
This project addresses the scarcity of advanced educational tools tailored for the Arabic language.
By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective.
This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies.
arXiv Detail & Related papers (2025-01-19T12:57:34Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
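The decoding speed-up claimed above follows from a simple fact: the more of the target language a tokenizer's vocabulary covers, the fewer tokens (and thus fewer autoregressive decode steps) a given text requires. A toy sketch, not the AraLLaMA tokenizer, using greedy longest-match tokenization over two hypothetical vocabularies:

```python
# Toy illustration (not the paper's actual tokenizer): a character-level
# vocabulary forces Arabic text into many tokens, while a vocabulary
# expanded with Arabic subwords covers the same text in far fewer steps.

def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

text = "اللغة العربية"                       # "the Arabic language"
char_vocab = set(text)                       # character-level vocabulary
expanded_vocab = char_vocab | {"اللغة", "العربية"}  # add whole-word entries

print(len(tokenize(text, char_vocab)))       # 13: one token per character
print(len(tokenize(text, expanded_vocab)))   # 3: two words plus a space
```

Each generated token costs one forward pass at inference time, so cutting the token count per sentence directly cuts decoding latency.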
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - HATFormer: Historic Handwritten Arabic Text Recognition with Transformers [6.3660090769559945]
Arabic handwriting datasets are smaller than English ones, making it difficult to train generalizable Arabic HTR models.
We propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model.
Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges.
arXiv Detail & Related papers (2024-10-03T03:43:29Z) - Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
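The two-stage recipe above can be sketched in a few lines. This is a minimal simulation with plain lists standing in for model tensors (all names are illustrative, not the paper's actual API): stage 1 grows the embedding matrix alongside the vocabulary and marks only that matrix trainable; stage 2 would unfreeze the full model for continual pre-training on mixed Arabic/English text.

```python
import random

DIM = 4  # toy embedding width

def new_row():
    """A freshly initialized embedding row for a new vocabulary entry."""
    return [random.uniform(-0.1, 0.1) for _ in range(DIM)]

# Stage 0: a "monolingual" model with an English-only vocabulary.
vocab = ["the", "language", "model"]
embeddings = [new_row() for _ in vocab]

# Stage 1: expand the vocabulary with Arabic tokens; grow the embedding
# matrix to match, and train ONLY the embeddings (the transformer body
# stays frozen).
new_tokens = ["اللغة", "العربية"]
vocab += new_tokens
embeddings += [new_row() for _ in new_tokens]
trainable_in_stage1 = {"embeddings"}         # body weights excluded

# Stage 2 (continual pre-training on a mix of Arabic and English corpora)
# would then unfreeze everything -- omitted in this sketch.
print(len(vocab), len(embeddings), sorted(trainable_in_stage1))
```

Training only the new and existing embedding rows first lets the model learn where the Arabic tokens live in its representation space before the (much more expensive) full-model training begins.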
arXiv Detail & Related papers (2024-07-13T21:09:38Z) - Transformer Models in Education: Summarizing Science Textbooks with AraBART, MT5, AraT5, and mBART [4.214194481944042]
We have developed an advanced text summarization system targeting Arabic textbooks.
This system evaluates and extracts the most important sentences found in biology textbooks for the 11th and 12th grades in the Palestinian curriculum.
arXiv Detail & Related papers (2024-06-11T20:14:09Z) - 101 Billion Arabic Words Dataset [0.0]
This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models.
We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files.
The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset.
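The paper does not spell out its cleaning techniques here, but a common baseline for web-scale deduplication is exact matching on a hash of the normalized text: normalize each document, fingerprint it, and keep only the first occurrence of each fingerprint. A minimal sketch under that assumption:

```python
import hashlib
import unicodedata

def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text: NFC form, collapsed whitespace."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each distinct normalized document."""
    seen, unique = set(), []
    for doc in docs:
        h = fingerprint(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = [
    "مرحبا بالعالم",
    "مرحبا   بالعالم",    # same text, differing only in whitespace
    "نص مختلف تماما",
]
print(len(deduplicate(docs)))  # 2
```

Real pipelines typically add near-duplicate detection (e.g. MinHash over shingles) on top of exact hashing, since web text often differs by boilerplate rather than whitespace alone.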
arXiv Detail & Related papers (2024-04-29T13:15:03Z) - From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.
We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Arabic Text Sentiment Analysis: Reinforcing Human-Performed Surveys with Wider Topic Analysis [49.1574468325115]
The in-depth study manually analyses 133 ASA papers published in the English language between 2002 and 2020.
The main findings show the different approaches used for ASA: machine learning, lexicon-based and hybrid approaches.
There is a need to develop ASA tools that can be used in both industry and academia.
arXiv Detail & Related papers (2024-03-04T10:37:48Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - A Survey on Arabic Named Entity Recognition: Past, Recent Advances, and Future Trends [15.302538985992518]
We provide a comprehensive review of the development of Arabic NER.
Traditional Arabic NER systems focus on feature engineering and designing domain-specific rules.
With the growth of pre-trained language models, Arabic NER has achieved better performance.
arXiv Detail & Related papers (2023-02-07T14:56:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.