CIDAR: Culturally Relevant Instruction Dataset For Arabic
- URL: http://arxiv.org/abs/2402.03177v1
- Date: Mon, 5 Feb 2024 16:44:17 GMT
- Title: CIDAR: Culturally Relevant Instruction Dataset For Arabic
- Authors: Zaid Alyafeai, Khalid Almubarak, Ahmed Ashraf, Deema Alnuhait, Saied
Alshahrani, Gubran A. Q. Abdulrahman, Gamil Ahmed, Qais Gawah, Zead Saleh,
Mustafa Ghaleb, Yousef Ali, Maged S. Al-Shaibani
- Abstract summary: This paper introduces CIDAR, the first open Arabic instruction-tuning dataset culturally-aligned by human reviewers.
CIDAR contains 10,000 instruction and output pairs that represent the Arab region.
- Score: 5.179940415076765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning has emerged as a prominent methodology for teaching Large
Language Models (LLMs) to follow instructions. However, current instruction
datasets predominantly cater to English or are derived from English-dominated
LLMs, resulting in inherent biases toward Western culture. This bias
significantly impacts the linguistic structures of non-English languages such
as Arabic, which has a distinct grammar reflective of the diverse cultures
across the Arab region. This paper addresses this limitation by introducing
CIDAR: https://hf.co/datasets/arbml/CIDAR, the first open Arabic
instruction-tuning dataset culturally-aligned by human reviewers. CIDAR
contains 10,000 instruction and output pairs that represent the Arab region. We
discuss the cultural relevance of CIDAR through analysis and comparison with models
fine-tuned on other datasets. Our experiments show that CIDAR can
help enrich research efforts in aligning LLMs with Arabic culture. All the
code is available at https://github.com/ARBML/CIDAR.
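As a concrete illustration (not taken from the paper or its repository), the sketch below shows how the released dataset could be loaded from the Hugging Face Hub and formatted into instruction-tuning prompts with the `datasets` library. The `train` split name and the `instruction`/`output` column names are assumptions inferred from the abstract's description of "instruction and output pairs".

```python
# Minimal sketch: load CIDAR from the Hugging Face Hub and build prompts.
# Assumed (not confirmed): a "train" split with "instruction"/"output" columns.
from datasets import load_dataset

# Dataset card: https://hf.co/datasets/arbml/CIDAR
dataset = load_dataset("arbml/CIDAR", split="train")

def to_prompt(example):
    # Hypothetical prompt template; a real fine-tuning setup may format
    # instructions differently (e.g., with an Arabic system prompt).
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }

formatted = dataset.map(to_prompt)
print(len(formatted))          # the abstract reports 10,000 pairs
print(formatted[0]["text"])    # inspect one formatted example
```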
Related papers
- Commonsense Reasoning in Arab Culture [6.116784716369165]
We introduce a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering the cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley.
The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries.
The dataset spans 12 daily-life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences.
arXiv Detail & Related papers (2025-02-18T11:49:54Z)
- On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena [10.263201685476492]
This paper aims to uncover the origins of entity-related cultural biases in Language Models (LMs)
We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities.
Our evaluations using CAMeL-2 reveal that LMs show smaller performance gaps between cultures when tested in English than when tested in Arabic.
arXiv Detail & Related papers (2025-01-08T18:15:47Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
- Arabic Dataset for LLM Safeguard Evaluation [62.96160492994489]
This study explores the safety of large language models (LLMs) in Arabic, given its linguistic and cultural complexities.
We present an Arab-region-specific safety evaluation dataset consisting of 5,799 questions, including direct attacks, indirect attacks, and harmless requests with sensitive words.
arXiv Detail & Related papers (2024-10-22T14:12:43Z)
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.
It is the first fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
- ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)