SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia
- URL: http://arxiv.org/abs/2503.17485v1
- Date: Fri, 21 Mar 2025 18:55:10 GMT
- Title: SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia
- Authors: Lama Ayash, Hassan Alhuzali, Ashwag Alasmari, Sultan Aloufi,
- Abstract summary: This research focuses on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions. We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of Large Language Models (LLMs). The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts.
- Score: 0.1499944454332829
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing; however, they often struggle to accurately capture and reflect cultural nuances. This research addresses this challenge by focusing on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions. We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of LLMs within the distinct geographical and cultural contexts of Saudi Arabia. SaudiCulture is a comprehensive dataset of questions covering five major geographical regions, namely the West, East, South, North, and Center, along with general questions applicable across all regions. The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts. To ensure a rigorous evaluation, SaudiCulture includes questions of varying complexity and format, including open-ended, single-choice, and multiple-choice questions, with some requiring multiple correct answers. Additionally, the dataset distinguishes between common cultural knowledge and specialized regional aspects. We conduct extensive evaluations of five LLMs, namely GPT-4, Llama 3.3, FANAR, Jais, and AceGPT, analyzing their performance across different question types and cultural contexts. Our findings reveal that all models experience significant performance declines when faced with highly specialized or region-specific questions, particularly those requiring multiple correct responses. Additionally, certain cultural categories are more easily identifiable than others, further highlighting inconsistencies in the models' cultural understanding. These results emphasize the importance of incorporating region-specific knowledge into LLM training to enhance their cultural competence.
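The abstract does not describe a concrete data schema or scoring rule, but a small sketch may help make the question structure concrete. The record fields (`region`, `domain`, `q_type`, `gold`), the placeholder answer options, and the exact-set-match scorer below are assumptions chosen for illustration, not the authors' implementation; an exact-match rule of this kind would also explain why questions requiring multiple correct answers see the largest performance drops.

```python
# Hypothetical illustration only: field names, question types, and the scoring
# rule are assumptions, not the SaudiCulture authors' actual implementation.
from dataclasses import dataclass


@dataclass
class CultureQuestion:
    text: str       # the question itself
    region: str     # "West", "East", "South", "North", "Center", or "General"
    domain: str     # e.g. "food", "clothing", "entertainment", "celebrations", "crafts"
    q_type: str     # "open-ended", "single-choice", or "multiple-choice"
    gold: set[str]  # one or more correct answers


def score_choice_question(q: CultureQuestion, predicted: set[str]) -> float:
    """Exact-match scoring for choice questions: the model must select every
    correct option and nothing else (assumed rule, for illustration)."""
    return 1.0 if predicted == q.gold else 0.0


if __name__ == "__main__":
    q = CultureQuestion(
        text="Which of the following are traditional dishes of the Western region?",
        region="West",
        domain="food",
        q_type="multiple-choice",
        gold={"Dish A", "Dish B"},  # placeholder options, not actual benchmark content
    )
    print(score_choice_question(q, {"Dish A"}))            # 0.0: partial answer fails
    print(score_choice_question(q, {"Dish A", "Dish B"}))  # 1.0: all correct answers selected
```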
Related papers
- CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are integrating more deeply into human life across various regions.
Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora.
We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
- GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking [29.664707739055068]
We introduce GIMMICK, an extensive benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries.
GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets.
We examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues.
arXiv Detail & Related papers (2025-02-19T14:27:40Z)
- CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs [75.82306181299153]
We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for assessing cultural knowledge.
We evaluate models on two setups, CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently.
Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%.
arXiv Detail & Related papers (2024-10-03T17:04:31Z)
- Benchmarking Vision Language Models for Cultural Understanding [31.898921287065242]
This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing the cultural understanding of Vision Language Models (VLMs).
We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents.
The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions.
arXiv Detail & Related papers (2024-07-15T17:21:41Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
- Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
- CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover the culture perceptions of three SOTA models for 110 countries and regions across 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generations consist of linguistic "markers" that distinguish marginalized cultures from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
- CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models [41.885600036131045]
CDEval is a benchmark aimed at evaluating the cultural dimensions of Large Language Models.
It is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains.
arXiv Detail & Related papers (2023-11-28T02:01:25Z)
- Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions [10.415002561977655]
This research proposes a Cultural Alignment Test (Hofstede's CAT) to quantify cultural alignment using Hofstede's cultural dimension framework.
We quantitatively evaluate large language models (LLMs) against the cultural dimensions of regions like the United States, China, and Arab countries.
Our results quantify the cultural alignment of LLMs and reveal differences between LLMs along the explanatory cultural dimensions.
arXiv Detail & Related papers (2023-08-25T14:50:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.