PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian
 - URL: http://arxiv.org/abs/2502.07459v1
 - Date: Tue, 11 Feb 2025 11:07:44 GMT
 - Title: PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian
 - Authors: Erfan Moosavi Monazzah, Vahid Rahimzadeh, Yadollah Yaghoobzadeh, Azadeh Shakery, Mohammad Taher Pilehvar
 - Abstract summary: PerCul is a dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. We evaluate several state-of-the-art multilingual and Persian-specific LLMs.
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Large language models predominantly reflect Western cultures, largely due to the dominance of English-centric training data. This imbalance presents a significant challenge, as LLMs are increasingly used across diverse contexts without adequate evaluation of their cultural competence in non-English languages, including Persian. To address this gap, we introduce PerCul, a carefully constructed dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. Unlike existing benchmarks, PerCul is curated with input from native Persian annotators to ensure authenticity and to prevent the use of translation as a shortcut. We evaluate several state-of-the-art multilingual and Persian-specific LLMs, establishing a foundation for future research in cross-cultural NLP evaluation. Our experiments show an 11.3% gap between the best closed-source model and the layperson baseline, which widens to 21.3% with the best open-weight model. The dataset is available at: https://huggingface.co/datasets/teias-ai/percul
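 As a minimal illustration of how a story-based multiple-choice benchmark like PerCul could be consumed, the Python sketch below loads the dataset from the Hugging Face Hub and computes accuracy for an arbitrary prediction function. The split name and the field names ("story", "choices", "answer") are assumptions for illustration only and should be verified against the dataset card; the abstract does not specify the schema.

   # Minimal sketch (Python): score a model on PerCul-style multiple-choice items.
   # Assumed: a "test" split and fields "story", "choices", "answer" holding the
   # gold choice text -- verify on https://huggingface.co/datasets/teias-ai/percul.
   from datasets import load_dataset

   def percul_accuracy(predict_fn, split="test"):
       ds = load_dataset("teias-ai/percul", split=split)
       correct = 0
       for ex in ds:
           # predict_fn maps (Persian story, list of choices) -> the chosen option text
           pred = predict_fn(ex["story"], ex["choices"])
           correct += int(pred == ex["answer"])
       return correct / len(ds)

   # Example usage: a trivial baseline that always picks the first option.
   # print(percul_accuracy(lambda story, choices: choices[0]))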
 
       
      
        Related papers
        - MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource   Language Constraints [7.822567458977689]
MyCulture is a benchmark designed to comprehensively evaluate Large Language Models (LLMs) on Malaysian culture. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options. We analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations.
arXiv  Detail & Related papers  (2025-08-07T14:17:43Z) - MELAC: Massive Evaluation of Large Language Models with Alignment of   Culture in Persian Language [0.8182812460605992]
This study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.
arXiv  Detail & Related papers  (2025-08-01T14:46:57Z) - Disentangling Language and Culture for Evaluating Multilingual Large   Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
arXiv  Detail & Related papers  (2025-05-30T14:25:45Z) - Evaluating Large Language Model with Knowledge Oriented Language   Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv  Detail & Related papers  (2025-05-22T12:27:02Z) - From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can flatten cultural norms and interfere with factual knowledge.
arXiv  Detail & Related papers  (2025-05-22T09:00:01Z) - CARE: Aligning Language Models for Regional Cultural Awareness [28.676469530858924]
Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge.
Previous attempts to address this rely on synthetic data and express cultural knowledge only in English.
We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures.
arXiv  Detail & Related papers  (2025-04-07T14:57:06Z) - Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs'   Metacognitive Cultural Intelligence with CQ-Bench [37.63947763066401]
We introduce CQ-Bench, a benchmark designed to assess large language models' capability to infer implicit cultural values.
We generate a dataset of multi-character, conversation-based stories using values from the World Value Survey and GlobalOpinions datasets.
We find that while o1 and Deepseek-R1 models reach human-level performance in value selection, they still fall short in nuanced attitude detection.
In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning.
arXiv  Detail & Related papers  (2025-04-01T18:54:47Z) - Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM   Benchmarking [12.078532717928185]
Large language models (LLMs) continue to exhibit biases toward Western, Anglo-centric, or American cultures.
We introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs.
We find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations.
arXiv  Detail & Related papers  (2025-02-28T22:28:00Z) - Multilingual != Multicultural: Evaluating Gaps Between Multilingual   Capabilities and Cultural Alignment in LLMs [2.5212698425008377]
Large Language Models (LLMs) are becoming increasingly capable across global languages.
However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations.
We compare two families of models: Google's Gemma models and OpenAI's turbo-series.
We find no consistent relationships between language capabilities and cultural alignment.
arXiv  Detail & Related papers  (2025-02-23T11:02:41Z) - Extending LLMs to New Languages: A Case Study of Llama and Persian   Adaptation [36.92567530333872]
We study adding a new language, namely Persian, to a large language model (LLM). We employ a multi-stage approach involving pretraining on monolingual Persian data. We evaluate the model's performance at each stage on generation and classification tasks.
arXiv  Detail & Related papers  (2024-12-17T23:18:06Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv  Detail & Related papers  (2024-09-19T17:59:52Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering   Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv  Detail & Related papers  (2024-06-10T01:59:00Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language   Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv  Detail & Related papers  (2024-05-24T01:49:02Z) - CIVICS: Building a Dataset for Examining Culturally-Informed Values in   Large Language Models [59.22460740026037]
"CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs)
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv  Detail & Related papers  (2024-05-22T20:19:10Z) - Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian   Language? [3.4812080203308984]
Khayyam Challenge (also known as PersianMMLU) is a collection of 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations.
The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language.
arXiv  Detail & Related papers  (2024-04-09T22:38:13Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the   Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv  Detail & Related papers  (2024-02-20T16:02:12Z) - CultureLLM: Incorporating Cultural Differences into Large Language   Models [36.66184989869121]
CultureLLM is a cost-effective solution to incorporate cultural differences into large language models. We fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering rich and low-resource languages. Our human study shows that the generated samples are semantically equivalent to the original samples.
arXiv  Detail & Related papers  (2024-02-09T04:02:43Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv  Detail & Related papers  (2023-09-21T13:20:13Z) 
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     