PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian
 - URL: http://arxiv.org/abs/2502.07459v1
 - Date: Tue, 11 Feb 2025 11:07:44 GMT
 - Title: PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian
 - Authors: Erfan Moosavi Monazzah, Vahid Rahimzadeh, Yadollah Yaghoobzadeh, Azadeh Shakery, Mohammad Taher Pilehvar
 - Abstract summary: PerCul is a dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. We evaluate several state-of-the-art multilingual and Persian-specific LLMs.
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Large language models predominantly reflect Western cultures, largely due to the dominance of English-centric training data. This imbalance presents a significant challenge, as LLMs are increasingly used across diverse contexts without adequate evaluation of their cultural competence in non-English languages, including Persian. To address this gap, we introduce PerCul, a carefully constructed dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. Unlike existing benchmarks, PerCul is curated with input from native Persian annotators to ensure authenticity and to prevent the use of translation as a shortcut. We evaluate several state-of-the-art multilingual and Persian-specific LLMs, establishing a foundation for future research in cross-cultural NLP evaluation. Our experiments show an 11.3% gap between the best closed-source model and the layperson baseline, which widens to 21.3% with the best open-weight model. The dataset is available at: https://huggingface.co/datasets/teias-ai/percul
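 As a minimal illustration of how a story-based multiple-choice benchmark like PerCul could be consumed, the Python sketch below loads the dataset from the Hugging Face Hub and computes accuracy for an arbitrary prediction function. The split name and the field names ("story", "choices", "answer") are assumptions for illustration only and should be verified against the dataset card; the abstract does not specify the schema.

   # Minimal sketch (Python): score a model on PerCul-style multiple-choice items.
   # Assumed: a "test" split and fields "story", "choices", "answer" holding the
   # gold choice text -- verify on https://huggingface.co/datasets/teias-ai/percul.
   from datasets import load_dataset

   def percul_accuracy(predict_fn, split="test"):
       ds = load_dataset("teias-ai/percul", split=split)
       correct = 0
       for ex in ds:
           # predict_fn maps (Persian story, list of choices) -> the chosen option text
           pred = predict_fn(ex["story"], ex["choices"])
           correct += int(pred == ex["answer"])
       return correct / len(ds)

   # Example usage: a trivial baseline that always picks the first option.
   # print(percul_accuracy(lambda story, choices: choices[0]))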
 
       
      
        Related papers
        - MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource   Language Constraints [7.822567458977689]
MyCulture is a benchmark designed to comprehensively evaluate Large Language Models (LLMs) on Malaysian culture. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options. We analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations.
arXiv  Detail & Related papers  (2025-08-07T14:17:43Z) - MELAC: Massive Evaluation of Large Language Models with Alignment of   Culture in Persian Language [0.8182812460605992]
This study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.
arXiv  Detail & Related papers  (2025-08-01T14:46:57Z) - Disentangling Language and Culture for Evaluating Multilingual Large   Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
arXiv  Detail & Related papers  (2025-05-30T14:25:45Z) - Evaluating Large Language Model with Knowledge Oriented Language   Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv  Detail & Related papers  (2025-05-22T12:27:02Z) - From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can flatten cultural norms and interfere with factual knowledge.
arXiv  Detail & Related papers  (2025-05-22T09:00:01Z) - CARE: Aligning Language Models for Regional Cultural Awareness [28.676469530858924]
Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge.
Previous attempts to address this rely on synthetic data and express cultural knowledge only in English.
We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures.
arXiv  Detail & Related papers  (2025-04-07T14:57:06Z) - Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs'   Metacognitive Cultural Intelligence with CQ-Bench [37.63947763066401]
We introduce CQ-Bench, a benchmark designed to assess large language models' capability to infer implicit cultural values.
We generate a dataset of multi-character, conversation-based stories using values from the World Value Survey and GlobalOpinions datasets.
We find that while o1 and Deepseek-R1 models reach human-level performance in value selection, they still fall short in nuanced attitude detection.
In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning.
arXiv  Detail & Related papers  (2025-04-01T18:54:47Z) - Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM   Benchmarking [12.078532717928185]
Large language models (LLMs) continue to exhibit biases toward Western, Anglo-centric, or American cultures.
We introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs.
We find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations.
arXiv  Detail & Related papers  (2025-02-28T22:28:00Z) - Multilingual != Multicultural: Evaluating Gaps Between Multilingual   Capabilities and Cultural Alignment in LLMs [2.5212698425008377]
Large Language Models (LLMs) are becoming increasingly capable across global languages.
However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations.
We compare two families of models: Google's Gemma models and OpenAI's turbo-series.
We find no consistent relationships between language capabilities and cultural alignment.
arXiv  Detail & Related papers  (2025-02-23T11:02:41Z) - Extending LLMs to New Languages: A Case Study of Llama and Persian   Adaptation [36.92567530333872]
We study adding a new language, namely Persian, to a large language model (LLM). We employ a multi-stage approach involving pretraining on monolingual Persian data. We evaluate the model's performance at each stage on generation and classification tasks.
arXiv  Detail & Related papers  (2024-12-17T23:18:06Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv  Detail & Related papers  (2024-09-19T17:59:52Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering   Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv  Detail & Related papers  (2024-06-10T01:59:00Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language   Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv  Detail & Related papers  (2024-05-24T01:49:02Z) - CIVICS: Building a Dataset for Examining Culturally-Informed Values in   Large Language Models [59.22460740026037]
"CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs)
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv  Detail & Related papers  (2024-05-22T20:19:10Z) - Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian   Language? [3.4812080203308984]
Khayyam Challenge (also known as PersianMMLU) is a collection of 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations.
The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language.
arXiv  Detail & Related papers  (2024-04-09T22:38:13Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the   Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv  Detail & Related papers  (2024-02-20T16:02:12Z) - CultureLLM: Incorporating Cultural Differences into Large Language   Models [36.66184989869121]
CultureLLM is a cost-effective solution to incorporate cultural differences into large language models. We fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering rich and low-resource languages. Our human study shows that the generated samples are semantically equivalent to the original samples.
arXiv  Detail & Related papers  (2024-02-09T04:02:43Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv  Detail & Related papers  (2023-09-21T13:20:13Z) 
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     