Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench
- URL: http://arxiv.org/abs/2504.01127v1
- Date: Tue, 01 Apr 2025 18:54:47 GMT
- Title: Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench
- Authors: Ziyi Liu, Priyanka Dey, Zhenyu Zhao, Jen-tse Huang, Rahul Gupta, Yang Liu, Jieyu Zhao
- Abstract summary: We introduce CQ-Bench, a benchmark designed to assess large language models' capability to infer implicit cultural values. We generate a dataset of multi-character, conversation-based stories using values from the World Value Survey and GlobalOpinions datasets. We find that while the o1 and Deepseek-R1 models reach human-level performance in value selection, they still fall short in nuanced attitude detection. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning.
- Score: 37.63947763066401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cultural Intelligence (CQ) refers to the ability to understand unfamiliar cultural contexts, a crucial skill for large language models (LLMs) to effectively engage with globally diverse users. While existing research often focuses on explicitly stated cultural norms, such approaches fail to capture the subtle, implicit values that underlie real-world conversations. To address this gap, we introduce CQ-Bench, a benchmark specifically designed to assess LLMs' capability to infer implicit cultural values from natural conversational contexts. We generate a dataset of multi-character, conversation-based stories using values from the World Value Survey and GlobalOpinions datasets, with topics spanning ethical, religious, social, and political issues. Our dataset construction pipeline includes rigorous validation procedures (incorporation, consistency, and implicitness checks) using GPT-4o, with 98.2% human-model agreement in the final validation. Our benchmark consists of three tasks of increasing complexity: attitude detection, value selection, and value extraction. We find that while the o1 and Deepseek-R1 models reach human-level performance in value selection (0.809 and 0.814), they still fall short in nuanced attitude detection, with F1 scores of 0.622 and 0.635, respectively. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning. Notably, fine-tuning smaller models (e.g., LLaMA-3.2-3B) on only 500 culturally rich examples improves performance by over 10%, even outperforming stronger baselines (o3-mini) in some cases. Using CQ-Bench, we provide insights into the current challenges in LLMs' CQ research and suggest practical pathways for enhancing LLMs' cross-cultural reasoning abilities.
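The attitude-detection results above are F1 scores over per-character attitude labels. As a rough illustration of how such a task can be scored, the following is a minimal sketch, not the authors' released evaluation code: the `model.generate` interface, the three-way label set, and the item field names are all assumptions.

```python
# Hypothetical sketch of scoring an attitude-detection task: for each
# (story, character, value) item the model must infer the character's
# implicit attitude, and performance is summarized with macro F1.
# The model API, label set, and item fields are assumptions, not taken
# from the paper's released code.
from sklearn.metrics import f1_score

ATTITUDES = ["agree", "neutral", "disagree"]  # assumed label set

def detect_attitude(model, story: str, character: str, value: str) -> str:
    """Ask the model which attitude `character` implicitly holds toward `value`."""
    prompt = (
        f"Conversation:\n{story}\n\n"
        f"Based only on the conversation, what is {character}'s attitude toward "
        f"the value '{value}'? Answer with exactly one word: "
        + ", ".join(ATTITUDES) + "."
    )
    answer = model.generate(prompt).strip().lower()  # hypothetical model API
    return answer if answer in ATTITUDES else "neutral"  # fallback on unparsable output

def evaluate_attitude_detection(model, items) -> float:
    """items: iterable of dicts with 'story', 'character', 'value', 'label' keys."""
    gold = [item["label"] for item in items]
    pred = [detect_attitude(model, item["story"], item["character"], item["value"])
            for item in items]
    return f1_score(gold, pred, labels=ATTITUDES, average="macro")
```

In practice the prompt wording and label granularity would follow the benchmark's actual annotation scheme; the point here is only the shape of the loop: one implicit-attitude query per item, aggregated with macro F1.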
Related papers
- CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives [3.7731230532888036]
CLASH (Character perspective-based LLM Assessments in Situations with High-stakes) is a dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values.
Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent.
arXiv Detail & Related papers (2025-04-15T02:54:16Z)
- CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are integrating more deeply into human life across various regions.
Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora.
We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
- Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task.
We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z)
- Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs [7.802103248428407]
We identify and test three assumptions behind current survey-based evaluation methods.
We find a high level of instability across presentation formats, incoherence between evaluated and held-out cultural dimensions, and erratic behavior under prompt steering; one way to measure such format instability is sketched after this entry.
arXiv Detail & Related papers (2025-03-11T17:59:53Z)
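To make the "instability across presentation formats" finding concrete, one simple probe is to re-ask the same survey question with its answer options reordered and measure how often the model's response departs from its modal answer. The sketch below is an assumed setup (the `model.generate` call and the prompt template are hypothetical), not the paper's actual protocol.

```python
# Assumed illustration of format instability: present the same survey
# question with its options in different orders and report the fraction
# of variants whose answer deviates from the modal answer.
# The model interface and prompt template are hypothetical.
from collections import Counter
from itertools import islice, permutations

def ask(model, question: str, options: list[str]) -> str:
    labels = [chr(ord("A") + i) for i in range(len(options))]
    listing = "\n".join(f"{lab}. {opt}" for lab, opt in zip(labels, options))
    prompt = f"{question}\n{listing}\nReply with the text of one option only."
    return model.generate(prompt).strip()  # hypothetical model API

def format_instability(model, question: str, options: list[str],
                       n_variants: int = 6) -> float:
    """Fraction of option-order variants disagreeing with the modal answer."""
    variants = islice(permutations(options), n_variants)
    answers = [ask(model, question, list(v)) for v in variants]
    _, modal_count = Counter(answers).most_common(1)[0]
    return 1.0 - modal_count / len(answers)
```

A perfectly stable model scores 0.0; higher values indicate sensitivity to presentation format alone.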
- PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian [19.816050739495573]
PerCul is a dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. We evaluate several state-of-the-art multilingual and Persian-specific LLMs.
arXiv Detail & Related papers (2025-02-11T11:07:44Z)
- Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning [13.034603322224548]
We present a simple and inexpensive method that uses a combination of in-context learning (ICL) and human survey data.
We show that our method is useful in test languages other than English and can improve alignment with the cultural values of a range of culturally diverse countries; a minimal sketch of the idea follows this entry.
arXiv Detail & Related papers (2024-08-29T12:18:04Z)
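A minimal sketch of the in-context-learning idea, under stated assumptions: a handful of human survey question-answer pairs from a target country (e.g., drawn from the World Values Survey) are prepended to the prompt so the model answers new questions consistently with those values. The prompt template, the `survey_examples` schema, and the example data below are hypothetical, not the paper's exact format.

```python
# Hedged sketch of ICL-based cultural self-alignment: prepend survey
# exemplars from the target country before the new question.
# Template, field names, and example data are assumptions for
# illustration only.
def build_aligned_prompt(survey_examples: list[dict], new_question: str,
                         country: str) -> str:
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA (typical respondent from {country}): {ex['answer']}"
        for ex in survey_examples
    )
    return (
        f"The following answers reflect values commonly held in {country}.\n\n"
        f"{shots}\n\n"
        f"Q: {new_question}\nA (typical respondent from {country}):"
    )

# Example usage with made-up survey data:
examples = [
    {"question": "How important is family in your life?",
     "answer": "Very important"},
    {"question": "Is it justifiable to avoid a fare on public transport?",
     "answer": "Never justifiable"},
]
prompt = build_aligned_prompt(
    examples, "How important is religion in your life?", "Brazil")
```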
- Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios [29.56889133557681]
This research evaluates the performance of seven leading Large Language Models (LLMs) in sentiment analysis on a dataset derived from WhatsApp chats.
We find that while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled with understanding linguistic and contextual nuances.
GPT-4 and GPT-4-Turbo excelled in grasping diverse linguistic inputs and managing various contextual information.
arXiv Detail & Related papers (2024-06-01T07:36:59Z)
- CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
- NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models [26.64843536942309]
Large language models (LLMs) may need to adapt outputs to user values and cultures, not just know about them. We introduce NormAd, an evaluation framework to assess LLMs' cultural adaptability. We create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette-related cultural norms from 75 countries.
arXiv Detail & Related papers (2024-04-18T18:48:50Z)
- CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset built from users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.