Related papers: NormAd: A Benchmark for Measuring the Cultural Adaptability of Large Language Models

NormAd: A Benchmark for Measuring the Cultural Adaptability of Large Language Models

URL: http://arxiv.org/abs/2404.12464v5
Date: Thu, 11 Jul 2024 14:05:59 GMT
Title: NormAd: A Benchmark for Measuring the Cultural Adaptability of Large Language Models
Authors: Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, Maarten Sap,
Abstract summary: We introduce NormAd, a novel dataset that includes 2.6k stories that represent social and cultural norms from 75 countries. Our study reveals that LLMs struggle with cultural reasoning across all contextual granularities. Evaluation on NormAd reveals that LLMs struggle to adapt to stories involving gift-giving across cultures.
Score: 26.64843536942309
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The integration of large language models (LLMs) into various global cultures fundamentally presents a challenge: LLMs must navigate interactions, respect social norms, and avoid transgressing cultural boundaries. However, it is still unclear if LLMs can adapt their outputs to diverse cultural norms. Our study focuses on this aspect. We introduce NormAd, a novel dataset, which includes 2.6k stories that represent social and cultural norms from 75 countries, to assess the ability of LLMs to adapt to different granular levels of socio-cultural contexts such as the country of origin, its associated cultural values, and prevalent social norms. Our study reveals that LLMs struggle with cultural reasoning across all contextual granularities, showing stronger adaptability to English-centric cultures over those from the Global South. Even with explicit social norms, the top-performing model, Mistral-7b-Instruct, achieves only 81.8% accuracy, lagging behind the 95.6% achieved by humans. Evaluation on NormAd further reveals that LLMs struggle to adapt to stories involving gift-giving across cultures. Due to inherent agreement or sycophancy biases, LLMs find it considerably easier to assess the social acceptability of stories that adhere to norms than those that deviate. Our benchmark measures the cultural adaptability (or lack thereof) of LLMs, emphasizing the potential to make these technologies more equitable and useful for global audiences. We release the NormAd dataset and its associated code on GitHub.

Related papers

Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment [16.702098536881127]
We introduce CulturalPersonas, the first large-scale benchmark with human validation for evaluating personality expression in behaviorally rich contexts.<n>Our dataset spans 3,000 scenario-based questions across six diverse countries, designed to elicit personality through everyday scenarios rooted in local values.<n>Our results show that CulturalPersonas improves alignment with country-specific human personality distributions.
arXiv Detail & Related papers (2025-06-06T01:33:19Z)
Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding? [17.231806929840015]
We evaluate five Indic and five global LLMs along two key dimensions: values and practices.<n>Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models.<n>We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data.
arXiv Detail & Related papers (2025-05-25T01:59:23Z)
From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges.<n>Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data.<n>In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data cane cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z)
Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies [58.88053690412802]
Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants.<n> CROSS is a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs.<n>We evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models.
arXiv Detail & Related papers (2025-05-20T23:20:38Z)
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models [20.411764699679058]
We show that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. We identify an explicit cultural customization vector, conserved across all non-English languages, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world.
arXiv Detail & Related papers (2025-04-14T12:53:58Z)
CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) more deeply integrate into human life across various regions. Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora. We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task. We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z)
Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench [37.63947763066401]
We introduce CQ-Bench, a benchmark designed to assess large language models' capability to infer implicit cultural values. We generate a multi-character conversation-based stories dataset using values from the World Value Survey and GlobalOpinions datasets. We find that while o1 and Deepseek-R1 models reach human-level performance in value selection, they still fall short in nuanced attitude detection. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning.
arXiv Detail & Related papers (2025-04-01T18:54:47Z)
Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology [4.079147243688765]
Large language models (LLMs) closely interact with humans, and need an intimate understanding of the cultural values of human society. Our analysis shows that LLMs can judge socio-cultural norms similar to humans but less so on social systems and progress. Increasing model size helps a better understanding of social values, but smaller models can be enhanced by using synthetic data.
arXiv Detail & Related papers (2024-12-12T00:52:11Z)
SafeWorld: Geo-Diverse Safety Alignment [107.84182558480859]
We introduce SafeWorld, a novel benchmark specifically designed to evaluate Large Language Models (LLMs) SafeWorld encompasses 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. Our trained SafeWorldLM outperforms all competing models, including GPT-4o on all three evaluation dimensions by a large margin.
arXiv Detail & Related papers (2024-12-09T13:31:46Z)
Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z)
Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models [4.771099208181585]
LLMs are increasingly deployed in global applications, ensuring users from diverse backgrounds feel respected and understood. Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values. We present two key contributions: A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and a culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators.
arXiv Detail & Related papers (2024-10-15T18:13:10Z)
Methodology of Adapting Large English Language Models for Specific Cultural Contexts [10.151487049108626]
We propose a rapid adaptation method for large models in specific cultural contexts. The adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values.
arXiv Detail & Related papers (2024-06-26T09:16:08Z)
CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection. It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs. We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models [38.932610459192105]
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs) Our work underscores the value of using diverse data to create more inclusive multimodal systems.
arXiv Detail & Related papers (2024-05-22T16:04:22Z)
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding. This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations. We discover that culture-conditioned generation consist of linguistic "markers" that distinguish marginalized cultures apart from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z)
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs) LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)
Sociocultural Norm Similarities and Differences via Situational Alignment and Explainable Textual Entailment [31.929550141633218]
We propose a novel approach to discover and compare social norms across Chinese and American cultures. We build a high-quality dataset of 3,069 social norms aligned with social situations across Chinese and American cultures. To test the ability of models to reason about social norms across cultures, we introduce the task of explainable social norm entailment.
arXiv Detail & Related papers (2023-05-23T19:43:47Z)
NormSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly [61.77957329364812]
We introduce a framework for addressing the novel task of conversation-grounded multi-lingual, multi-cultural norm discovery. NormSAGE elicits knowledge about norms through directed questions representing the norm discovery task and conversation context. It further addresses the risk of language model hallucination with a self-verification mechanism ensuring that the norms discovered are correct.
arXiv Detail & Related papers (2022-10-16T18:30:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.