NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models
- URL: http://arxiv.org/abs/2404.12464v9
- Date: Thu, 06 Mar 2025 16:13:04 GMT
- Title: NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models
- Authors: Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, Maarten Sap
- Abstract summary: Large language models (LLMs) may need to adapt outputs to user values and cultures, not just know about them. We introduce NormAd, an evaluation framework to assess LLMs' cultural adaptability. We create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette-related cultural norms from 75 countries.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To be effectively and safely deployed to global user populations, large language models (LLMs) may need to adapt outputs to user values and cultures, not just know about them. We introduce NormAd, an evaluation framework to assess LLMs' cultural adaptability, specifically measuring their ability to judge social acceptability across varying levels of cultural norm specificity, from abstract values to explicit social norms. As an instantiation of our framework, we create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette-related cultural norms from 75 countries. Through comprehensive experiments on NormAd-Eti, we find that LLMs struggle to accurately judge social acceptability across these varying degrees of cultural context and show stronger adaptability to English-centric cultures than to those from the Global South. Even in the simplest setting, where the relevant social norms are provided, the best LLMs' performance (<82%) lags behind humans (>95%). In settings with abstract values and country information, model performance drops substantially (<60%), while human accuracy remains high (>90%). Furthermore, we find that models are better at recognizing socially acceptable than socially unacceptable situations. Our findings showcase current pitfalls in the socio-cultural reasoning of LLMs, which hinder their adaptability for global audiences.
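To make the evaluation setting concrete, below is a minimal sketch of a NormAd-style accuracy loop in Python. The field names (story, context, label), the yes/no/neutral answer space, and the model interface (a str -> str callable) are illustrative assumptions, not the actual NormAd-Eti schema or prompts. The context field is what varies across specificity levels: an abstract value, a country name, or an explicit social norm.

```python
# Minimal sketch of a NormAd-style evaluation loop (hypothetical schema,
# answer space, and model interface; the real benchmark may differ).
from dataclasses import dataclass

@dataclass
class Item:
    story: str    # situational description of a social interaction
    context: str  # cultural context at one specificity level:
                  # an abstract value, a country name, or an explicit norm
    label: str    # gold acceptability judgment, e.g. "yes" / "no" / "neutral"

def judge(model, item: Item) -> str:
    """Ask the model whether the behavior in the story is socially
    acceptable, given the supplied level of cultural context."""
    prompt = (
        f"Cultural context: {item.context}\n"
        f"Situation: {item.story}\n"
        "Is the behavior socially acceptable? Answer yes, no, or neutral."
    )
    return model(prompt).strip().lower()

def accuracy(model, items: list[Item]) -> float:
    """Fraction of items where the model's judgment matches the gold label;
    the paper reports this separately per context-specificity level."""
    return sum(judge(model, it) == it.label for it in items) / len(items)
```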
Related papers
- Localized Cultural Knowledge is Conserved and Controllable in Large Language Models [20.411764699679058]
We show that explicitly providing cultural context in prompts significantly improves the models' ability to generate culturally localized responses.
Despite the benefit of explicit prompting, however, the answers become less diverse and tend toward stereotypes.
We identify an explicit cultural customization vector, conserved across all non-English languages, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world.
arXiv Detail & Related papers (2025-04-14T12:53:58Z) - CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are becoming more deeply integrated into human life across various regions.
Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora.
We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z) - Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task.
We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z) - Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Metacognitive Cultural Intelligence with CQ-Bench [37.63947763066401]
We introduce CQ-Bench, a benchmark designed to assess large language models' capability to infer implicit cultural values.
We generate a dataset of multi-character, conversation-based stories using values from the World Values Survey and GlobalOpinions datasets.
We find that while o1 and Deepseek-R1 models reach human-level performance in value selection, they still fall short in nuanced attitude detection.
In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning.
arXiv Detail & Related papers (2025-04-01T18:54:47Z) - Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology [4.079147243688765]
Large language models (LLMs) interact closely with humans and need an intimate understanding of the cultural values of human society.
Our analysis shows that LLMs can judge socio-cultural norms similarly to humans, but less so for social systems and progress.
Increasing model size improves understanding of social values, but smaller models can be enhanced by using synthetic data.
arXiv Detail & Related papers (2024-12-12T00:52:11Z) - SafeWorld: Geo-Diverse Safety Alignment [107.84182558480859]
We introduce SafeWorld, a novel benchmark specifically designed to evaluate Large Language Models (LLMs) on geo-diverse safety alignment.
SafeWorld encompasses 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races.
Our trained SafeWorldLM outperforms all competing models, including GPT-4o, on all three evaluation dimensions by a large margin.
arXiv Detail & Related papers (2024-12-09T13:31:46Z) - Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z) - Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models [4.771099208181585]
As LLMs are increasingly deployed in global applications, it is essential that users from diverse backgrounds feel respected and understood.
Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values.
We present two key contributions: a cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and a culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators.
arXiv Detail & Related papers (2024-10-15T18:13:10Z) - Methodology of Adapting Large English Language Models for Specific Cultural Contexts [10.151487049108626]
We propose a rapid adaptation method for large models in specific cultural contexts.
The adapted LLM shows significantly enhanced domain-specific knowledge and better adaptation to safety values.
arXiv Detail & Related papers (2024-06-26T09:16:08Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models [38.932610459192105]
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs).
Our work underscores the value of using diverse data to create more inclusive multimodal systems.
arXiv Detail & Related papers (2024-05-22T16:04:22Z) - Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z) - CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover the cultural perceptions of three SOTA models across 110 countries and regions on 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generations consist of linguistic "markers" that distinguish marginalized cultures from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z) - Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs).
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z) - Sociocultural Norm Similarities and Differences via Situational Alignment and Explainable Textual Entailment [31.929550141633218]
We propose a novel approach to discover and compare social norms across Chinese and American cultures.
We build a high-quality dataset of 3,069 social norms aligned with social situations across Chinese and American cultures.
To test the ability of models to reason about social norms across cultures, we introduce the task of explainable social norm entailment.
arXiv Detail & Related papers (2023-05-23T19:43:47Z) - NormSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly [61.77957329364812]
We introduce a framework for addressing the novel task of conversation-grounded multi-lingual, multi-cultural norm discovery.
NormSAGE elicits knowledge about norms through directed questions representing the norm discovery task and conversation context.
It further addresses the risk of language model hallucination with a self-verification mechanism ensuring that the discovered norms are correct (a minimal sketch of this discover-then-verify loop follows the list).
arXiv Detail & Related papers (2022-10-16T18:30:05Z)
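To make the discover-then-verify idea concrete, here is a minimal sketch loosely in the spirit of NormSAGE: a directed question elicits a candidate norm from a conversation, and a second pass asks the model to check its own output against the source dialogue. The prompts and the model interface (a str -> str callable) are illustrative assumptions, not the actual NormSAGE pipeline.

```python
# Minimal sketch of conversation-grounded norm discovery with
# self-verification (hypothetical prompts and model interface;
# the real NormSAGE pipeline differs in detail).

def discover_norm(model, conversation: str, question: str) -> str:
    """Elicit a candidate norm via a directed question over the dialogue."""
    prompt = f"Conversation:\n{conversation}\n\nQuestion: {question}\nNorm:"
    return model(prompt).strip()

def self_verify(model, conversation: str, norm: str) -> bool:
    """Check the model's own output against the source dialogue to reduce
    the risk of hallucinated norms."""
    prompt = (
        f"Conversation:\n{conversation}\n\n"
        f"Candidate norm: {norm}\n"
        "Is this norm actually supported by the conversation? Answer yes or no."
    )
    return model(prompt).strip().lower().startswith("yes")

def discover_verified_norms(model, conversation: str,
                            questions: list[str]) -> list[str]:
    """Keep only candidate norms that pass the self-verification check."""
    candidates = [discover_norm(model, conversation, q) for q in questions]
    return [n for n in candidates if self_verify(model, conversation, n)]
```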