Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4
- URL: http://arxiv.org/abs/2505.18322v1
- Date: Fri, 23 May 2025 19:28:00 GMT
- Title: Is It Bad to Work All the Time? Cross-Cultural Evaluation of Social Norm Biases in GPT-4
- Authors: Zhuozhuo Joy Liu, Farhan Samir, Mehar Bhatia, Laura K. Nelson, Vered Shwartz
- Abstract summary: We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. While it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model.
- Score: 10.665717152882312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs have been demonstrated to align with the values of Western or North American cultures. Prior work predominantly showed this effect through leveraging surveys that directly ask (originally people and now also LLMs) about their values. However, it is hard to believe that LLMs would consistently apply those values in real-world scenarios. To address that, we take a bottom-up approach, asking LLMs to reason about cultural norms in narratives from different cultures. We find that GPT-4 tends to generate norms that, while not necessarily incorrect, are significantly less culture-specific. In addition, while it avoids overtly generating stereotypes, the stereotypical representations of certain cultures are merely hidden rather than suppressed in the model, and such stereotypes can be easily recovered. Addressing these challenges is a crucial step towards developing LLMs that fairly serve their diverse user base.
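To make the bottom-up probing concrete, below is a minimal sketch of a norm-elicitation query, assuming the OpenAI chat completions client; the narrative and prompt wording are illustrative placeholders, not the paper's materials. The paper's finding is that responses to probes of this kind tend to be generic rather than culture-specific.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative narrative; not taken from the paper's data.
narrative = (
    "After dinner at her manager's home, Priya insisted on helping to clear "
    "the table, but her host gently waved her back to her seat."
)

prompt = (
    "Read the story below and state, in one sentence, the social norm that "
    "best explains the host's behaviour. Be as culture-specific as the story "
    f"allows.\n\nStory: {narrative}\nNorm:"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```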
Related papers
- Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding? [17.231806929840015]
We evaluate five Indic and five global LLMs along two key dimensions: values and practices. Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data.
arXiv Detail & Related papers (2025-05-25T01:59:23Z)
- From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z)
- An Evaluation of Cultural Value Alignment in LLM [27.437888319382893]
We conduct the first large-scale evaluation of LLM culture, assessing 20 countries' cultures and languages across ten LLMs. Our findings show that the output across all models represents a moderate cultural middle ground. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output.
arXiv Detail & Related papers (2025-04-11T09:13:19Z)
- Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [50.16340812031201]
We show that large language models (LLMs) do not update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.
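For readers unfamiliar with the target behaviour, here is a minimal Beta-Bernoulli posterior update, the kind of textbook "optimal Bayesian model" such mimicry training could use as a teacher; the numbers are illustrative and not taken from the paper.

```python
from math import isclose

def beta_bernoulli_update(alpha: float, beta: float, observations: list[int]):
    """Update a Beta(alpha, beta) prior with 0/1 Bernoulli observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

# Start from a uniform Beta(1, 1) prior, then observe 8 heads in 10 flips.
alpha, beta = beta_bernoulli_update(1.0, 1.0, [1, 1, 1, 0, 1, 1, 1, 0, 1, 1])
posterior_mean = alpha / (alpha + beta)  # (1 + 8) / (2 + 10) = 0.75
assert isclose(posterior_mean, 0.75)
print(f"Posterior probability of heads: {posterior_mean:.2f}")
```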
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
- Large Language Models Reflect the Ideology of their Creators [71.65505524599888]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. This paper shows that the ideological stance of an LLM appears to reflect the worldview of its creators.
arXiv Detail & Related papers (2024-10-24T04:02:30Z)
- CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs [75.82306181299153]
We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for assessing cultural knowledge.
We evaluate models on two setups, CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently.
Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs, with the best-performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%.
arXiv Detail & Related papers (2024-10-03T17:04:31Z)
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
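A rough sketch of this generate-from-error-patterns loop is shown below; the prompt wording and the `generate` callable are hypothetical stand-ins, and the human annotation of gold responses described in the summary is omitted.

```python
from typing import Callable

def self_challenge(seed_failures: list[str],
                   generate: Callable[[str], str],
                   n_new: int = 5) -> list[str]:
    """Summarize error patterns from failed seeds, then generate new instances."""
    joined = "\n".join(f"- {case}" for case in seed_failures)
    pattern_prompt = (
        "The following questions were answered incorrectly. "
        f"Summarize the recurring error patterns as a short list:\n{joined}"
    )
    patterns = generate(pattern_prompt)

    instance_prompt = (
        f"Using these error patterns:\n{patterns}\n"
        f"Write {n_new} new, harder questions that exhibit the same patterns, "
        "one per line."
    )
    return generate(instance_prompt).splitlines()

# Works with any text-generation function; a trivial stub is used here.
if __name__ == "__main__":
    def stub(prompt: str) -> str:
        return "example generated question"

    print(self_challenge(["What is the 13th letter of 'unquestionably'?"], stub))
```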
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
- How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions [9.275967682881944]
It is critical to understand whether Large Language Models showcase different values to the user based on the stereotypical values of a user's known country. We prompt different LLMs with a series of advice requests based on 5 Hofstede Cultural Dimensions. We found that LLMs can differentiate between one side of a value and another, as well as understand that countries have differing values.
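As an illustration of this setup, the sketch below builds a small grid of country-conditioned advice requests, one per dimension; the dimension-to-question mapping and country list are hypothetical, not the paper's prompts.

```python
# Hypothetical advice requests, one per Hofstede dimension; neither the
# questions nor the country list are taken from the paper.
ADVICE_BY_DIMENSION = {
    "Individualism": "Should I follow my own career goals or my family's wishes?",
    "Power Distance": "Should I openly disagree with my manager in a meeting?",
    "Uncertainty Avoidance": "Should I take a stable job or join a risky startup?",
    "Masculinity": "Should I prioritise achievement or work-life balance?",
    "Long-Term Orientation": "Should I save for the future or enjoy my income now?",
}
COUNTRIES = ["Japan", "Brazil", "Sweden"]  # illustrative subset

def build_prompts() -> list[tuple[str, str, str]]:
    """Pair each country with each dimension's advice request."""
    return [
        (country, dim, f"I am from {country}. {question}")
        for country in COUNTRIES
        for dim, question in ADVICE_BY_DIMENSION.items()
    ]

for country, dim, prompt in build_prompts()[:3]:
    print(f"[{dim} | {country}] {prompt}")
```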
arXiv Detail & Related papers (2024-06-21T00:58:01Z)
- NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models [26.64843536942309]
Large language models (LLMs) may need to adapt outputs to user values and cultures, not just know about them. We introduce NormAd, an evaluation framework to assess LLMs' cultural adaptability. We create NormAd-Eti, a benchmark of 2.6k situational descriptions representing social-etiquette related cultural norms from 75 countries.
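A toy version of such an adaptability probe might vary how explicitly the relevant norm is stated and check whether the model's acceptability judgement shifts accordingly; the situation, "Country X", and context wording below are illustrative placeholders, not NormAd-Eti items.

```python
def acceptability_prompt(situation: str, context: str) -> str:
    """Ask whether a behaviour is acceptable, given optional cultural context."""
    background = f"Background: {context}\n" if context else ""
    return (
        f"{background}Situation: {situation}\n"
        "Is the behaviour socially acceptable? Answer yes, no, or neutral."
    )

# Placeholder situation and norm; these are not NormAd-Eti items.
situation = "A dinner guest finishes every last bite of food on their plate."
contexts = {
    "no_context": "",
    "country_only": "This takes place in Country X.",
    "explicit_norm": ("In Country X, leaving a small amount of food on the "
                      "plate signals that the host has provided enough."),
}

for level, context in contexts.items():
    print(f"--- {level} ---")
    print(acceptability_prompt(situation, context))
```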
arXiv Detail & Related papers (2024-04-18T18:48:50Z)
- Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models [61.45529177682614]
We challenge the prevailing constrained evaluation paradigm for values and opinions in large language models.
We show that models give substantively different answers when not forced.
We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
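The contrast the paper draws can be illustrated with two prompt templates for the same proposition, one forced-choice and one open-ended; the proposition and wording are illustrative, not the paper's survey items. The summary's point is that answers elicited by these two framings can differ substantively.

```python
# Two framings of the same proposition; wording is illustrative.
PROPOSITION = "Taxes on the wealthy should be increased."

def constrained_prompt(statement: str) -> str:
    """Forced-choice framing: the model must pick one fixed option."""
    return (
        f"Statement: {statement}\n"
        "Choose exactly one option: (A) Strongly agree (B) Agree "
        "(C) Disagree (D) Strongly disagree."
    )

def unconstrained_prompt(statement: str) -> str:
    """Open-ended framing: the model answers in its own words."""
    return f"Statement: {statement}\nWhat is your view? Answer in your own words."

print(constrained_prompt(PROPOSITION))
print()
print(unconstrained_prompt(PROPOSITION))
```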
arXiv Detail & Related papers (2024-02-26T18:00:49Z)
- Assessing LLMs for Moral Value Pluralism [2.860608352191896]
We utilize a Recognizing Value Resonance (RVR) NLP model to identify World Values Survey (WVS) values that resonate and conflict with a given passage of text.
We find that LLMs exhibit several Western-centric value biases.
Our results highlight value misalignment and age-group differences, as well as a need for social-science-informed technological solutions.
arXiv Detail & Related papers (2023-12-08T16:18:15Z)
- Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs [67.51906565969227]
We study the unintended side-effects of persona assignment on the ability of LLMs to perform basic reasoning tasks.
Our study covers 24 reasoning datasets, 4 LLMs, and 19 diverse personas (e.g. an Asian person) spanning 5 socio-demographic groups.
arXiv Detail & Related papers (2023-11-08T18:52:17Z)
- Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs).
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)