Related papers: Attributing Culture-Conditioned Generations to Pretraining Corpora

URL: http://arxiv.org/abs/2412.20760v1
Date: Mon, 30 Dec 2024 07:09:25 GMT
Title: Attributing Culture-Conditioned Generations to Pretraining Corpora
Authors: Huihan Li, Arnav Goel, Keyu He, Xiang Ren
Abstract summary: We analyze how models associate entities with cultures based on pretraining data patterns. We find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none.
Score: 26.992883552982335
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In open-ended generative tasks like narrative writing or dialogue, large language models often exhibit cultural biases, showing limited knowledge and generating templated outputs for less prevalent cultures. Recent works show that these biases may stem from uneven cultural representation in pretraining corpora. This work investigates how pretraining leads to biased culture-conditioned generations by analyzing how models associate entities with cultures based on pretraining data patterns. We propose the MEMOed framework (MEMOrization from pretraining document) to determine whether a generation for a culture arises from memorization. Using MEMOed on culture-conditioned generations about food and clothing for 110 cultures, we find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none. Additionally, the model favors generating entities with extraordinarily high frequency regardless of the conditioned culture, reflecting biases toward frequent pretraining terms irrespective of relevance. We hope that the MEMOed framework and our insights will inspire more works on attributing model performance to pretraining data.

Related papers

CRAFT: Cultural Russian-Oriented Dataset Adaptation for Focused Text-to-Image Generation [3.566419648777424]
We examine the concept of cultural code and recognize the critical importance of its understanding by modern image generation models. We propose a methodology for collecting and processing the data necessary to form a dataset based on the cultural code. Human evaluation results demonstrate an increase in the level of awareness of Russian culture in the model.
arXiv Detail & Related papers (2025-05-07T23:29:28Z)
CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are becoming more deeply integrated into human life across various regions. Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora. We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers [17.355452637877402]
We conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation.
arXiv Detail & Related papers (2025-04-03T08:52:42Z)
Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks. We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection. It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs. We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations. We discover that culture-conditioned generations consist of linguistic "markers" that distinguish marginalized cultures from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z)
Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition. Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages. Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
Cultural Bias and Cultural Alignment of Large Language Models [0.9374652839580183]
We conduct a disaggregated evaluation of cultural bias for five widely used large language models. All models exhibit cultural values resembling English-speaking and Protestant European countries. We suggest using cultural prompting and ongoing evaluation to reduce cultural bias in the output of generative AI.
arXiv Detail & Related papers (2023-11-23T16:45:56Z)
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs): LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.