CultureLLM: Incorporating Cultural Differences into Large Language Models
- URL: http://arxiv.org/abs/2402.10946v3
- Date: Tue, 03 Dec 2024 06:52:34 GMT
- Title: CultureLLM: Incorporating Cultural Differences into Large Language Models
- Authors: Cheng Li, Mengzhou Chen, Jindong Wang, Sunayana Sitaram, Xing Xie,
- Abstract summary: CultureLLM is a cost-effective solution to incorporate cultural differences into large language models.<n>We fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering rich and low-resource languages.<n>Our human study shows that the generated samples are semantically equivalent to the original samples.
- Score: 36.66184989869121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are reported to be partial to certain cultures owing to the training data dominance from the English corpora. Since multilingual cultural data are often expensive to collect, existing efforts handle this by prompt engineering or culture-specific pre-training. However, they might overlook the knowledge deficiency of low-resource culture and require extensive computing resources. In this paper, we propose CultureLLM, a cost-effective solution to incorporate cultural differences into LLMs. CultureLLM adopts World Value Survey (WVS) as seed data and generates semantically equivalent training data via the proposed semantic data augmentation. Using only 50 seed samples from WVS with augmented data, we fine-tune culture-specific LLMs and one unified model (CultureLLM-One) for 9 cultures covering rich and low-resource languages. Extensive experiments on 60 culture-related datasets demonstrate that CultureLLM significantly outperforms various counterparts such as GPT-3.5 (by 8.1%) and Gemini Pro (by 9.5%) with comparable performance to GPT-4 or even better. Our human study shows that the generated samples are semantically equivalent to the original samples, providing an effective solution for LLMs augmentation. Code is released at https://github.com/Scarelette/CultureLLM.
Related papers
- CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) more deeply integrate into human life across various regions.
Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora.
We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z) - Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task.
We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z) - Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs [2.5212698425008377]
Large Language Models (LLMs) are becoming increasingly capable across global languages.
However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations.
We compare two families of models: Google's Gemma models and OpenAI's turbo-series.
We find no consistent relationships between language capabilities and cultural alignment.
arXiv Detail & Related papers (2025-02-23T11:02:41Z) - Self-Pluralising Culture Alignment for Large Language Models [36.689491885394034]
We propose CultureSPA, a framework that allows large language models to align to pluralistic cultures.
By comparing culture-aware/unaware outputs, we are able to detect and collect culture-related instances.
Extensive experiments demonstrate that CultureSPA significantly improves the alignment of LLMs to diverse cultures without compromising general abilities.
arXiv Detail & Related papers (2024-10-16T19:06:08Z) - Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning [13.034603322224548]
We present a simple and inexpensive method that uses a combination of in-context learning (ICL) and human survey data.
We show that our method could prove useful in test languages other than English and can improve alignment to the cultural values that correspond to a range of culturally diverse countries.
arXiv Detail & Related papers (2024-08-29T12:18:04Z) - Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models [22.92083941222383]
We introduce DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans.
We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models.
Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.
arXiv Detail & Related papers (2024-07-02T08:55:41Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generation consist of linguistic "markers" that distinguish marginalized cultures apart from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z) - Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z) - Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in
Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs)
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.