Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking
- URL: http://arxiv.org/abs/2402.09369v1
- Date: Wed, 14 Feb 2024 18:16:54 GMT
- Title: Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking
- Authors: Yi Fung, Ruining Zhao, Jae Doo, Chenkai Sun, Heng Ji
- Abstract summary: This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
- Score: 48.21982147529661
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pretrained large language models have revolutionized many applications but
still face challenges related to cultural bias and a lack of cultural
commonsense knowledge crucial for guiding cross-culture communication and
interactions. Recognizing the shortcomings of existing methods in capturing the
diverse and rich cultures across the world, this paper introduces a novel
approach for massively multicultural knowledge acquisition. Specifically, our
method strategically navigates from densely informative Wikipedia documents on
cultural topics to an extensive network of linked pages. Leveraging this
valuable source of data collection, we construct the CultureAtlas dataset,
which covers a wide range of sub-country level geographical regions and
ethnolinguistic groups, with data cleaning and preprocessing to ensure textual
assertion sentence self-containment, as well as fine-grained cultural profile
information extraction. Our dataset not only facilitates the evaluation of
language model performance in culturally diverse contexts but also serves as a
foundational tool for the development of culturally sensitive and aware
language models. Our work marks an important step towards deeper understanding
and bridging the gaps of cultural disparities in AI, to promote a more
inclusive and balanced representation of global cultures in the digital domain.
Related papers
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding.
CultureVerse is a large-scale multimodal benchmark covering 19, 682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types.
We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z) - Risks of Cultural Erasure in Large Language Models [4.613949381428196]
We argue for the need of metricizable evaluations of language technologies that interrogate and account for historical power inequities.
We probe representations that a language model produces about different places around the world when asked to describe these contexts.
We analyze the cultures represented in the travel recommendations produced by a set of language model applications.
arXiv Detail & Related papers (2025-01-02T04:57:50Z) - Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generation consist of linguistic "markers" that distinguish marginalized cultures apart from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z) - Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge [47.57055368312541]
We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices.
We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings.
arXiv Detail & Related papers (2024-04-10T08:49:27Z) - Investigating Cultural Alignment of Large Language Models [10.738300803676655]
We show that Large Language Models (LLMs) genuinely encapsulate the diverse knowledge adopted by different cultures.
We quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references.
We introduce Anthropological Prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment.
arXiv Detail & Related papers (2024-02-20T18:47:28Z) - Enhancing Content Moderation with Culturally-Aware Models [9.890160776193616]
This work introduces a flexible framework that enhances foundation language models with cultural knowledge.
We evaluate this framework in a case study of an online podcast platform with content spanning various regions.
arXiv Detail & Related papers (2023-12-05T00:11:09Z) - Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features [19.72091739119933]
Our study delves into the intersection of cultural features and transfer learning effectiveness.
Based on these results, we advocate for the integration of cultural information into datasets.
Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.
arXiv Detail & Related papers (2023-10-10T09:29:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.