CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models
- URL: http://arxiv.org/abs/2311.16421v3
- Date: Thu, 20 Jun 2024 11:52:47 GMT
- Title: CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models
- Authors: Yuhang Wang, Yanxu Zhu, Chao Kong, Shuyu Wei, Xiaoyuan Yi, Xing Xie, Jitao Sang,
- Abstract summary: CDEval is a benchmark aimed at evaluating the cultural dimensions of Large Language Models.
It is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains.
- Score: 41.885600036131045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.
Related papers
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages.
The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z) - LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output [8.435090588116973]
We propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs.
We then leverage the benchmark to compare the values of Chinese and US LLMs.
Our methodology includes a novel "LLMs-as-a-Jury" pipeline which automates the evaluation of open-ended content.
arXiv Detail & Related papers (2024-11-09T01:38:55Z) - Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models [4.771099208181585]
LLMs are increasingly deployed in global applications, ensuring users from diverse backgrounds feel respected and understood.
Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values.
We present two key contributions: A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and a culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators.
arXiv Detail & Related papers (2024-10-15T18:13:10Z) - Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture [4.467334566487944]
This study introduces a benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge.
The study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, evaluating, and Creating.
The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge.
arXiv Detail & Related papers (2024-09-03T02:50:04Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z) - CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover culture perceptions of three SOTA models on 110 countries and regions on 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generation consist of linguistic "markers" that distinguish marginalized cultures apart from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z) - CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset with users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z) - Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions [10.415002561977655]
This research proposes a Cultural Alignment Test (Hoftede's CAT) to quantify cultural alignment using Hofstede's cultural dimension framework.
We quantitatively evaluate large language models (LLMs) against the cultural dimensions of regions like the United States, China, and Arab countries.
Our results quantify the cultural alignment of LLMs and reveal the difference between LLMs in explanatory cultural dimensions.
arXiv Detail & Related papers (2023-08-25T14:50:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.