LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output
- URL: http://arxiv.org/abs/2411.06032v1
- Date: Sat, 09 Nov 2024 01:38:55 GMT
- Title: LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output
- Authors: Elise Karinshak, Amanda Hu, Kewen Kong, Vishwanatha Rao, Jingren Wang, Jindong Wang, Yi Zeng,
- Abstract summary: We propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs.
We then leverage the benchmark to compare the values of Chinese and US LLMs.
Our methodology includes a novel "LLMs-as-a-Jury" pipeline which automates the evaluation of open-ended content.
- Score: 8.435090588116973
- License:
- Abstract: Immense effort has been dedicated to minimizing the presence of harmful or biased generative content and better aligning AI output to human intention; however, research investigating the cultural values of LLMs is still in very early stages. Cultural values underpin how societies operate, providing profound insights into the norms, priorities, and decision making of their members. In recognition of this need for further research, we draw upon cultural psychology theory and the empirically-validated GLOBE framework to propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs, and we then leverage the benchmark to compare the values of Chinese and US LLMs. Our methodology includes a novel "LLMs-as-a-Jury" pipeline which automates the evaluation of open-ended content to enable large-scale analysis at a conceptual level. Results clarify similarities and differences that exist between Eastern and Western cultural value systems and suggest that open-generation tasks represent a more promising direction for evaluation of cultural values. We interpret the implications of this research for subsequent model development, evaluation, and deployment efforts as they relate to LLMs, AI cultural alignment more broadly, and the influence of AI cultural value systems on human-AI collaboration outcomes.
Related papers
- Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative.
Existing evaluations focus narrowly on safety risks such as bias and toxicity.
Existing benchmarks are prone to data contamination.
The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs value alignment.
arXiv Detail & Related papers (2025-01-13T05:53:56Z) - CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding.
CultureVerse is a large-scale multimodal benchmark covering 19, 682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types.
We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z) - Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology [4.079147243688765]
Large language models (LLMs) closely interact with humans, and need an intimate understanding of the cultural values of human society.
Our analysis shows that LLMs can judge socio-cultural norms similar to humans but less so on social systems and progress.
Increasing model size helps a better understanding of social values, but smaller models can be enhanced by using synthetic data.
arXiv Detail & Related papers (2024-12-12T00:52:11Z) - Investigating the Role of Cultural Values in Adopting Large Language Models for Software Engineering [17.818350887316004]
This study focuses on the role of professionals' cultural values in the adoption of Large Language Models (LLMs) in software development.
We found that habit and performance expectancy are the primary drivers of LLM adoption, while cultural values do not significantly moderate this process.
arXiv Detail & Related papers (2024-09-08T10:58:45Z) - Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture [4.467334566487944]
This study introduces a benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge.
The study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, evaluating, and Creating.
The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge.
arXiv Detail & Related papers (2024-09-03T02:50:04Z) - Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z) - Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z) - CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset with users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z) - CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models [41.885600036131045]
CDEval is a benchmark aimed at evaluating the cultural dimensions of Large Language Models.
It is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains.
arXiv Detail & Related papers (2023-11-28T02:01:25Z) - Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
Large Language Models (LLMs) have made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.