CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
- URL: http://arxiv.org/abs/2404.06664v1
- Date: Wed, 10 Apr 2024 00:25:09 GMT
- Title: CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge
- Authors: Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi
- Abstract summary: We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset built from users' red-teaming attempts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current benchmark-development methods. Existing multicultural evaluations rely primarily on expensive, restricted human annotations or on potentially outdated internet resources, so they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the very biases they are meant to measure. To combine the creativity and expert cultural knowledge of human annotators with the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating, in a gamified manner, cultural questions that modern LLMs fail at. Importantly, heavier AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions and raises their perceived creativity, shedding light on the promise of deeper AI involvement in modern evaluation-dataset creation. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset built from users' red-teaming attempts, on which different families of modern LLMs achieve accuracies ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.
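The paper ships no code here, but the core loop it describes — an annotator drafts a cultural question, the target model attempts it, and an assistant LLM offers revision hints when the model succeeds — can be sketched as below. This is a minimal illustration under assumptions, not the authors' implementation; `llm` stands in for any chat-model call, and the hint prompt is a hypothetical rendering of the "LLM-generated revision hints" mode.

```python
# Minimal sketch of a CulturalTeaming-style red-teaming round (not the
# authors' code). `llm` is a placeholder for any chat-model call.
from typing import Callable

def red_team_round(
    llm: Callable[[str], str],
    question: str,
    options: list[str],
    correct_idx: int,
) -> tuple[bool, str | None]:
    """Test an annotator's cultural question against the target model.

    Returns (model_failed, revision_hint). The question becomes a
    benchmark candidate only when the model fails; otherwise an
    assistant LLM suggests how to make it harder.
    """
    prompt = question + "\n" + "\n".join(
        f"{i}. {opt}" for i, opt in enumerate(options)
    ) + "\nAnswer with the option number only."
    answer = llm(prompt).strip()

    if answer != str(correct_idx):
        return True, None  # candidate for CULTURALBENCH-V0.1

    # Hypothetical approximation of the "LLM-generated revision hints" mode.
    hint = llm(
        "An evaluated model answered this cultural question correctly:\n"
        f"{question}\nSuggest one way to revise it so it probes deeper, "
        "less widely documented cultural knowledge."
    )
    return False, hint
```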
Related papers
- Towards Geo-Culturally Grounded LLM Generations (arXiv 2025-02-19)
Generative large language models (LLMs) have been shown to have gaps in diverse cultural knowledge across the globe.
We investigate the effect of retrieval-augmented generation and search-grounding techniques on LLMs' ability to display familiarity with a diverse range of national cultures.
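As a rough illustration of the search-grounding idea in that paper (not its actual pipeline), a retrieval-augmented answer step might look like this; `retrieve` is a placeholder for any search API or vector-store lookup:

```python
# Sketch of retrieval-augmented generation for cultural questions
# (an illustration of the technique, not that paper's pipeline).
from typing import Callable

def answer_with_retrieval(
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], list[str]],  # placeholder search wrapper
    question: str,
    k: int = 3,
) -> str:
    # Fetch k culture-relevant passages, then condition the answer on them.
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return llm(
        "Use the retrieved passages to answer; say so if they are "
        f"insufficient.\n\nPassages:\n{context}\n\nQuestion: {question}"
    )
```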
- When One LLM Drools, Multi-LLM Collaboration Rules (arXiv 2025-02-06)
We argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people.
We organize existing multi-LLM collaboration methods into a hierarchy, based on the level of access and information exchange.
We envision multi-LLM collaboration as an essential path toward compositional intelligence and collaborative AI development.
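For readers unfamiliar with what "information exchange" between LLMs looks like at the API level, here is one generic pattern — independent drafts followed by a revision round — offered as a sketch, not as a method from that paper:

```python
# One simple point in the multi-LLM collaboration space: each model
# drafts an answer, then revises after seeing the others' drafts.
from typing import Callable

def collaborate(models: list[Callable[[str], str]], task: str) -> list[str]:
    # Round 1: independent drafts.
    drafts = [m(task) for m in models]
    shared = "\n\n".join(f"Model {i}: {d}" for i, d in enumerate(drafts))
    # Round 2: each model revises with full access to the exchange.
    return [
        m(f"{task}\n\nOther models answered:\n{shared}\n\n"
          "Revise your answer, keeping what the group gets right.")
        for m in models
    ]
```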
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries (arXiv 2025-01-02)
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding.
CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural categories, and 3 question types.
We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
- Toward Inclusive Educational AI: Auditing Frontier LLMs through a Multiplexity Lens (arXiv 2025-01-02)
This paper proposes a framework to assess and mitigate cultural bias within large language models (LLMs).
Our analysis reveals that LLMs frequently exhibit cultural polarization, with biases appearing in both overt and subtle contextual cues.
We propose two strategies: Contextually-Implemented Multiplex LLMs, which embed multiplex principles directly into the system prompt, and Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents, each representing distinct cultural viewpoints, collaboratively generate a balanced, synthesized response.
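A minimal sketch of the second (multi-agent) strategy, assuming a generic `llm` callable and illustrative prompts rather than the paper's exact ones:

```python
# Sketch of MAS-style multiplexity: agents primed with distinct cultural
# viewpoints answer independently, then a synthesizer merges them.
from typing import Callable

def multiplex_answer(
    llm: Callable[[str], str],
    question: str,
    viewpoints: list[str],  # e.g. ["Confucian", "Islamic", "Western liberal"]
) -> str:
    perspectives = [
        llm(f"Answer from a {v} cultural perspective:\n{question}")
        for v in viewpoints
    ]
    joined = "\n\n".join(
        f"{v} perspective:\n{p}" for v, p in zip(viewpoints, perspectives)
    )
    # Final pass: synthesize one balanced response from all viewpoints.
    return llm(
        f"Synthesize a balanced answer to '{question}' that fairly "
        f"represents each viewpoint below.\n\n{joined}"
    )
```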
- LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output (arXiv 2024-11-09)
We propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs.
We then leverage the benchmark to compare the values of Chinese and US LLMs.
Our methodology includes a novel "LLMs-as-a-Jury" pipeline which automates the evaluation of open-ended content.
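The "LLMs-as-a-Jury" idea — several judge models scoring open-ended output, with scores aggregated — might be sketched as follows; the rubric, scale, and mean aggregation are assumptions, not LLM-GLOBE's exact setup:

```python
# Hedged sketch of an "LLMs-as-a-Jury" style evaluation of open-ended text.
from typing import Callable
import re

def jury_score(judges: list[Callable[[str], str]], response: str) -> float:
    votes = []
    for judge in judges:
        raw = judge(
            "Rate the following response for cultural-value alignment "
            f"on a 1-5 scale. Reply with a single number.\n\n{response}"
        )
        # Tolerate chatty judges: pull the first digit in range.
        match = re.search(r"[1-5]", raw)
        if match:
            votes.append(int(match.group()))
    # Verdict is the mean of valid votes (NaN if no judge returned one).
    return sum(votes) / len(votes) if votes else float("nan")
```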
- Evaluating Cultural and Social Awareness of LLM Web Agents (arXiv 2024-10-30)
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments.
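A hedged sketch of the kind of check such a benchmark performs — does an agent's reply acknowledge a norm violation instead of simply complying? — with the detector prompt and yes/no scoring as illustrative assumptions:

```python
# Minimal norm-awareness probe in the spirit of CASA (not its actual code).
from typing import Callable

def norm_awareness(llm: Callable[[str], str], query: str, norm: str) -> bool:
    # Step 1: the agent under test responds to a norm-violating query.
    reply = llm(
        f"A user in a context where the norm '{norm}' applies asks: "
        f"{query}\nRespond as a web agent."
    )
    # Step 2: a judge call checks whether the reply flagged the violation.
    verdict = llm(
        f"Does this reply acknowledge that the request violates '{norm}'? "
        f"Answer yes or no.\n\nReply: {reply}"
    )
    return verdict.strip().lower().startswith("yes")
```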
- Investigating the Role of Cultural Values in Adopting Large Language Models for Software Engineering (arXiv 2024-09-08)
This study focuses on the role of professionals' cultural values in the adoption of Large Language Models (LLMs) in software development.
We found that habit and performance expectancy are the primary drivers of LLM adoption, while cultural values do not significantly moderate this process.
- Translating Across Cultures: LLMs for Intralingual Cultural Adaptation (arXiv 2024-06-20)
We define the task of cultural adaptation and create a framework to evaluate the performance of modern LLMs on it.
We analyze possible issues with automatic adaptation.
We hope that this paper will offer more insight into the cultural understanding of LLMs and their creativity in cross-cultural scenarios.
- Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense (arXiv 2024-05-07)
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
- Supervised Knowledge Makes Large Language Models Better In-context Learners (arXiv 2023-12-26)
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs by: 1) generalizing to out-of-distribution data, 2) elucidating how LLMs benefit from discriminative models, and 3) minimizing hallucinations in generative tasks.
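One way to picture "LLMs benefiting from discriminative models" is to inject a small classifier's prediction into the prompt as supervised knowledge; the sketch below is an assumption-laden illustration, not that paper's framework:

```python
# Sketch: condition an LLM on a task-specific classifier's prediction.
from typing import Callable

def answer_with_supervised_knowledge(
    llm: Callable[[str], str],
    classifier: Callable[[str], tuple[str, float]],  # (label, confidence)
    question: str,
) -> str:
    label, conf = classifier(question)
    # The classifier output is offered as evidence, not as ground truth,
    # leaving the LLM free to override a low-confidence prediction.
    return llm(
        f"Question: {question}\n"
        f"A task-specific classifier predicts '{label}' with confidence "
        f"{conf:.2f}. Use it as evidence, not ground truth, and answer "
        "the question."
    )
```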