Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies
- URL: http://arxiv.org/abs/2505.14972v1
- Date: Tue, 20 May 2025 23:20:38 GMT
- Title: Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies
- Authors: Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng
- Abstract summary: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants. CROSS is a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. We evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models.
- Score: 58.88053690412802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual, visually grounded queries spanning 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%) while preserving its general multimodal capabilities, with only minimal performance reduction on standard multimodal understanding benchmarks.
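As a concrete illustration of the second enhancement strategy, the sketch below shows how contrastive response pairs could be assembled into a DPO-style preference dataset. This is a minimal sketch under stated assumptions: the `CulturalSafetyPair` schema, its field names, and the example content are hypothetical stand-ins, not the paper's actual data format or training pipeline.

```python
# Hypothetical sketch: packaging safe-vs-unsafe response pairs for
# preference tuning. Schema and example are illustrative, not from the paper.
from dataclasses import dataclass
import json

@dataclass
class CulturalSafetyPair:
    """One contrastive record: the same visually grounded query with a
    culturally safe (chosen) and a norm-violating (rejected) response."""
    image_path: str  # image whose context triggers the cultural norm
    query: str       # user request, possibly in a non-English language
    country: str     # cultural context the norm is drawn from
    chosen: str      # response that flags the norm yet stays helpful
    rejected: str    # response that ignores or violates the norm

def to_preference_record(pair: CulturalSafetyPair) -> dict:
    """Convert to the (prompt, chosen, rejected) layout that common
    preference-tuning trainers (e.g., DPO implementations) expect."""
    prompt = f"[image: {pair.image_path}]\n{pair.query}"
    return {"prompt": prompt, "chosen": pair.chosen, "rejected": pair.rejected}

if __name__ == "__main__":
    pair = CulturalSafetyPair(
        image_path="temple_entrance.jpg",
        query="Can I keep my shoes on inside?",
        country="Japan",
        chosen=("Shoes are customarily removed before entering; look for "
                "the shoe rack or the slippers provided near the entrance."),
        rejected="Sure, your shoes look fine for indoors.",
    )
    with open("cultural_preference_pairs.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(to_preference_record(pair), ensure_ascii=False) + "\n")
```

The contrastive framing is the point of such pairs: both responses answer the same query, so the preference signal isolates cultural-safety behavior rather than general answer quality.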
Related papers
- CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics [23.567641319277943]
We quantify the alignment of text-to-image (T2I) models and evaluation metrics with respect to both explicit and implicit cultural expectations. We introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. We find that T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit expectations.
arXiv Detail & Related papers (2025-06-10T14:21:46Z)
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding [79.44246283490665]
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval. RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models (a minimal sketch of this retrieval-augmented setup follows the list below).
arXiv Detail & Related papers (2025-05-20T14:57:16Z)
- CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are becoming more deeply integrated into human life across various regions. Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora. We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
- GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking [29.664707739055068]
We introduce GIMMICK, an extensive benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets. We examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues.
arXiv Detail & Related papers (2025-02-19T14:27:40Z)
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding. CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural topics, and 3 question types. We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z)
- SafeWorld: Geo-Diverse Safety Alignment [107.84182558480859]
We introduce SafeWorld, a novel benchmark specifically designed to evaluate Large Language Models (LLMs) on geo-diverse safety. SafeWorld encompasses 2,342 test user queries, each grounded in high-quality, human-verified cultural norms and legal policies from 50 countries and 493 regions/races. Our trained SafeWorldLM outperforms all competing models, including GPT-4o, on all three evaluation dimensions by a large margin.
arXiv Detail & Related papers (2024-12-09T13:31:46Z)
- Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models [4.771099208181585]
As LLMs are increasingly deployed in global applications, ensuring that users from diverse backgrounds feel respected and understood is essential. Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values. We present two key contributions: a cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and a culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators.
arXiv Detail & Related papers (2024-10-15T18:13:10Z)
- Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs): LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)
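To make the retrieval-augmented setup in the RAVENEA entry above concrete, here is a minimal sketch under loose assumptions: the document store, the word-overlap scorer, and the prompt template are toy stand-ins chosen to keep the example self-contained; the benchmark itself trains multimodal retrievers over image queries rather than relying on lexical overlap.

```python
# Toy sketch of retrieval-augmented prompting for culture-focused VQA:
# retrieve culture-relevant documents for a query and prepend them to the
# prompt. Word overlap stands in for a trained multimodal retriever.
from collections import Counter

DOCS = [
    "In Japan, shoes are removed before entering temples and many homes.",
    "Thanksgiving is celebrated in the United States in late November.",
    "In India, eating with the left hand is considered impolite in many regions.",
]

def overlap_score(query: str, doc: str) -> int:
    """Toy lexical relevance: number of shared lowercase word tokens."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(DOCS, key=lambda doc: overlap_score(query, doc), reverse=True)[:k]

def build_prompt(image_caption: str, question: str) -> str:
    """Prepend retrieved cultural context to a culture-focused VQA prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(f"{image_caption} {question}"))
    return (f"Cultural context:\n{context}\n\n"
            f"Image: {image_caption}\nQuestion: {question}\nAnswer:")

if __name__ == "__main__":
    print(build_prompt("a temple entrance in Japan", "Can I keep my shoes on?"))
```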
This list is automatically generated from the titles and abstracts of the papers on this site.