BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
- URL: http://arxiv.org/abs/2510.11178v1
- Date: Mon, 13 Oct 2025 09:10:05 GMT
- Title: BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
- Authors: Bryan Chen Zhengyu Tan, Zheng Weihua, Zhengyuan Liu, Nancy F. Chen, Hwaran Lee, Kenny Tsu Wei Choo, Roy Ka-Wei Lee,
- Abstract summary: We introduce BLEnD-Vis, a benchmark designed to evaluate the robustness of everyday cultural knowledge in vision-language models (VLMs)<n> BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats.<n>The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation.
- Score: 54.16874020794336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.
Related papers
- Vision Language Models are Confused Tourists [31.85723694463742]
We introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess Vision-Language Models (VLMs)<n>Our experiments reveal a critical vulnerability, where accuracy drops heavily under simple image-stacking perturbations and even worsens with its image-generation-based variant.<n>These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs.
arXiv Detail & Related papers (2025-11-21T07:14:46Z) - MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages.<n>This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech.<n>We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z) - Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation [2.0467354053171243]
We present the first comprehensive evaluation of Vision-Language Models (VLMs) cultural competence through multimodal story generation.<n>Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers.<n>We uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments.
arXiv Detail & Related papers (2025-08-22T19:39:02Z) - Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation [45.551223552275424]
Vision-Language Translation is a challenging task that requires accurately recognizing multilingual text embedded in images.<n>We present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics.
arXiv Detail & Related papers (2025-06-13T14:23:38Z) - RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding [79.44246283490665]
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval.<n>RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC)<n>We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-05-20T14:57:16Z) - Uncovering Cultural Representation Disparities in Vision-Language Models [45.032609066023504]
Vision-Language Models (VLMs) have demonstrated impressive capabilities across a range of tasks, yet concerns about their potential biases exist.<n>This work investigates the extent to which prominent VLMs exhibit cultural biases by evaluating their performance on an image-based country identification task at a country level.
arXiv Detail & Related papers (2025-05-20T02:04:09Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.