Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries
- URL: http://arxiv.org/abs/2512.01419v1
- Date: Mon, 01 Dec 2025 08:55:41 GMT
- Authors: Tushar Pranav, Eshan Pandey, Austria Lyka Diane Bala, Aman Chadha, Indriyati Atmosukarto, Donny Soh Cheng Lock
- Abstract summary: Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases. RICE-VL is a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples -- covering True or False, Fill-in-the-Blank, and open-ended formats -- and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
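The abstract describes SEA-LAVE as scoring answers along three axes: textual accuracy, cultural alignment, and country identification. As a rough illustration only (the actual metric extends LAVE and relies on an LLM judge; the function names, scoring rules, and weights below are invented for this sketch), such a composite score might be assembled like this:

```python
# Hedged sketch of a SEA-LAVE-style composite score. The real metric uses an
# LLM judge (as in LAVE); simple rule-based stand-ins are used here purely to
# show how the three axes could be combined into one number.

def textual_accuracy(prediction: str, reference: str) -> float:
    """Token-overlap proxy for answer correctness (stand-in for an LLM judge)."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    return len(pred & ref) / len(ref) if ref else 0.0

def country_identification(prediction: str, country: str) -> float:
    """1.0 if the expected country is mentioned in the answer, else 0.0."""
    return 1.0 if country.lower() in prediction.lower() else 0.0

def sea_lave_score(prediction: str, reference: str, country: str,
                   cultural_alignment: float,
                   weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted combination of the three axes; the weights are illustrative."""
    w_text, w_cult, w_ctry = weights
    return (w_text * textual_accuracy(prediction, reference)
            + w_cult * cultural_alignment
            + w_ctry * country_identification(prediction, country))

# Example: a fully correct, culturally aligned answer scores 1.0.
score = sea_lave_score("Nasi lemak is a Malaysian dish",
                       reference="nasi lemak",
                       country="Malaysia",
                       cultural_alignment=1.0)
```

Here `cultural_alignment` is taken as an externally supplied judgment (in the paper, presumably produced by the LLM judge); nothing in this sketch should be read as the authors' actual scoring procedure.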
Related papers
- LLMs and Cultural Values: the Impact of Prompt Language and Explicit Cultural Framing
Large Language Models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages. We examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries.
arXiv Detail & Related papers (2025-11-06T02:09:29Z)
- BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
We introduce BLEnD-Vis, a benchmark designed to evaluate the robustness of everyday cultural knowledge in vision-language models (VLMs). BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation.
arXiv Detail & Related papers (2025-10-13T09:10:05Z)
- MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z)
- Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World
This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average.
arXiv Detail & Related papers (2025-09-23T17:24:14Z)
- CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification. Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z)
- Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation
We present the first comprehensive evaluation of the cultural competence of Vision-Language Models (VLMs) through multimodal story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally specific vocabulary spanning names, familial terms, and geographic markers. We also uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias that contradicts human assessments.
arXiv Detail & Related papers (2025-08-22T19:39:02Z)
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval. RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-05-20T14:57:16Z)
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding. CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural topics, and 3 question types. We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvements in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z)
- Benchmarking Vision Language Models for Cultural Understanding
This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing the cultural understanding of Vision Language Models (VLMs).
We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents.
The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions.
arXiv Detail & Related papers (2024-07-15T17:21:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.