IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
- URL: http://arxiv.org/abs/2511.04727v1
- Date: Thu, 06 Nov 2025 18:01:22 GMT
- Title: IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
- Authors: Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal
- Abstract summary: IndicVisionBench is the first large-scale benchmark centered on the Indian subcontinent. The benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA). In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs.
- Score: 2.697578491761838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), across 6 question types. Our final benchmark consists of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weight medium- and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
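To make the evaluation setup concrete, below is a minimal sketch of a per-language VQA scoring loop for an IndicVisionBench-style release. The JSONL schema (`image_path`, `question`, `answer`, `language`) and the `model.generate` interface are assumptions for illustration; the paper's actual release format is not specified here.

```python
# Hypothetical per-language exact-match scoring for an
# IndicVisionBench-style VQA split. The JSONL field names and the
# `model` interface are illustrative assumptions, not the paper's spec.
import json
from collections import defaultdict

def evaluate_vqa(model, benchmark_path: str) -> dict:
    """Return exact-match accuracy per language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(benchmark_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            pred = model.generate(image=ex["image_path"],
                                  prompt=ex["question"])
            total[ex["language"]] += 1
            if pred.strip().lower() == ex["answer"].strip().lower():
                correct[ex["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```

Reporting accuracy per language rather than one pooled number is what surfaces the cross-lingual gaps the paper highlights.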
Related papers
- Multimodal Evaluation of Russian-language Architectures [88.00147763684451]
We introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses text, image, audio, and video modalities. Mera Multi provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages.
arXiv Detail & Related papers (2025-11-19T15:43:53Z)
- HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples [3.3715057550177145]
We present a scalable framework to evaluate Vision-Language Models (VLMs) in Indian languages and compare their performance with English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. We find a performance regression from English to the Indian languages on 4 out of 5 tasks across all models, with an average drop of 8.3 points in Hindi and 5.5 points in Telugu.
arXiv Detail & Related papers (2025-11-19T07:11:00Z)
- TowerVision: Understanding and Improving Multilinguality in Vision-Language Models [56.775118098058506]
TowerVision is a family of open multilingual vision-language models for both image-text and video-text tasks. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches. To support further research, we publicly release all models, data, and training recipes.
arXiv Detail & Related papers (2025-10-22T17:02:48Z)
- BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models [54.16874020794336]
We introduce BLEnD-Vis, a benchmark designed to evaluate the robustness of everyday cultural knowledge in vision-language models (VLMs). BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation.
arXiv Detail & Related papers (2025-10-13T09:10:05Z)
- MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity; one plausible reading of the cross-lingual consistency metric is sketched after this entry.
arXiv Detail & Related papers (2025-10-07T14:12:12Z)
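As a rough illustration of dimension (ii), the sketch below implements one plausible reading of cross-lingual consistency: the fraction of parallel items for which a model gives the same answer in every language version. The data layout is hypothetical, and MMA-ASIA's official metric may be defined differently.

```python
# One plausible reading of "cross-lingual consistency": the share of
# parallel items answered identically in every language version.
# The data layout is a hypothetical placeholder, not MMA-ASIA's spec.
def cross_lingual_consistency(preds_by_lang: dict[str, dict[str, str]]) -> float:
    """preds_by_lang maps language -> {item_id: predicted answer}."""
    langs = list(preds_by_lang)
    item_ids = set(preds_by_lang[langs[0]])  # assumes shared item ids
    consistent = sum(
        1 for item in item_ids
        if len({preds_by_lang[lang][item] for lang in langs}) == 1
    )
    return consistent / len(item_ids)
```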
- DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture [14.681676046750342]
DRISHTIKON is a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models.
arXiv Detail & Related papers (2025-09-23T17:40:43Z)
- Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation [20.109615198034394]
We propose Kaleidoscope as the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios.
arXiv Detail & Related papers (2025-04-09T17:43:16Z)
- PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model [75.98106427999411]
We propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Models. It features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It also includes a vision setting in which text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", in line with real-world applications; a toy rendering of this setting is sketched after this entry.
arXiv Detail & Related papers (2025-03-24T09:38:37Z)
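The "vision setting" above can be pictured with a toy renderer that pastes the query text into the image itself, so the model must read and reason in one pass. This is a hedged sketch using Pillow; the font, layout, and helper name are illustrative, not PM4Bench's actual rendering pipeline.

```python
# Toy rendering of a "text embedded in image" setting: paste the query
# onto a white strip above the image. Illustrative only; not PM4Bench's
# actual pipeline.
from PIL import Image, ImageDraw, ImageFont

def embed_query_in_image(image_path: str, query: str, out_path: str,
                         bar_height: int = 80) -> None:
    """Write the query text onto a white bar above the original image."""
    img = Image.open(image_path).convert("RGB")
    canvas = Image.new("RGB", (img.width, img.height + bar_height), "white")
    canvas.paste(img, (0, bar_height))
    draw = ImageDraw.Draw(canvas)
    # The default bitmap font covers only basic Latin; swap in a Unicode
    # TTF via ImageFont.truetype to render Indic or CJK scripts.
    font = ImageFont.load_default()
    draw.text((10, 10), query, fill="black", font=font)
    canvas.save(out_path)
```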
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
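For MCQ benchmarks like CVQA, a common evaluation pattern is to format lettered options into the prompt and parse a letter out of the model's reply. The sketch below shows that pattern; the single-letter answer convention and helper names are assumptions, not CVQA's official protocol.

```python
# Generic MCQ prompting/parsing pattern for CVQA-style items. The
# single-letter answer convention is an assumption for illustration.
import string

def format_mcq_prompt(question: str, choices: list[str]) -> str:
    """Render a question with lettered options, asking for one letter."""
    lines = [question]
    lines += [f"{letter}. {choice}"
              for letter, choice in zip(string.ascii_uppercase, choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def parse_choice(reply: str, n_choices: int):
    """Return the 0-based index of the first valid option letter, else None."""
    valid = string.ascii_uppercase[:n_choices]
    for ch in reply.upper():
        if ch in valid:
            return valid.index(ch)
    return None
```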
This list is automatically generated from the titles and abstracts of the papers on this site.