Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation
- URL: http://arxiv.org/abs/2504.07072v2
- Date: Tue, 29 Apr 2025 07:52:17 GMT
- Title: Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation
- Authors: Israfel Salazar, Manuel Fernández Burda, Shayekh Bin Islam, Arshia Soltani Moakhar, Shivalika Singh, Fabian Farestam, Angelika Romanou, Danylo Boiko, Dipika Khullar, Mike Zhang, Dominik Krzemiński, Jekaterina Novikova, Luísa Shimabucoro, Joseph Marvin Imperial, Rishabh Maheshwary, Sharad Duwal, Alfonso Amayuelas, Swati Rajwal, Jebish Purbey, Ahmed Ruby, Nicholas Popovič, Marek Suppa, Azmine Toushik Wasi, Ram Mohan Rao Kadiyala, Olga Tsymboi, Maksim Kostritsya, Bardia Soltani Moakhar, Gabriel da Costa Merlin, Otávio Ferracioli Coletti, Maral Jabbari Shiviari, MohammadAmin farahani fard, Silvia Fernandez, María Grandury, Dmitry Abulkhanov, Drishti Sharma, Andre Guarnier De Mitri, Leticia Bossatto Marchezi, Setayesh Heydari, Johan Obando-Ceron, Nazar Kohut, Beyza Ermis, Desmond Elliott, Enzo Ferrante, Sara Hooker, Marzieh Fadaee
- Abstract summary: We propose Kaleidoscope as the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios.
- Score: 20.109615198034394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and language coverage, many rely on translations of English datasets and fail to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. It covers 18 languages and 14 subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.
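As a rough illustration of the exam-style evaluation described above, the sketch below scores a vision-language model on in-language multiple-choice questions and aggregates accuracy per language. The example record fields (`language`, `question`, `options`, `answer`, `image`) and the `query_model` stub are assumptions for illustration, not the paper's actual data schema or evaluation harness.

```python
from collections import defaultdict

def format_mcq_prompt(question, options):
    """Render an in-language multiple-choice question as a single prompt string."""
    letters = "ABCDEFGH"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def query_model(image, prompt):
    """Hypothetical stub: call the VLM under test and return its raw text reply."""
    raise NotImplementedError

def evaluate(examples):
    """Compute per-language accuracy over multiple-choice examples.

    Each example is assumed to be a dict with keys 'language', 'question',
    'options' (list of str), 'answer' (gold letter, e.g. 'B'), and 'image'.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        prompt = format_mcq_prompt(ex["question"], ex["options"])
        reply = query_model(ex["image"], prompt)
        # Take the first option letter appearing in the reply as the prediction.
        predicted = next((ch for ch in reply.upper() if ch in "ABCDEFGH"), None)
        total[ex["language"]] += 1
        correct[ex["language"]] += int(predicted == ex["answer"])
    return {lang: correct[lang] / total[lang] for lang in total}
```

Reporting accuracy per language (rather than a single pooled number) is what makes the gap between high- and low-resource languages visible.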
Related papers
- GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models [11.714753007667941]
GlotEval is a lightweight framework designed for massively multilingual evaluation. It supports seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation) spanning dozens to hundreds of languages. It enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts.
arXiv Detail & Related papers (2025-04-05T12:30:58Z) - MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [60.52580061637301]
MMLU-ProX is a comprehensive benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. We evaluate 25 state-of-the-art large language models (LLMs) using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili.
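The 5-shot chain-of-thought setup mentioned above can be approximated as follows; the exemplar layout and the answer-extraction regex are illustrative assumptions, not the benchmark's official harness.

```python
import re

def build_cot_prompt(exemplars, question, options):
    """Assemble a 5-shot chain-of-thought prompt.

    `exemplars` is a list of dicts with 'question', 'options',
    'reasoning', and 'answer' (a letter); this layout is illustrative.
    """
    blocks = []
    for ex in exemplars[:5]:
        opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(ex["options"]))
        blocks.append(
            f"Question: {ex['question']}\n{opts}\n"
            f"Let's think step by step. {ex['reasoning']}\n"
            f"The answer is ({ex['answer']})."
        )
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    blocks.append(f"Question: {question}\n{opts}\nLet's think step by step.")
    return "\n\n".join(blocks)

def extract_answer(completion):
    """Pull the final answer letter out of a chain-of-thought completion."""
    match = re.search(r"answer is \(?([A-J])\)?", completion, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None
```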
arXiv Detail & Related papers (2025-03-13T15:59:20Z) - Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? [3.902360015414256]
This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages.
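For reference, reference-free CLIPScore (Hessel et al., 2021) is defined as 2.5 * max(cos(E(image), E(caption)), 0). The sketch below computes it from precomputed embeddings; which multilingual CLIP encoder produces those embeddings is left open, since comparing such encoders is exactly what the paper above studies.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cosine_similarity, 0).

    `image_emb` and `caption_emb` are embeddings from a (multilingual) CLIP
    image/text encoder; producing them is left to whichever model you evaluate.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    return w * max(float(image_emb @ caption_emb), 0.0)
```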
arXiv Detail & Related papers (2025-02-10T16:00:00Z) - Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model [66.17354128553244]
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data.
We investigate how different training mixes tip the scale for different groups of languages.
We train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
arXiv Detail & Related papers (2025-01-09T10:26:14Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
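Ad hoc retrieval quality in evaluations of this kind is commonly reported as nDCG@10. A minimal, self-contained computation for a single query is shown below; the ranked lists and graded relevance labels are illustrative placeholders, not MIRACL's official scoring code.

```python
import math

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    `ranked_doc_ids`: doc ids in the order the retrieval system returned them.
    `relevance`: dict mapping doc id -> graded relevance (0 if absent).
    """
    gains = [relevance.get(d, 0) for d in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```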
arXiv Detail & Related papers (2022-10-18T16:47:18Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME, the Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark, evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)