EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
- URL: http://arxiv.org/abs/2403.10378v1
- Date: Fri, 15 Mar 2024 15:08:39 GMT
- Title: EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
- Authors: Rocktim Jyoti Das, Simeon Emilov Hristov, Haonan Li, Dimitar Iliyanov Dimitrov, Ivan Koychev, Preslav Nakov
- Abstract summary: EXAMS-V is a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models.
It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies.
The questions come in 11 languages from 7 language families.
- Score: 29.31649801849329
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.
Related papers
- NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models [43.98941258781775]
We introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models.
The dataset comprises 2,728 multiple-choice questions with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India.
arXiv Detail & Related papers (2024-07-15T01:21:56Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.40505206535077]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [58.92057773071854]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z)
- VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models [0.0]
This article introduces the VNHSGE dataset, developed exclusively for evaluating large language models (LLMs).
The dataset covers nine subjects and was generated from the Vietnamese National High School Graduation Examination and comparable tests.
It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics.
arXiv Detail & Related papers (2023-05-20T14:13:08Z)
- On the Hidden Mystery of OCR in Large Multimodal Models [133.09809647230475]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, on various text-related visual tasks.
Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- EVJVQA Challenge: Multilingual Visual Question Answering [1.4641199499831683]
Visual Question Answering (VQA) is a challenging task at the intersection of natural language processing (NLP) and computer vision (CV).
EVJVQA serves as the benchmark dataset for the multilingual visual question answering challenge at the 9th Workshop on Vietnamese Language and Speech Processing (VLSP 2022).
We present details of the organization of the challenge, an overview of the methods employed by shared-task participants, and the results.
arXiv Detail & Related papers (2023-02-23T02:38:39Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language for 180+ hours of video and 9,000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering [22.926709247193724]
EXAMS is a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations.
We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences.
arXiv Detail & Related papers (2020-11-05T20:06:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.