EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
- URL: http://arxiv.org/abs/2403.10378v1
- Date: Fri, 15 Mar 2024 15:08:39 GMT
- Title: EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
- Authors: Rocktim Jyoti Das, Simeon Emilov Hristov, Haonan Li, Dimitar Iliyanov Dimitrov, Ivan Koychev, Preslav Nakov
- Abstract summary: EXAMS-V is a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models.
It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies.
The questions come in 11 languages from 7 language families.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies, e.g., religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 languages from 7 language families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering school exam questions from various countries, with a variety of education systems. This distinctive approach calls for intricate reasoning across diverse languages and relies on region-specific knowledge. Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image. Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision-text models such as GPT-4V and Gemini; this underscores the inherent complexity of the dataset and its significance as a future benchmark.
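Concretely, each EXAMS-V item is a multiple-choice question whose text, options, and visual context are evaluated together, and a model is scored on whether it returns the gold option letter. The Python sketch below illustrates that evaluation protocol; the field names (question, choices, answer_key, language, subject, image_path) are illustrative assumptions, not the dataset's published schema, and a random-choice baseline stands in for a real VLM call.

```python
# Minimal sketch of a multiple-choice evaluation loop in the style of
# EXAMS-V. Field names are illustrative assumptions, not the real schema.
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ExamItem:
    question: str
    choices: list[str]             # e.g. ["A ...", "B ...", "C ...", "D ..."]
    answer_key: str                # gold option letter, e.g. "B"
    language: str                  # one of the 11 exam languages
    subject: str                   # one of the 20 school disciplines
    image_path: str | None = None  # question image with figures/tables, if any

def predict(item: ExamItem) -> str:
    """Stand-in for a real VLM call: a uniform random-choice baseline.

    A real harness would send the rendered question image (plus any prompt
    text) to a model such as GPT-4V or Gemini and parse the option letter
    out of its reply.
    """
    return random.choice("ABCDE"[: len(item.choices)])

def evaluate(items: list[ExamItem]) -> dict[str, float]:
    """Return accuracy per language, one axis EXAMS-V results are grouped by."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        total[item.language] += 1
        if predict(item) == item.answer_key:
            correct[item.language] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

if __name__ == "__main__":
    demo = [
        ExamItem(
            question="Which planet in the Solar System is the largest?",
            choices=["A Mars", "B Jupiter", "C Venus", "D Earth"],
            answer_key="B",
            language="English",
            subject="Natural Science",
        )
    ]
    print(evaluate(demo))  # e.g. {'English': 1.0} or {'English': 0.0}
```

On four-option items this random baseline floors out near 25%, which is the natural reference point when reading per-language model accuracies.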
Related papers
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- NTSEBench: Cognitive Reasoning Benchmark for Vision Language Models
We introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models.
The dataset comprises 2,728 multiple-choice questions with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India.
arXiv Detail & Related papers (2024-07-15T01:21:56Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages, spanning three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
This article introduces the VNHSGE dataset, developed exclusively for evaluating large language models (LLMs).
The dataset covers nine subjects and was generated from the Vietnamese National High School Graduation Examination and comparable tests.
It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics.
arXiv Detail & Related papers (2023-05-20T14:13:08Z)
- M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models
M3KE is a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark developed to measure knowledge acquired by Chinese large language models.
We have collected 20,477 questions from 71 tasks.
arXiv Detail & Related papers (2023-05-17T14:56:31Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, on various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- EVJVQA Challenge: Multilingual Visual Question Answering
Visual Question Answering (VQA) is a challenging task in natural language processing (NLP) and computer vision (CV).
EVJVQA is used as a benchmark dataset for the multilingual visual question answering challenge at the 9th Workshop on Vietnamese Language and Speech Processing (VLSP 2022).
We present details of the organization of the challenge, an overview of the methods employed by shared-task participants, and the results.
arXiv Detail & Related papers (2023-02-23T02:38:39Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language for 180+ hours of video and 9,000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering
EXAMS is a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations.
We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences.
arXiv Detail & Related papers (2020-11-05T20:06:50Z)