M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining
Large Language Models
- URL: http://arxiv.org/abs/2306.05179v2
- Date: Fri, 10 Nov 2023 04:11:01 GMT
- Authors: Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia,
Lidong Bing
- Abstract summary: M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
- Score: 76.88692952308084
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the existence of various benchmarks for evaluating natural language
processing models, we argue that human exams are a more suitable means of
evaluating general intelligence for large language models (LLMs), as they
inherently demand a much wider range of abilities such as language
understanding, domain knowledge, and problem-solving skills. To this end, we
introduce M3Exam, a novel benchmark sourced from real and official human exam
questions for evaluating LLMs in a multilingual, multimodal, and multilevel
context. M3Exam exhibits three unique characteristics: (1) multilingualism,
encompassing questions from multiple countries that require strong multilingual
proficiency and cultural knowledge; (2) multimodality, accounting for the
multimodal nature of many exam questions to test the model's multimodal
understanding capability; and (3) multilevel structure, featuring exams from
three critical educational periods to comprehensively assess a model's
proficiency at different levels. In total, M3Exam contains 12,317 questions in
9 diverse languages with three educational levels, where about 23% of the
questions require processing images for successful solving. We assess the
performance of top-performing LLMs on M3Exam and find that current models,
including GPT-4, still struggle with multilingual text, particularly in
low-resource and non-Latin script languages. Multimodal LLMs also perform
poorly with complex multimodal questions. We believe that M3Exam can be a
valuable resource for comprehensively evaluating LLMs by examining their
multilingual and multimodal abilities and tracking their development. Data and
evaluation code are available at https://github.com/DAMO-NLP-SG/M3Exam.
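As a rough illustration of how a multiple-choice benchmark like M3Exam is typically scored, the sketch below computes per-language accuracy over a set of questions. The question schema and the predictor are hypothetical stand-ins, not M3Exam's official data format or evaluation code.

```python
# Hypothetical sketch: scoring a model's answers on multiple-choice
# exam questions and reporting accuracy per language.
from collections import defaultdict

def accuracy_by_language(questions, predict):
    """questions: dicts with 'language', 'question', 'options', 'answer'.
    predict: callable mapping a question dict to a chosen option label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        lang = q["language"]
        total[lang] += 1
        if predict(q) == q["answer"]:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy usage with a dummy predictor that always answers "A".
sample = [
    {"language": "en", "question": "2+2=?", "options": ["A. 4", "B. 5"], "answer": "A"},
    {"language": "th", "question": "...", "options": ["A. x", "B. y"], "answer": "B"},
]
print(accuracy_by_language(sample, lambda q: "A"))  # {'en': 1.0, 'th': 0.0}
```

A real harness would replace the dummy predictor with a call to the model under test and aggregate further by educational level and modality.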
Related papers
- On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations? [19.346078451375693]
We present an analysis of existing evaluation frameworks in NLP.
We propose several directions for more robust and reliable evaluation practices.
We show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
arXiv Detail & Related papers (2024-06-20T12:46:12Z)
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for assessing the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias [5.104497013562654]
We present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities.
We explore widely utilized multilingual corpora for MLLMs' training and multilingual datasets oriented for downstream tasks.
We discuss bias on MLLMs including its category and evaluation metrics, and summarize the existing debiasing techniques.
arXiv Detail & Related papers (2024-04-01T05:13:56Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning [16.032320995230734]
We introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese.
CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school.
We propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions.
arXiv Detail & Related papers (2024-01-25T08:22:10Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
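The curriculum described above raises the share of non-English data from 30% at the start of pre-training to 60% at the end. A minimal sketch of such a schedule is below; the linear interpolation and step granularity are assumptions for illustration, not the paper's exact schedule.

```python
# Illustrative curriculum schedule: the fraction of non-English data
# grows from 30% at the first training step to 60% at the last.
# Linear interpolation is an assumption; PolyLM's actual schedule
# (staged proportions) may differ.
def non_english_fraction(step, total_steps, start=0.30, end=0.60):
    """Fraction of non-English samples to draw at a given training step."""
    progress = step / max(total_steps - 1, 1)
    return start + (end - start) * progress

print(round(non_english_fraction(0, 100), 2))   # 0.3 at the first step
print(round(non_english_fraction(99, 100), 2))  # 0.6 at the final step
```

In a data loader, this fraction would weight the sampling probability of non-English shards relative to English ones at each step.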
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models [35.17226595231825]
M3KE is a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark.
It is developed to measure knowledge acquired by Chinese large language models.
We have collected 20,477 questions from 71 tasks.
arXiv Detail & Related papers (2023-05-17T14:56:31Z)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [119.16007395162431]
M3P is a Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training.
We show that M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
arXiv Detail & Related papers (2020-06-04T03:54:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.