M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark
for Chinese Large Language Models
- URL: http://arxiv.org/abs/2305.10263v2
- Date: Sun, 21 May 2023 03:57:11 GMT
- Title: M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark
for Chinese Large Language Models
- Authors: Chuang Liu, Renren Jin, Yuqi Ren, Linhao Yu, Tianyu Dong, Xiaohan
Peng, Shuting Zhang, Jianxiang Peng, Peiyi Zhang, Qingqing Lyu, Xiaowen Su,
Qun Liu, Deyi Xiong
- Abstract summary: M3KE is a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark.
It is developed to measure knowledge acquired by Chinese large language models.
We have collected 20,477 questions from 71 tasks.
- Score: 35.17226595231825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have recently made tremendous progress on a variety of
fronts, e.g., cross-task generalization and instruction following.
Comprehensively evaluating the capability of large language models across multiple
tasks is therefore of great importance. In this paper, we propose M3KE, a Massive
Multi-Level Multi-Subject Knowledge Evaluation benchmark, developed to
measure the knowledge acquired by Chinese large language models by testing their
multitask accuracy in zero- and few-shot settings. We have collected 20,477
questions from 71 tasks. Our selection covers all major levels of the Chinese
education system, ranging from primary school to college, as well as a wide
variety of subjects, including humanities, history, politics, law, education,
psychology, science, technology, art and religion. All questions are
multiple-choice questions with four options, guaranteeing a standardized
and unified assessment process. We have assessed a number of state-of-the-art
open-source Chinese large language models on the proposed benchmark, ranging in
size from 335M to 130B parameters. Experimental results
demonstrate that they perform significantly worse than GPT-3.5, which reaches an
accuracy of ~48% on M3KE. The dataset is available at
https://github.com/tjunlp-lab/M3KE.
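Since every M3KE question is a four-option multiple-choice item (so random guessing scores 25%), zero-shot multitask accuracy can be computed by prompting a model with each question and its options and comparing the predicted option letter to the gold answer. The following Python sketch illustrates such an evaluation loop; the JSON field names ("question", "A" through "D", "answer") and the generate() callable are illustrative assumptions, not the repository's actual interface.

    # Minimal sketch of zero-shot multiple-choice evaluation on M3KE-style data.
    # Field names are assumed for illustration; check the files in
    # https://github.com/tjunlp-lab/M3KE for the actual schema and adapt accordingly.

    def format_prompt(item):
        """Build a zero-shot prompt for one four-option multiple-choice question."""
        return (
            f"问题: {item['question']}\n"
            f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n"
            "答案:"
        )

    def evaluate(tasks, generate):
        """Return per-task accuracy and the average over tasks.

        tasks:    dict mapping task name -> list of question dicts.
        generate: callable mapping a prompt string to the model's text output
                  (a stand-in for whatever model API is actually used).
        """
        per_task = {}
        for name, items in tasks.items():
            correct = 0
            for item in items:
                output = generate(format_prompt(item)).upper()
                # Take the first A/B/C/D that appears in the model output.
                pred = next((ch for ch in output if ch in "ABCD"), None)
                correct += int(pred == item["answer"].strip().upper())
            per_task[name] = correct / len(items)
        average = sum(per_task.values()) / len(per_task)
        return per_task, average

    # Example usage with a dummy model that always answers "A":
    # import json
    # tasks = {"high_school_history": json.load(open("high_school_history.json"))}
    # per_task, avg = evaluate(tasks, generate=lambda prompt: "A")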
Related papers
- M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models [27.18427414844769]
We introduce M4U, a novel benchmark for assessing the capability of multi-discipline multilingual multimodal understanding and reasoning.
M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German.
Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools.
arXiv Detail & Related papers (2024-05-24T15:25:28Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for evaluating the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark [42.91902601376494]
The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level.
SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology.
It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities.
arXiv Detail & Related papers (2024-02-06T19:16:55Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct the benchmark by extending 20 existing datasets with task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought explanations improve question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering [22.926709247193724]
EXAMS is a new benchmark dataset for cross-lingual and multilingual question answering for high school examinations.
We collected more than 24,000 high-quality high school exam questions in 16 languages, covering 8 language families and 24 school subjects from Natural Sciences and Social Sciences.
arXiv Detail & Related papers (2020-11-05T20:06:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.