CMMLU: Measuring massive multitask language understanding in Chinese
- URL: http://arxiv.org/abs/2306.09212v2
- Date: Wed, 17 Jan 2024 19:09:57 GMT
- Title: CMMLU: Measuring massive multitask language understanding in Chinese
- Authors: Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao
and Yeyun Gong and Nan Duan and Timothy Baldwin
- Abstract summary: This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
- Score: 133.70911295934746
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As the capabilities of large language models (LLMs) continue to advance,
evaluating their performance becomes increasingly crucial and challenging. This
paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese
benchmark that covers various subjects, including natural science, social
sciences, engineering, and humanities. We conduct a thorough evaluation of 18
advanced multilingual- and Chinese-oriented LLMs, assessing their performance
across different subjects and settings. The results reveal that most existing
LLMs struggle to achieve an average accuracy of 50%, even when provided with
in-context examples and chain-of-thought prompts, whereas the random baseline
stands at 25%. This highlights significant room for improvement in LLMs.
Additionally, we conduct extensive experiments to identify factors impacting
the models' performance and propose directions for enhancing LLMs. CMMLU fills
the gap in evaluating the knowledge and reasoning capabilities of large
language models within the Chinese context.
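CMMLU is a multiple-choice benchmark, so the evaluation described in the abstract reduces to prompt construction, answer extraction, and accuracy, with the 25% random baseline being the expected score of guessing among four options. Below is a minimal sketch of that protocol, assuming four answer options per question; the prompt template, the `model` callable, the answer-extraction regex, and the random-guess fallback are illustrative assumptions rather than the paper's exact implementation.

```python
import random
import re


def format_one(question, choices):
    """Render one question with lettered options (four-option format assumed)."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines) + "\n"


def build_prompt(question, choices, exemplars=()):
    """Build a multiple-choice prompt, optionally prepending few-shot exemplars.
    The exact wording of the template is an assumption, not the paper's prompt."""
    parts = []
    for ex_question, ex_choices, ex_answer in exemplars:
        parts.append(format_one(ex_question, ex_choices) + f"Answer: {ex_answer}\n")
    parts.append(format_one(question, choices) + "Answer:")
    return "\n".join(parts)


def extract_choice(model_output):
    """Take the first standalone A-D letter in the model's continuation."""
    match = re.search(r"\b([ABCD])\b", model_output)
    return match.group(1) if match else None


def evaluate(model, dataset, exemplars=()):
    """Accuracy of `model` (a text-in, text-out callable) on a list of
    (question, choices, answer_letter) items. Random baseline is 25%."""
    correct = 0
    for question, choices, answer in dataset:
        prediction = extract_choice(model(build_prompt(question, choices, exemplars)))
        if prediction is None:                  # unparsable output: fall back to a
            prediction = random.choice("ABCD")  # random guess (a common convention)
        correct += prediction == answer
    return correct / len(dataset)
```

Under this protocol, few-shot ("in-context examples") and chain-of-thought settings differ only in what is placed before the final "Answer:"; the scoring itself is unchanged.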
Related papers
- Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in the evaluation of Multimodal Large Language Models (MLLMs).
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs, and that up to 50% of errors can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
- Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of large language models (LLMs).
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
- Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [34.07537926291133]
CHARM is the first benchmark for comprehensive and in-depth evaluation of the commonsense reasoning ability of large language models (LLMs) in Chinese.
We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM.
Some LLMs struggle with memorizing Chinese commonsense, which affects their reasoning ability, while others show differences in reasoning ability despite similar memorization performance.
arXiv Detail & Related papers (2024-03-21T03:52:01Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain converted total scores for LLMs, including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that the model-assigned scores achieve a moderate level of consistency with human scores (one way to quantify such consistency is sketched after this entry).
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
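The grading-consistency claim in the last entry can be made concrete with a correlation check between LLM-assigned and human scores. The sketch below uses Pearson correlation as one plausible consistency measure; GAOKAO-Bench's actual metric, score scale, and data format are not given in the snippet above, so the function name and example grades here are assumptions.

```python
from statistics import mean


def pearson_correlation(llm_scores, human_scores):
    """Pearson correlation between LLM-assigned and human grades for the same
    set of subjective answers (one plausible consistency measure)."""
    assert len(llm_scores) == len(human_scores) and len(llm_scores) > 1
    mx, my = mean(llm_scores), mean(human_scores)
    cov = sum((x - mx) * (y - my) for x, y in zip(llm_scores, human_scores))
    norm_x = sum((x - mx) ** 2 for x in llm_scores) ** 0.5
    norm_y = sum((y - my) ** 2 for y in human_scores) ** 0.5
    return cov / (norm_x * norm_y)


# Hypothetical grades (0-10 scale) for five essay answers.
llm_grades = [7, 5, 8, 6, 9]
human_grades = [6, 5, 9, 6, 8]
print(f"grader-human consistency r = {pearson_correlation(llm_grades, human_grades):.2f}")
```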